
MongoDB at Sailthru: Scaling and Schema Design


Sailthru provides all your website email delivery needs, ensuring Inbox delivery for transactional and mass mail. Sailthru started out as a MySQL-powered transactional-mail service. Starting in 2009, we migrated to the document-oriented "nosql" database MongoDB. Moving entirely to MongoDB has allowed us to build complex user profiles to power behavioral-targeted mass emails and onsite recommendations. How and why we made the move, and how we use MongoDB today.

Published in: Technology

  1. MongoDB at Sailthru: Scaling and Schema Design
     Ian White, @eonwhite
     NoSQL Now! 8/25/11
     Sunday, August 7, 2011
  2. Sailthru
     • API-based transactional email led to...
     • Mass campaign email led to...
     • Intelligence and user behavior
     • Three engineers built the ESP we always wanted to use
     • Some Clients: Huffpo-AOL, Thrillist, Refinery 29, Flavorpill, Business Insider, Fab, Totsy, New York Observer
  3. How We Got To MongoDB from SQL
     • JSON was part of Sailthru infrastructure from start (SQL columns and S3)
     • Kept a close eye on CouchDB project
     • MongoDB felt like natural fit
     • Used for user profiles and analytics initially
     • Migrated one table at a time (very, very carefully)
  4. Sailthru Architecture
     • User interface to display stats, build campaigns and templates, etc (PHP/EC2)
     • API, link rewriting, and onsite endpoints (PHP/EC2)
     • Core mailer engine (Java/EC2 and colo)
     • Modified-postfix SMTP servers (colo)
     • 11 database servers on EC2 (for now)
  5. MongoDB Overview
     • 13 instances on EC2 (6 two-member replica sets, 1 backup server)
     • About 40 collections
     • About 1TB
     • Largest single collection is 500m docs
  6. Users are Documents
     • Users aren’t records split among multiple tables
     • End user’s lists, clickstream interests, geolocation, browser, time of day, purchase history become one ever-growing document
  7. Profiles Accessible Everywhere
     • Put abandoned shopping cart notifications within a mass email
         {if profile.purchase_incomplete}
           <p>This is what’s in your cart:</p>
           {foreach profile.purchase_incomplete.items as item}
             {item.qty} <a href="{item.url}">{item.title}</a><br/>
           {/foreach}
         {/if}
  8. Profiles Accessible Everywhere
     • Show a section of content conditional on the user’s location
         {if profile.geo.city['New York, NY US']}
           <div>Come to the New York Meetup on the 27th!</div>
         {/if}
  9. Profiles Accessible Everywhere
     • Show different content depending on user interests as measured by on-site behavior
         {select}
           {case horizon_interest(black,dark)}
             <img src="http://example.com/dress-image-black.jpg" />
           {/case}
           {case horizon_interest(green)}
             <img src="http://example.com/dress-image-green.jpg" />
           {/case}
           {case horizon_interest(purple,polka_dot,pattern)}
             <img src="http://example.com/dress-image-polkadot.jpg" />
           {/case}
         {/select}
  10. Profiles Accessible Everywhere
     • Pick top content from a data feed based on tags
         {content = horizon_select(content,10)}
         {foreach content as c}
           <a href="{c.url}">{c.title}</a><br/>
         {/foreach}
  11. Other Advantages of MongoDB
     • High performance
     • Take any parameters from our clients
     • Really flexible development
     • Great for analytics (internal and external)
     • No more downtime for schema migrations or reindexing
  12. How We Run mongod
     • mongod --dbpath /path/to/db --logpath /path/to/log/mongodb.log --logappend --fork --rest --replSet main1 --journal
     • Don’t ever run without replication
     • Don’t ever kill -9
     • Don’t run without writing to a log
     • Run behind a firewall
     • Use journaling now that it’s there
     • Use --rest, it’s handy
  13. Separate DBs By Collections
     • Lower-effort than auto-sharding
     • Separate databases for different usage patterns
     • Consider consequences of database failure/unavailability
     • But make sure your backup and monitoring strategy is prepared for multiple DBs
  14. Our Five Replica Sets
     • main: most of the stuff on the UI, lots of small/medium collections
     • horizon: realtime onsite browsing data
     • profile: user profile data (60m user docs)
     • message: last three months of emails
     • archive: emails older than three months
  15. Monitoring
     • Some stuff to monitor: faults/sec, index misses, % locked, queue size, load average
     • We check basic status once/minute on all database servers (SMS alerts if down), email warnings on thresholds every 10 minutes
     • Have been beta-ing 10gen’s MMS product
  16. Backups
     • Used to use mongodump - don’t do that anymore
     • Have single node of each replica set on a backup server
     • Two-hour slave delay
     • fsync/lock, freeze xfs file system, EBS snapshot, unfreeze, unlock
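
The ordering in the last Backups bullet matters: the unfreeze and unlock steps have to run even if the snapshot fails, or the backup node stays locked. Below is a minimal sketch of that ordering only. The five step functions are hypothetical and injected by the caller; nothing here calls MongoDB, `xfs_freeze`, or the EC2 API, and the steps are synchronous purely to keep the sketch short.

```javascript
// Run the snapshot steps in order and always undo the freeze and the lock,
// even when a later step throws. Every step is a caller-supplied function
// (hypothetical names); this sketch makes no real database or EC2 call.
function snapshotBackup({ lock, freeze, snapshot, unfreeze, unlock }) {
  lock();                 // e.g. flush to disk and block writes on the backup node
  try {
    freeze();             // e.g. freeze the xfs file system
    try {
      snapshot();         // e.g. take the EBS snapshot
    } finally {
      unfreeze();         // always thaw the file system
    }
  } finally {
    unlock();             // always release the database lock
  }
}

// Dry run that only records the order of the steps.
const order = [];
snapshotBackup({
  lock: () => order.push('lock'),
  freeze: () => order.push('freeze'),
  snapshot: () => order.push('snapshot'),
  unfreeze: () => order.push('unfreeze'),
  unlock: () => order.push('unlock'),
});
console.log(order.join(' -> '));
```

Nesting the two `try`/`finally` blocks mirrors the slide's sequence: whatever was acquired last is released first.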
  17. The Great EC2 EBS Outage Adventure
     • We survived
     • Most of our nodes unavailable for 2-4 days
     • Were able to spin up new instances from backup server, snapshots, and get operational within hours
     • Wasn’t fun
  18. DESIGN
  19. Develop Your Mental Model of MongoDB
     • You don’t need to look at the internals
     • But try to gain a working understanding of how MongoDB operates, especially RAM and indexes
  20. Big-Picture Design Questions
     • What is the data I want to store?
     • How will I want to use that data later?
     • How big will the data get?
     • If the answers are “I don’t know yet”, guess with your best YAGNI
  21. “But premature optimization is evil”
     • Knuth said that about code, which is flexible and easy to optimize later
     • Data is not as flexible as code
     • So doing some planning for performance is usually good when it comes to your data
  22. Specific MongoDB Design Questions
     • Embed vs top-level collection?
     • Denormalize (double-store data)?
     • How many/which indexes?
     • Arrays vs hashes for embedding?
     • Implicit schema (field names and types)
  23. Short Field Names?
     • Disk space: cheap
     • RAM: not cheap
     • Developer Time: expensive
     • Err towards compact, readable fieldnames
     • Might be worth writing a mapper
     • Probably wish we’d used c instead of client_id
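
One way the "mapper" idea could look: application code keeps readable names, and a small translation layer swaps in short names just before a document is stored. This is a sketch under assumptions, not Sailthru's code; the `FIELD_MAP` entries other than `client_id` to `c` are invented for illustration, and only top-level keys are handled.

```javascript
// Hypothetical mapping from readable names (used in code) to short names (stored).
// Only client_id -> c comes from the slide; the other entries are made up.
const FIELD_MAP = { client_id: 'c', email: 'e', template: 't' };

// Invert the map once so stored documents can be expanded again.
const REVERSE_MAP = Object.fromEntries(
  Object.entries(FIELD_MAP).map(([longName, shortName]) => [shortName, longName])
);

// Rename top-level keys using the given map; keys not in the map pass through.
function renameKeys(doc, map) {
  const out = {};
  for (const [key, value] of Object.entries(doc)) {
    const mapped = Object.prototype.hasOwnProperty.call(map, key) ? map[key] : key;
    out[mapped] = value;
  }
  return out;
}

const toStored = (doc) => renameKeys(doc, FIELD_MAP);
const fromStored = (doc) => renameKeys(doc, REVERSE_MAP);

const stored = toStored({ _id: 1, client_id: 76, email: 'support-alerts@sailthru.com' });
console.log(stored);             // short keys, as they would be written to the database
console.log(fromStored(stored)); // readable keys again for application code
```

A real mapper would also need to translate field names inside queries and index definitions, which this sketch leaves out.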
  24. Favor Human-Readable Foreign Keys
     • DBRefs are a bit cumbersome
     • Referencing by MongoId often means doing extra lookups
     • Build human-readable references to save you doing lookups and manual joins
  25. Example
     • Store the Template and the Email as strings on the message object
     • { template: "Internal - Blast Notify", email: "support-alerts@sailthru.com" }
     • No external reference lookups required
     • The tradeoff is basically just disk space
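
To make the lookup cost concrete, here is a small in-memory comparison of the two referencing styles. The `Map` objects stand in for collections, and the ids `t1` and `u9` are invented; only the template name and email address come from the slide.

```javascript
// In-memory stand-ins for two collections (hypothetical ids, not Sailthru's schema).
const templates = new Map([['t1', { _id: 't1', name: 'Internal - Blast Notify' }]]);
const recipients = new Map([['u9', { _id: 'u9', email: 'support-alerts@sailthru.com' }]]);

// Style A: the message references other documents by opaque id,
// so displaying it needs two additional lookups.
const messageById = { template_id: 't1', recipient_id: 'u9' };
function describeById(message) {
  const template = templates.get(message.template_id);     // extra lookup 1
  const recipient = recipients.get(message.recipient_id);  // extra lookup 2
  return `${template.name} -> ${recipient.email}`;
}

// Style B: the readable values are stored on the message itself.
const messageByName = { template: 'Internal - Blast Notify', email: 'support-alerts@sailthru.com' };
function describeByName(message) {
  return `${message.template} -> ${message.email}`;         // no lookups
}

console.log(describeById(messageById));
console.log(describeByName(messageByName));
```

Both calls produce the same string; style B gets there from the message document alone, at the price of storing the longer strings on every message.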
  26. Embed vs Top-Level Collections?
     • Major question of MongoDB schema design
     • If you can ask the question at all, you might want to err on the side of embedding
     • Don’t embed if the embedding could get huge
     • Don’t feel too bad about denormalizing by embedding AND storing in a top-level collection
  27. Typical Properties of Top-Level Collections
     • Independence: They don’t “belong” conceptually to another collection
     • Nouns: the building blocks of your system
     • Easily referenceable and updatable
  28. Embedding Pros
     • Super-fast retrieval of document with related data
     • Atomic updates
     • “Ownership” of embedded document is obvious
     • Usually maps well to code structures
  29. Embedding Cons
     • Harder to get at, do mass queries
     • Does not size up infinitely, will hit 16MB limit
     • Hard to create references to embedded object
     • Limited ability to indexed-sort the embedded objects
  30. If You Think You Can Embed
     • You probably should
     • I take advantage of embedding in my designs more often now than I did three years ago
     • It’s a gift MongoDB gives you in exchange for giving up your joins
  31. Design Example: User Permissions
     • Users can have various broad permission levels for any number of clients
     • For example, user 'ploki' might have permission level 'admin' for client 76 and permission level 'reports_only' for client 450
  32. How Will We Use This Data?
     • Retrieve all clients for a given user
     • Retrieve all users for a given client
     • Retrieve a permission level for a given client for a given user
  33. How Will This Data Grow?
     • In the medium term, it will stay small
     • Number of clients and number of users can both grow infinitely
  34. Back in SQL-land
     • There’s a fairly standard way to do it
     • It’s a many-many relationship, so
     • Use a join table (client_user)
  35. Should We Use a New Top-Level Collection?
         db.client.user.save( { client_id: 76, username: 'ploki', permission: 'admin' } );
         db.client.user.save( { client_id: 450, username: 'ploki', permission: 'reports_only' } );
         db.client.user.ensureIndex( { client_id: 1 } );
         db.client.user.ensureIndex( { username: 1 } );
         // get all users belonging to a client
         db.client.user.find( { client_id: 76 } );
         // get all clients a user has access to
         db.client.user.find( { username: 'ibwhite' } );
         // get permissions for our current user
         db.client.user.findOne( { username: user.name } );
  36. Probably Not
     • Only needed if we have lots of clients per user AND lots of users per client
     • This is a case where we can embed, so let’s do so
  37. Three Ways to Embed
     • Not good: Object
         'clients': { '76': 'admin', '450': 'reports_only' },
         can’t do a multikeys index on the keys of a hash
         index: ???
     • Okay: Array of objects
         'clients': [ {'_id': 76, 'access': 'admin'}, {'_id': 450, 'access': 'reports_only'} ]
         but have to search through array to find by _id on retrieved doc
         index: { 'clients._id': 1 }
     • Our approach: Array and object
         'clients': [ 76, 450 ],
         'clients_access': { '76': 'admin', '450': 'reports_only' }
         Fields next to each other alphabetically
         index: { clients: 1 }
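
The "array and object" layout can be checked against the three access patterns from slide 32. The sketch below runs against plain in-memory documents rather than a database. The second user document is invented for illustration, and the collection name `user` in the comment is an assumption.

```javascript
// Hypothetical user documents using the "array and object" layout.
const userDocs = [
  { username: 'ploki', clients: [76, 450], clients_access: { '76': 'admin', '450': 'reports_only' } },
  { username: 'ibwhite', clients: [76], clients_access: { '76': 'admin' } },
];

// 1. All clients for a user: read the array straight off the document.
function clientsForUser(user) {
  return user.clients;
}

// 2. All users for a client. In the mongo shell this would be something like
//      db.user.find( { clients: 76 } )
//    which an index on { clients: 1 } can serve. Here it is a plain filter.
function usersForClient(clientId) {
  return userDocs.filter((u) => u.clients.includes(clientId)).map((u) => u.username);
}

// 3. Permission level for a user on a client: a direct key lookup on the
//    already-retrieved document, with no array scan.
function permissionFor(user, clientId) {
  return user.clients_access[String(clientId)] ?? null;
}

console.log(clientsForUser(userDocs[0]));
console.log(usersForClient(76));
console.log(permissionFor(userDocs[0], 450));
```

The cost of this layout is that `clients` and `clients_access` must be updated together whenever a permission is granted or revoked.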
  38. Indexes
     • Index all highly frequent queries
     • Do less-indexed queries only on secondaries
     • Reduce the size of indexes wherever you can on big collections
     • Don’t sweat the medium-sized collections, focus on the big wins
  39. Take Advantage of Multiple-Field Indexes
     • Order matters
     • If you have an index on { client_id: 1, email: 1 }
     • Then you also have the { client_id: 1 } index “for free”
     • but not { email: 1 }
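
The prefix rule can be stated as a small function: count how many leading fields of a compound index a query actually constrains. This is a deliberately simplified model for illustration. It ignores sorts, range conditions, and whatever else the real query planner weighs, and the function name is made up.

```javascript
// Simplified model of the prefix rule: how many leading fields of a compound
// index does a query constrain? A result of 0 means the query cannot use the
// index through its prefix at all.
function usablePrefixLength(indexFields, queryFields) {
  const wanted = new Set(queryFields);
  let count = 0;
  while (count < indexFields.length && wanted.has(indexFields[count])) {
    count++;
  }
  return count;
}

const compoundIndex = ['client_id', 'email']; // { client_id: 1, email: 1 }

console.log(usablePrefixLength(compoundIndex, ['client_id']));          // client_id alone
console.log(usablePrefixLength(compoundIndex, ['client_id', 'email'])); // both fields
console.log(usablePrefixLength(compoundIndex, ['email']));              // email alone
```

The first two calls report a usable prefix; the third reports none, which is the "but not { email: 1 }" case on the slide.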
  40. Use your _id
     • You must use an _id for every collection, which will cost you index size
     • So do something useful with _id
  41. Take advantage of fast ^indexes
     • Messages have _ids like: 32423.00000341
     • Need all messages in blast 32423:
     • db.message.blast.find( { _id: /^32423./ } );
     • (Yeah, I know the . is ugly. Don’t use a dot if you do this.)
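
The dot is ugly because an unescaped `.` in a regular expression matches any character, so `/^32423./` would also match an _id belonging to a blast numbered 324230. A minimal sketch of building the anchored prefix pattern with the separator escaped follows; the helper names are hypothetical.

```javascript
// Escape regex metacharacters so a string is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build an anchored, case-sensitive prefix pattern for a blast id.
// The separator defaults to '.', mirroring the _id format on the slide.
function blastPrefix(blastId, separator = '.') {
  return new RegExp('^' + escapeRegExp(String(blastId) + separator));
}

const pattern = blastPrefix(32423);
// In the mongo shell the pattern would be passed as the _id condition:
//   db.message.blast.find( { _id: pattern } );

console.log(pattern.test('32423.00000341'));        // an _id from blast 32423
console.log(pattern.test('324230.00000341'));       // an _id from blast 324230
console.log(/^32423./.test('324230.00000341'));     // the unescaped version on the same _id
```

The escaped pattern rejects the 324230 _id, while the unescaped one from the slide accepts it.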
  42. Manual Range Partitioning
     • We moved a big message.blast collection into per-day collections:
     • message.blast.20110605 message.blast.20110606 message.blast.20110607 etc...
     • Keeps working set indexes smaller
     • When we move data into the archive, drop() is much faster than remove()
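
With per-day collections, application code has to work out which collection a date maps to and which collections a date range spans. A minimal sketch of that naming logic is below. The use of UTC for the day boundary is an assumption (the slides do not say which time zone is used), and the helper names are made up.

```javascript
// Format a Date as YYYYMMDD. UTC is an assumption, not stated in the slides.
function dayStamp(date) {
  const year = date.getUTCFullYear();
  const month = String(date.getUTCMonth() + 1).padStart(2, '0');
  const day = String(date.getUTCDate()).padStart(2, '0');
  return `${year}${month}${day}`;
}

// Name of the per-day collection that holds a given date's messages.
function blastCollectionName(date) {
  return `message.blast.${dayStamp(date)}`;
}

// Collection names for an inclusive date range, one per day.
function blastCollectionsBetween(start, end) {
  const names = [];
  const cursor = new Date(Date.UTC(start.getUTCFullYear(), start.getUTCMonth(), start.getUTCDate()));
  while (cursor <= end) {
    names.push(blastCollectionName(cursor));
    cursor.setUTCDate(cursor.getUTCDate() + 1);
  }
  return names;
}

const from = new Date(Date.UTC(2011, 5, 5)); // June 5, 2011 (months are zero-based)
const to = new Date(Date.UTC(2011, 5, 7));   // June 7, 2011
console.log(blastCollectionsBetween(from, to));
```

Archiving a day then comes down to copying that one collection elsewhere and dropping it, for example with `db.getCollection(name).drop()` in the mongo shell, instead of running `remove()` over a slice of one large collection.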
  43. Questions?
     Looking for a job?
     ian@sailthru.com
     twitter.com/eonwhite
