Flowdock's full-text search with MongoDB

Otto Hilska's presentation about Flowdock's full-text search with MongoDB. San Francisco MongoDB meetup in June 2011.



Usage Rights

© All Rights Reserved


Presentation Transcript

  • Full-text search with MongoDB. Otto Hilska, @mutru / @flowdock. Thursday, July 7, 2011.
  • APIdock.com is one of the services we’ve created for the Ruby community: a social documentation site.
  • We did some “research” about real-time web back in 2008. At the same time, we did software consulting for large companies. Flowdock is a product spinoff from our consulting company. It’s Google Wave done right, with a focus on technical teams.
  • Flowdock combines a group chat (on the right) with a shared team inbox (on the left). Our promise: teams stay up-to-date, react in seconds instead of hours, and never forget anything.
  • Flowdock gets messages from various external sources (like JIRA, Twitter, GitHub, Pivotal Tracker, emails, RSS feeds) and from the Flowdock users themselves.
  • All of the highlighted areas are objects in the “messages” collection. MongoDB’s document model is perfect for our use case, where various data formats (tweets, emails, ...) are stored inside the same collection.
  • This is what a typical message looks like.
  • {
      "_id": ObjectId("4de92cd0097580e29ca5b6c2"),
      "id": NumberLong(45967),
      "app": "chat",
      "flow": "demo:demoflow",
      "event": "comment",
      "sent": NumberLong("1307126992832"),
      "attachments": [],
      "_keywords": ["good", "point", ...],
      "uuid": "hC4-09hFcULvCyiU",
      "user": "1",
      "content": {
        "text": "Good point, I’ll mark it as deprecated.",
        "title": "Updated JIRA integration API"
      },
      "tags": ["influx:45958"]
    }
  • Browser (jQuery + UI, Comet implementation, MVC implementation), Rails app, Scala backend; components include the website, messages, admin, who’s online, payments, API, account management, RSS feeds, SMTP server and the Twitter feed; data stores are PostgreSQL and MongoDB. An overview of the Flowdock architecture: most of the code is JavaScript and runs inside the browser. The Scala (+Akka) backend does all the heavy lifting (mostly related to messages and online presence), and the Ruby on Rails application handles all the easy stuff (public website, account management, administration, payments, etc.). We used PostgreSQL in the beginning, and migrated messages to MongoDB. Otherwise there is no particular reason why we couldn’t use MongoDB for everything.
  • db.messages.ensureIndex({flow: 1, tags: 1, id: -1});
    db.messages.find({flow: 123, tags: {$all: ["production"]}}).sort({id: -1});
    One of the key features in Flowdock is tagging. For example, if I’m doing some changes to our production environment, I mention it in the chat and tag it as #production. Production deployments are automatically tagged with the same tag, so I can easily get a log of everything that’s happened. It’s very easy to implement with MongoDB, since we just index the “tags” array and search using it.
  • https://jira.mongodb.org/browse/SERVER-380 There’s a JIRA ticket about full-text search for MongoDB. Users have built lots of their own implementations, but the discussion continues.
  • Library support: stemming, ranked probabilistic search, synonyms, spelling corrections, and Boolean, phrase and word-proximity queries. These are some of the features you might see in an advanced full-text search implementation. There are libraries to do some parts of this (like libraries specific to stemming), and more advanced search libraries like Lucene and Xapian. Lucene is a Java library (also ported to C++ etc.), and Xapian is a C++ library. Many of these are hackable with MongoDB by expanding the query.
  • Standalone servers: typically Lucene-based, with features such as Lucene queries, MySQL integration, rich document support, a REST/JSON API, real-time indexing, result highlighting and distributed searching. You can use the libraries directly, but they don’t do anything to guarantee replication and scaling. Standalone implementations usually handle clustering, query processing and some more advanced features.
  • Things to consider: data access patterns, technology stack, data duplication, and use cases (need to search Word documents? need to support Boolean queries? ...). When choosing your solution, you’ll want to keep it simple, and consider how write-heavy your app is, what special features you need, and whether you can afford to store the data 3 times in a MongoDB replica set plus 2 times in a search server, etc.
  • Real-time search vs. performance. There are tons of use cases where search doesn’t need to be real-time. It’s a requirement that will heavily impact your application.
  • KISS. Even Facebook does. As a lean startup, we can’t afford to spend a lot of time on technology adventures; we need to measure what customers want. Many of the features are possible to achieve with MongoDB. Facebook’s messages search also only matches exact words (= it sucks), and people don’t complain. So we built a minimal implementation with MongoDB. No stemming or anything, just a keyword search, but it needs to be real-time.
  • “Good point. I’ll mark it as deprecated.” → _keywords: [“good”, “point”, “mark”, “deprecated”]. You need client-side code for this transformation. What’s possible: stemming, search by the beginning of a word. What’s not possible: intelligent ranking on the DB side (which is OK for us, since we want to sort results by time anyway).
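The client-side transformation from message text to _keywords can be sketched roughly like this. This is a minimal, hypothetical version for illustration only: the tokenizer, the stop-word list and the minimum token length are assumptions, not Flowdock’s actual code.

```javascript
// Words too common to be useful as search keywords.
// The exact stop-word list is an assumption for illustration.
const STOPWORDS = new Set(["the", "and", "for", "with", "that", "this"]);

// Turn free-form message text into a deduplicated keyword array,
// suitable for storing in the _keywords field of a message document.
function extractKeywords(text) {
  const seen = new Set();
  const keywords = [];
  const tokens = text
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, " ") // strip punctuation, collapse separators
    .split(/\s+/);
  for (const token of tokens) {
    if (token.length < 3) continue;     // drop "it", "as", stray fragments
    if (STOPWORDS.has(token)) continue; // drop common filler words
    if (seen.has(token)) continue;      // keep each keyword only once
    seen.add(token);
    keywords.push(token);
  }
  return keywords;
}

// extractKeywords("Good point. I'll mark it as deprecated.")
//   -> ["good", "point", "mark", "deprecated"]
```

Stemming each token before storing it would be a natural extension, and would also shrink the _keywords index.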
  • db.messages.ensureIndex({flow: 1, _keywords: 1, id: -1});
    Simply build the _keywords index the same way we already had our tags indexed.
  • db.messages.find({flow: 123, _keywords: {$all: ["hello", "world"]}}).sort({id: -1});
    Search is also trivial to implement. As said, our users want the messages in chronological order, which makes this a lot easier.
  • That’s it! Let’s take it to production. A minimal search implementation is the easy part. We faced quite a few operational issues when deploying it to production.
  • Index size: 2500 MB per 1M messages. As it turns out, the _keywords index is pretty big.
  • [Chart: at 10M messages, size in gigabytes of the messages collection vs. the keywords, tags and other indices.] It would be great to fit the indices in memory; now they obviously don’t. Stemming would reduce the index size. This has implications for insert/update performance, for example.
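As a rough sanity check of those sizes (a back-of-the-envelope extrapolation only, assuming the 2500 MB per 1M messages figure above scales linearly):

```javascript
// Extrapolate the _keywords index size from the measured ratio.
const keywordIndexMbPerMillionMessages = 2500; // figure from the slide above
const messagesInMillions = 10;

const keywordIndexGb =
  (keywordIndexMbPerMillionMessages * messagesInMillions) / 1024;

// Roughly 24.4 GB for the _keywords index alone at 10M messages,
// which, as the notes say, clearly does not fit in memory.
```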
  • Option #1: Just generate _keywords and build the index in the background. The naive solution: try to do it with no downtime. Didn’t work; the site slowed down too much.
  • Option #2: Try to do it during a 6-hour service break. It worked much faster when our users weren’t constantly accessing the data, but 6 hours during a weekend wasn’t enough, and we had to cancel the migration.
  • Option #3: Delete _keywords, build the index, and re-generate keywords in the background. Generating an index is much faster when there is no data to index. Building the index was fine, but generating keywords was very slow and took the site down.
  • Option #4: As previously, but add sleep(5). You know you’re a great programmer when you’re adding sleep()s to your production code.
  • Option #5: As previously, but add write concerns. Let the queries block, so that if MongoDB slows down, the migration script doesn’t flood the server. Yeah: it would’ve taken a month, or it would’ve slowed down the service.
  • Option #6: Shard. Would have been a solution, but we didn’t want to host all that data in memory, since it’s not accessed that often.
  • Option #7: SSD! We had the possibility to try it on an SSD disk. This is not a viable alternative for AWS users, but they could temporarily shard their data to a large number of high-memory instances.
  • My reactions to using SSD: decided to benchmark it.
  • 10M messages in 100 “flows”, 100k each. Total data size: 19.67 GB. Indices: {_id: 1}; {flow: 1, app: 1, id: -1}; {flow: 1, event: 1, id: -1}; {flow: 1, id: -1}; {flow: 1, tags: 1, id: -1}; {flow: 1, _keywords: 1, id: -1}. Total index size: 22.03 GB. This is the starting point for my next benchmark. I wanted to test it with a real-size database, instead of starting from scratch.
  • [Chart: mongorestore time in minutes, SSD vs. SATA.] First I used mongorestore to populate the test database: 133 vs. 235 minutes. Index generation is mostly CPU-bound, so it doesn’t really benefit from the faster seek times.
  • Insert performance test: a total of 100 workspaces, with 3 workers each accessing 30 workspaces and performing 1000 inserts to each = 90,000 inserts, as quickly as possible.
  • [Chart: insert benchmark, time in minutes, SSD vs. SATA.] 4.25 vs. 155: that’s 4 minutes vs. 2.5 hours.
  • 9.67 inserts/sec vs. 352.94 inserts/sec. The same numbers as inserts/sec.
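Those rates follow directly from the benchmark parameters (90,000 total inserts; roughly 4.25 minutes on SSD vs. 155 minutes on SATA):

```javascript
// Derive inserts/sec and the speedup from the benchmark timings above.
const totalInserts = 90000; // 3 workers x 30 workspaces x 1000 inserts
const ssdMinutes = 4.25;
const sataMinutes = 155;

const ssdRate = totalInserts / (ssdMinutes * 60);   // ~352.94 inserts/sec
const sataRate = totalInserts / (sataMinutes * 60); // ~9.68 inserts/sec
const speedup = sataMinutes / ssdMinutes;           // ~36x, as on the next slide
```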
  • 36x. A 36x performance improvement with SSD, so we ended up using it in production.
  • It works well, and searches all kinds of content (here, Git commit messages and deployment emails); queries typically take only tens of milliseconds at most.
  • Questions / comments? @flowdock / otto@flowdock.com. This was a very specific full-text search implementation; the fact that we didn’t need to rank search results made it trivial. I’m happy to discuss other use cases. Please share your thoughts and experiences.