Full-text search with MongoDB
Otto Hilska, @mutru
@flowdock




Thursday, July 7, 2011                              1

APIdock.com is one of the services we’ve created for the Ruby community: a social
documentation site.

- We did some “research” on the real-time web back in 2008.
- At the same time, we did software consulting for large companies.
- Flowdock is a product spinoff from our consulting company. It’s Google Wave done right,
with a focus on technical teams.

Flowdock combines a group chat (on the right) with a shared team inbox (on the left).

Our promise: Teams stay up-to-date, react in seconds instead of hours, and never forget
anything.

Flowdock gets messages from various external sources (like JIRA, Twitter, GitHub, Pivotal
Tracker, emails, RSS feeds) and from the Flowdock users themselves.

All of the highlighted areas are objects in the “messages” collection. MongoDB’s document
model is perfect for our use case, where various data formats (tweets, emails, ...) are stored
inside the same collection.

This is what a typical message looks like.
{
  "_id": ObjectId("4de92cd0097580e29ca5b6c2"),
  "id": NumberLong(45967),
  "app": "chat",
  "flow": "demo:demoflow",
  "event": "comment",
  "sent": NumberLong("1307126992832"),
  "attachments": [],
  "_keywords": [
    "good",
    "point", ...
  ],
  "uuid": "hC4-09hFcULvCyiU",
  "user": "1",
  "content": {
    "text": "Good point, I'll mark it as deprecated.",
    "title": "Updated  JIRA integration API"
  },
  "tags": [
    "influx:45958"
  ]
}

Browser
  jQuery (+UI)
  Comet impl.
  MVC impl.

Rails app                Scala backend
  Website                  Messages
  Admin                    Who’s online
  Payments                 API
  Account mgmt             RSS feeds
                           SMTP server
                           Twitter feed

PostgreSQL               MongoDB


An overview of the Flowdock architecture: most of the code is JavaScript and runs inside the
browser.

The Scala (+Akka) backend does all the heavy lifting (mostly related to messages and online
presence), and the Ruby on Rails application handles all the easy stuff (public website,
account management, administration, payments etc).

We used PostgreSQL in the beginning, and migrated messages to MongoDB. Otherwise there
is no particular reason why we couldn’t use MongoDB for everything.

One of the key features in Flowdock is tagging. For example, if I’m making some changes to
our production environment, I mention it in the chat and tag it as #production. Production
deployments are automatically tagged with the same tag, so I can easily get a log of
everything that’s happened.

It’s very easy to implement with MongoDB, since we just index the “tags” array and search
using it.
db.messages.ensureIndex({flow: 1, tags: 1, id: -1});

db.messages.find({flow: 123, tags: {$all: ["production"]}})
           .sort({id: -1});

https://jira.mongodb.org/browse/SERVER-380





There’s a JIRA ticket about full-text search for MongoDB.
Users have built lots of their own implementations, but the discussion continues.
Library support
                     • Stemming
                     • Ranked probabilistic search
                     • Synonyms
                     • Spelling corrections
                     • Boolean, phrase, word proximity queries



These are some of the features you might see in an advanced full-text search
implementation. There are libraries to do some parts of this (like libraries specific to
stemming), and more advanced search libraries like Lucene and Xapian.

Lucene is a Java library (also ported to C++ etc.), and Xapian is a C++ library.

Many of these are hackable with MongoDB by expanding the query.
Standalone server        Standalone server        Standalone server
Lucene based             Lucene queries           MySQL integration
Rich document support    REST/JSON API            Real-time indexing
Result highlighting      Real-time indexing       Distributed searching
Distributed              Distributed





You can use the libraries directly, but they don’t do anything to guarantee replication &
scaling.

Standalone implementations usually handle clustering, query processing and some more
advanced features.
Things to consider
                     • Data access patterns
                     • Technology stack
                     • Data duplication
                     • Use cases: need to search Word
                         documents? Need to support boolean
                         queries? ...



When choosing your solution, you’ll want to keep it simple. Consider how write-heavy your
app is, what special features you need, and whether you can afford to store the data 3 times
in a MongoDB replica set plus 2 times in a search server.
Real-time search                    Performance





There are tons of use cases where search doesn’t need to be real-time. It’s a requirement
that will heavily impact your application.
KISS
Even Facebook does.

As a lean startup, we can’t afford to spend a lot of time on technology adventures. We need to
measure what customers want.
Many of the features are possible to achieve with MongoDB.
Facebook’s message search also matches only exact words (= it sucks), and people don’t
complain.

So we built a minimal implementation with MongoDB. No stemming or anything, just a
keyword search, but it needs to be real-time.
“Good point. I’ll mark it as deprecated.”

_keywords: ["good", "point", "mark", "deprecated"]


You need client-side code for this transformation.
What’s possible: stemming, searching by the beginning of a word.
What’s not possible: intelligent ranking on the DB side (which is OK for us, since we want to
sort results by time anyway).
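
The transformation could be sketched in client-side JavaScript roughly as follows. This is a minimal sketch, not Flowdock's actual code: the function name `extractKeywords`, the three-character minimum length and the stop-word list are all assumptions, chosen so that the example sentence from the slide produces the keyword list shown on the slide.

```javascript
// Hypothetical stop-word list; a real one would be longer and language-specific.
const STOP_WORDS = new Set(["the", "and", "for", "but", "not", "with"]);

// Assumed rule: drop tokens shorter than 3 characters ("i", "ll", "it", "as").
const MIN_LENGTH = 3;

function extractKeywords(text) {
  const tokens = text
    .toLowerCase()
    .split(/[^a-z0-9]+/) // split on anything that is not an ASCII letter or digit
    .filter((t) => t.length >= MIN_LENGTH && !STOP_WORDS.has(t));
  // A Set removes duplicates while keeping first-seen order.
  return [...new Set(tokens)];
}

console.log(extractKeywords("Good point, I'll mark it as deprecated."));
```

For the sample sentence this yields the four keywords good, point, mark and deprecated. Note that the ASCII-only split would break up words containing other letters (for example ä or ö), so a real tokenizer would need a wider character class.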
db.messages.ensureIndex({
  flow: 1,
  _keywords: 1,
  id: -1
});


Simply build the _keywords index the same way we already had our tags indexed.
db.messages.find({
  flow: 123,
  _keywords: {
    $all: ["hello", "world"]
  }
}).sort({id: -1});


Search is also trivial to implement. As said, our users want the messages in chronological
order, which makes this a lot easier.
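
A small sketch of how the query document could be built from raw user input. The helper name `buildSearchQuery` and its tokenizing rule are assumptions for illustration; the only requirement from the approach above is that search terms are normalized the same way as the stored `_keywords`, otherwise exact matching finds nothing.

```javascript
// Turn raw user input into the filter and sort documents used in the query above.
// `buildSearchQuery` is a hypothetical helper, not part of any MongoDB driver.
function buildSearchQuery(flow, input) {
  const terms = input
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 0);
  if (terms.length === 0) {
    return null; // nothing to search for
  }
  return {
    filter: { flow: flow, _keywords: { $all: [...new Set(terms)] } },
    sort: { id: -1 }, // newest first, matching the index order
  };
}

const q = buildSearchQuery(123, "Hello, world");
// In the mongo shell this would be used as:
//   db.messages.find(q.filter).sort(q.sort)
console.log(JSON.stringify(q));
```

Because every term must be present (`$all`) and the sort key is the last field of the index, the results come back newest first, which is what the users want here.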
That’s it! Let’s take it to production.





A minimal search implementation is the easy part. We faced quite a few operational issues
when deploying it to production.
Index size:
2500 MB per 1M messages


As it turns out, the _keywords index is pretty big.
[Bar chart: 10M messages, size in gigabytes. Bars: Messages, Index: Keywords, Index: Tags, Index: Others. Y-axis from 0 to 20.]

It would be great to fit the indices in memory. Right now they obviously don’t fit. Stemming
would reduce the index size.
The index size also has implications for insert/update performance, for example.
Option #1:
Just generate _keywords and build the index in the background.


The naive solution: try to do it with no downtime. Didn’t work, site slowed down too much.
Option #2:
Try to do it during a 6 hour service break.


It worked much faster when our users weren’t constantly accessing the data. But 6 hours
during a weekend wasn’t enough, and we had to cancel the migration.
Option #3:
Delete _keywords, build the index and re-generate keywords in the background.


Generating an index is much faster when there is no data to index. Building the index was
fine, but generating keywords was very slow and took the site down.
Option #4:
As previously, but add sleep(5).


You know you’re a great programmer when you’re adding sleep()s to your production code.
Option #5:
As previously, but add Write Concerns.


Let the queries block, so that if MongoDB slows down, the migration script doesn’t flood the
server.

Yeah, it would’ve taken a month, or it would’ve slowed down the service.
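
Options #4 and #5 both amount to applying back-pressure in the migration script. A rough sketch of the idea follows, under clearly stated assumptions: `updateKeywords(id, keywords)` is a hypothetical function that returns a promise resolving once the server has acknowledged the write (it is not a real driver call), and `toKeywords` is whatever tokenizer the application uses.

```javascript
// Resolve after `ms` milliseconds; used for the crude fixed pause of option #4.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Backfill _keywords one message at a time.
// `updateKeywords` is a hypothetical stand-in for an acknowledged write.
async function backfillKeywords(messages, toKeywords, updateKeywords, pauseMs = 0) {
  let done = 0;
  for (const message of messages) {
    // Waiting for the acknowledgement means a slow server slows the script down
    // instead of being flooded with queued writes (option #5).
    await updateKeywords(message.id, toKeywords(message.text));
    done += 1;
    if (pauseMs > 0) {
      await sleep(pauseMs); // fixed pause between writes (option #4)
    }
  }
  return done;
}
```

Either throttle protects the live service only by making the migration slower, which is the trade-off described above.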
Option #6:
Shard.


Would have been a solution, but we didn’t want to host all that data in-memory, since it’s not
accessed that often.
Option #7:
SSD!


We had the possibility to try it on an SSD.

This is not a viable option for AWS users, but AWS users could temporarily shard their data
across a large number of high-memory instances.

My reactions to using SSD. Decided to benchmark it.
Messages
  10M messages in 100 “flows”, 100k each
  Total size 19.67 GB

Indices
  _id: 1
  flow: 1, app: 1, id: -1
  flow: 1, event: 1, id: -1
  flow: 1, id: -1
  flow: 1, tags: 1, id: -1
  flow: 1, _keywords: 1, id: -1
  Total size 22.03 GB


This is the starting point for my next benchmark. Wanted to test it with a real-size database,
instead of starting from scratch.
[Bar chart: mongorestore time in minutes, SSD vs. SATA.]


First used mongorestore to populate the test database.
133 vs. 235 minutes, and index generation is mostly CPU-bound.
Doesn’t really benefit from the faster seek times.
Insert performance test

A total of 100 workspaces
And 3 workers each accessing 30 workspaces
Performing 1000 inserts to each

= 90 000 inserts, as quickly as possible

[Bar chart: insert benchmark, time in minutes, SSD vs. SATA.]


4.25 vs 155. That’s 4 minutes vs. 2.5 hours.
9.67 inserts/sec vs. 352.94 inserts/sec


The same numbers as inserts/sec.
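
The conversion can be checked with a few lines of arithmetic, using only the numbers from the slides (90 000 inserts, 4.25 minutes on SSD, 155 minutes on SATA). The function name `insertsPerSecond` is just for illustration.

```javascript
// Numbers taken from the slides: 3 workers x 30 workspaces x 1000 inserts each.
const totalInserts = 3 * 30 * 1000;

function insertsPerSecond(inserts, minutes) {
  return inserts / (minutes * 60);
}

const ssd = insertsPerSecond(totalInserts, 4.25);
const sata = insertsPerSecond(totalInserts, 155);

console.log(ssd.toFixed(2));          // SSD throughput in inserts/sec
console.log(sata.toFixed(2));         // SATA throughput in inserts/sec
console.log((ssd / sata).toFixed(1)); // speedup factor
```

This gives about 352.94 inserts/sec for the SSD and about 9.68 for SATA (the slide shows the latter truncated to 9.67), a ratio of roughly 36.5.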
36x

36x performance improvement with SSD. So we ended up using it in production.

Works well, searches all kinds of content (here Git commit messages and deployment
emails), and queries typically take tens of milliseconds at most.
Questions / Comments?
                          @flowdock / otto@flowdock.com





This was a very specific full-text search implementation. The fact that we didn’t need to rank
search results made it trivial.

I’m happy to discuss other use cases. Please share your thoughts and experiences.

Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 

Flowdock's full-text search with MongoDB

  • 1. Full-text search with MongoDB Otto Hilska, @mutru @flowdock Thursday, July 7, 2011 1
  • 2. APIdock.com is one of the services we’ve created for the Ruby community: a social documentation site.
  • 3. We did some “research” on the real-time web back in 2008. At the same time, we did software consulting for large companies. Flowdock is a product spin-off from our consulting company. It’s Google Wave done right, with a focus on technical teams.
  • 4. Flowdock combines a group chat (on the right) with a shared team inbox (on the left). Our promise: teams stay up-to-date, react in seconds instead of hours, and never forget anything.
  • 5. Flowdock gets messages from various external sources (like JIRA, Twitter, GitHub, Pivotal Tracker, emails, RSS feeds) and from the Flowdock users themselves.
  • 6. All of the highlighted areas are objects in the “messages” collection. MongoDB’s document model is perfect for our use case, where various data formats (tweets, emails, ...) are stored inside the same collection.
  • 10. This is what a typical message looks like.
  • 11. A typical message document:
        {
          "_id": ObjectId("4de92cd0097580e29ca5b6c2"),
          "id": NumberLong(45967),
          "app": "chat",
          "flow": "demo:demoflow",
          "event": "comment",
          "sent": NumberLong("1307126992832"),
          "attachments": [],
          "_keywords": ["good", "point", ...],
          "uuid": "hC4-09hFcULvCyiU",
          "user": "1",
          "content": {
            "text": "Good point, I'll mark it as deprecated.",
            "title": "Updated JIRA integration API"
          },
          "tags": ["influx:45958"]
        }
  • 12. Architecture diagram. Browser: jQuery (+UI), Comet impl., MVC impl. Rails app: Website, Admin, Payments, Account mgmt. Scala backend: Messages, Who’s online, API, RSS feeds, SMTP server, Twitter feed. Data stores: PostgreSQL, MongoDB. An overview of the Flowdock architecture: most of the code is JavaScript and runs inside the browser. The Scala (+Akka) backend does all the heavy lifting (mostly related to messages and online presence), and the Ruby on Rails application handles all the easy stuff (public website, account management, administration, payments etc.). We used PostgreSQL in the beginning, and migrated messages to MongoDB. Otherwise there is no particular reason why we couldn’t use MongoDB for everything.
  • 13. One of the key features in Flowdock is tagging. For example, if I’m making changes to our production environment, I mention it in the chat and tag it as #production. Production deployments are automatically tagged with the same tag, so I can easily get a log of everything that’s happened. It’s very easy to implement with MongoDB, since we just index the “tags” array and search using it.
  • 14. db.messages.ensureIndex({flow: 1, tags: 1, id: -1});
  • 15. db.messages.find({flow: 123, tags: {$all: ["production"]}}).sort({id: -1});
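To make the semantics of the tag query concrete, here is a minimal in-memory sketch in plain JavaScript of what `$all` on an array field plus a descending sort on `id` selects. The `findByTags` helper and the sample documents are made up for illustration; MongoDB answers the real query from the compound index rather than by scanning an array like this.

```javascript
// Hypothetical sample data shaped like the "messages" documents in the talk.
const messages = [
  { id: 1, flow: 123, tags: ["production", "deploy"] },
  { id: 2, flow: 123, tags: ["staging"] },
  { id: 3, flow: 123, tags: ["production"] },
  { id: 4, flow: 456, tags: ["production"] },
];

// Illustrative helper (not a MongoDB API): keep documents in the given flow
// whose tags array contains every wanted tag, newest (highest id) first.
function findByTags(docs, flow, wantedTags) {
  return docs
    .filter((doc) => doc.flow === flow)
    .filter((doc) => wantedTags.every((tag) => doc.tags.includes(tag)))
    .sort((a, b) => b.id - a.id);
}

const productionLog = findByTags(messages, 123, ["production"]);
console.log(productionLog.map((doc) => doc.id)); // ids of matches, newest first
```

Passing more than one tag narrows the result, since every listed tag has to be present on the message.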
  • 16. https://jira.mongodb.org/browse/SERVER-380 There’s a JIRA ticket about full-text search for MongoDB. Users have built lots of their own implementations, but the discussion continues.
  • 17. Library support: stemming; ranked probabilistic search; synonyms; spelling corrections; boolean, phrase, and word proximity queries. These are some of the features you might see in an advanced full-text search implementation. There are libraries that do some parts of this (like libraries specific to stemming), and more advanced search libraries like Lucene and Xapian. Lucene is a Java library (also ported to C++ etc.), and Xapian is a C++ library. Many of these are hackable with MongoDB by expanding the query.
  • 18. Three standalone servers compared: (1) Lucene based, rich document support, result highlighting, distributed; (2) Lucene queries, REST/JSON API, real-time indexing, distributed; (3) MySQL integration, real-time indexing, distributed searching. You can use the libraries directly, but they don’t do anything to guarantee replication and scaling. Standalone implementations usually handle clustering, query processing and some more advanced features.
  • 19. Things to consider: data access patterns; technology stack; data duplication; use cases (need to search Word documents? need to support boolean queries? ...). When choosing your solution, you’ll want to keep it simple and consider how write-heavy your app is, what special features you need, and whether you can afford to store the data 3 times in a MongoDB replica set plus 2 times in a search server, etc.
  • 20. Real-time search / Performance. There are tons of use cases where search doesn’t need to be real-time. It’s a requirement that will heavily impact your application.
  • 21. KISS. As a lean startup, we can’t afford to spend a lot of time on technology adventures. We need to measure what customers want. Many of the features are possible to achieve with MongoDB. Facebook’s message search also matches only exact words (=it sucks), and people don’t complain. So we built a minimal implementation with MongoDB. No stemming or anything, just a keyword search, but it needs to be real-time.
  • 22. KISS. Even Facebook does.
  • 23. “Good point. I’ll mark it as deprecated.” becomes _keywords: ["good", "point", "mark", "deprecated"]. You need client-side code for this transformation. What’s possible: stemming and searching by the beginning of a word. What’s not possible: intelligent ranking on the DB side (which is OK for us, since we want to sort results by time anyway).
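The talk does not show the client-side transformation itself, so the following is only a minimal sketch of one way it could work: lowercase the text, split it into word tokens, and drop stop words and duplicates. The `extractKeywords` name and the stop-word list are assumptions chosen so that the slide’s example sentence yields the slide’s keyword list.

```javascript
// Hypothetical stop-word list; the talk does not say which words are dropped.
const STOP_WORDS = new Set([
  "a", "an", "and", "as", "i", "i'll", "is", "it", "of", "the", "to",
]);

// Turn free text into a de-duplicated list of lowercase keywords.
function extractKeywords(text) {
  const normalized = text
    .toLowerCase()
    .replace(/[\u2018\u2019]/g, "'"); // treat curly apostrophes like straight ones
  const tokens = normalized.match(/[a-z0-9']+/g) || [];
  const seen = new Set();
  const keywords = [];
  for (const token of tokens) {
    if (STOP_WORDS.has(token) || seen.has(token)) continue;
    seen.add(token);
    keywords.push(token);
  }
  return keywords;
}

console.log(extractKeywords("Good point. I'll mark it as deprecated."));
```

A stemmer could be applied to each token before it is stored, which is the extension the notes call possible.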
  • 24. db.messages.ensureIndex({flow: 1, _keywords: 1, id: -1}); Simply build the _keywords index the same way we already had our tags indexed.
  • 25. db.messages.find({flow: 123, _keywords: {$all: ["hello", "world"]}}).sort({id: -1}); Search is also trivial to implement. As said, our users want the messages in chronological order, which makes this a lot easier.
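Because the index stores normalized keywords, the words a user types presumably need the same normalization before they go into `$all`. The sketch below shows one way to build the filter document from raw query text; `tokenize` and `buildSearchFilter` are hypothetical helpers, not part of MongoDB or of the code shown in the talk.

```javascript
// Normalize free text into lowercase word tokens (same idea as at index time).
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Hypothetical helper: build the filter document for a keyword search.
function buildSearchFilter(flow, queryText) {
  const words = [...new Set(tokenize(queryText))];
  if (words.length === 0) return null; // nothing to search for
  return { flow: flow, _keywords: { $all: words } };
}

const filter = buildSearchFilter(123, "Hello, WORLD hello");
console.log(JSON.stringify(filter));
// In the mongo shell the result would be used as:
//   db.messages.find(filter).sort({id: -1});
```

Returning `null` for an empty query is a design choice in this sketch, so that the caller can skip the database round trip entirely.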
  • 26. That’s it! Let’s take it to production. A minimal search implementation is the easy part. We faced quite a few operational issues when deploying it to production.
  • 27. Index size: 2500 MB per 1M messages. As it turns out, the _keywords index is pretty big.
  • 28. Chart: 10M messages, size in gigabytes (Messages; Index: Keywords; Index: Tags; Index: Others). It would be great to fit the indices in memory. Now they obviously don’t. Stemming would reduce the index size. The index size also has implications for insert/update performance, for example.
  • 30. Option #1: Just generate _keywords and build the index in the background. The naive solution: try to do it with no downtime. It didn’t work; the site slowed down too much.
  • 31. Option #2: Try to do it during a 6-hour service break. It worked much faster when our users weren’t constantly accessing the data. But 6 hours during a weekend wasn’t enough, and we had to cancel the migration.
  • 32. Option #3: Delete _keywords, build the index and re-generate keywords in the background. Generating an index is much faster when there is no data to index. Building the index was fine, but generating keywords was very slow and took the site down.
  • 33. Option #4: As previously, but add sleep(5). You know you’re a great programmer when you’re adding sleep()s to your production code.
  • 34. Option #5: As previously, but add Write Concerns. Let the queries block, so that if MongoDB slows down, the migration script doesn’t flood the server. Yeah, it would’ve taken a month, or it would’ve slowed down the service.
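The migration script is not shown in the talk, so the following is only a sketch of the throttling idea behind options #4 and #5: process documents in small batches and pause between batches. Everything here is an assumption made for illustration, including the `backfillKeywords` name, the batch size and the injected `pause` callback. It runs against an in-memory array; in a real migration each assignment would be an acknowledged update sent to the server, which is what makes the script block when MongoDB slows down.

```javascript
// Hypothetical throttled backfill: fill in missing _keywords batch by batch,
// calling pause() between batches so the work does not flood the database.
function backfillKeywords(docs, options) {
  const { batchSize, pause, toKeywords } = options;
  let updated = 0;
  for (let start = 0; start < docs.length; start += batchSize) {
    const batch = docs.slice(start, start + batchSize);
    for (const doc of batch) {
      if (doc._keywords === undefined) {
        // Stand-in for an update that would be sent to the server.
        doc._keywords = toKeywords(doc.content.text);
        updated += 1;
      }
    }
    // Pause only if more batches remain.
    if (start + batchSize < docs.length) pause();
  }
  return updated;
}

// Stand-ins for illustration: a counter instead of a real sleep, and a
// naive tokenizer instead of the real keyword extraction.
let pauses = 0;
const sampleDocs = [
  { content: { text: "Good point" } },
  { content: { text: "Deployed to production" } },
  { content: { text: "Build failed" } },
];
const updatedCount = backfillKeywords(sampleDocs, {
  batchSize: 2,
  pause: () => { pauses += 1; },
  toKeywords: (text) => text.toLowerCase().split(/\s+/),
});
console.log(updatedCount, pauses);
```

Skipping documents that already have `_keywords` lets the script be stopped and restarted, which matters when the run takes as long as the notes describe.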
  • 35. Option #6: Shard. It would have been a solution, but we didn’t want to host all that data in memory, since it’s not accessed that often.
  • 36. Option #7: SSD! We had the possibility to try it on an SSD. This is not a viable alternative for AWS users, but AWS users could temporarily shard their data to a large number of high-memory instances.
  • 39. My reactions to using SSD. Decided to benchmark it.
  • 40. Messages: 10M messages in 100 “flows”, 100k each; total size 19.67 GB. Indices: {_id: 1}; {flow: 1, app: 1, id: -1}; {flow: 1, event: 1, id: -1}; {flow: 1, id: -1}; {flow: 1, tags: 1, id: -1}; {flow: 1, _keywords: 1, id: -1}; total size 22.03 GB. This is the starting point for my next benchmark. I wanted to test it with a real-size database, instead of starting from scratch.
  • 41. Chart: mongorestore time in minutes, SSD vs. SATA. First I used mongorestore to populate the test database. 133 vs. 235 minutes, and index generation is mostly CPU-bound. It doesn’t really benefit from the faster seek times.
  • 42. Insert performance test: a total of 100 workspaces, and 3 workers each accessing 30 workspaces, performing 1000 inserts to each = 90,000 inserts, as quickly as possible.
  • 43. Chart: insert benchmark, time in minutes, SSD vs. SATA. 4.25 vs. 155. That’s about 4 minutes vs. 2.5 hours.
  • 44. 9.67 inserts/sec (SATA) vs. 352.94 inserts/sec (SSD). The same numbers as inserts/sec.
  • 45. 36x. A 36x performance improvement with SSD. So we ended up using it in production.
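The rates and the 36x figure follow directly from the benchmark setup (3 workers, 30 workspaces each, 1000 inserts per workspace) and the measured times of 4.25 and 155 minutes. A short check of that arithmetic, with variable names chosen here for illustration:

```javascript
const totalInserts = 3 * 30 * 1000; // workers x workspaces each x inserts each
const ssdMinutes = 4.25;
const sataMinutes = 155;

const ssdRate = totalInserts / (ssdMinutes * 60);   // inserts per second on SSD
const sataRate = totalInserts / (sataMinutes * 60); // inserts per second on SATA
const speedup = sataMinutes / ssdMinutes;           // how many times faster SSD was

console.log(ssdRate.toFixed(2), sataRate.toFixed(2), speedup.toFixed(1));
```

The SATA rate works out to about 9.677 inserts/sec, so the slide’s 9.67 looks truncated rather than rounded, and the speedup is a little over 36.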
  • 46. It works well and searches all kinds of content (here Git commit messages and deployment emails). Queries typically take at most tens of milliseconds.
  • 47. Questions / Comments? @flowdock / otto@flowdock.com This was a very specific full-text search implementation. The fact that we didn’t need to rank search results made it trivial. I’m happy to discuss other use cases. Please share your thoughts and experiences.