Finding Love With MongoDB

  { name    : "Oliver Dodd",
    email   : "oliver.dodd@gmail.com",
    twitter : "01001111"
  }
Traditional Search


  Unidirectional User Defined Criteria
eHarmony Matching


  Bidirectional User Defined Criteria
Matching Overview




  Potential Match Finder  ->  Machine Learned Matching  ->  Match Delivery




  Photo credits
     –  Magnifying glass: andercismo @ http://www.flickr.com/photos/andercismo/
     –  Machine learning: University of Maryland Press Releases @ http://www.flickr.com/photos/umdnews/
     –  Mailman: http://www.flickr.com/photos/noizephotography/
Potential Match Generator


  •  Find candidates that meet user’s
     preferences.

  •  Ensure user doesn’t violate each
     candidate’s preferences.

  •  Discard pairings that violate Compatibility
     Models.

  •  Do this as fast as possible.
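The bidirectional check above can be sketched in plain JavaScript. This is a minimal in-memory illustration, not the production matcher: the field names (`age`, `region`, `prefs.minAge`, `prefs.maxAge`, `prefs.regions`) and the `isCompatible` callback are hypothetical stand-ins for real preferences and Compatibility Models.

```javascript
// Hypothetical preference check: does `person` fit inside `prefs`?
function satisfies(prefs, person) {
  return (
    person.age >= prefs.minAge &&
    person.age <= prefs.maxAge &&
    prefs.regions.includes(person.region)
  );
}

// A pairing survives only if both sides accept each other and the
// caller-supplied compatibility check (an assumed callback) passes.
function potentialMatches(user, candidates, isCompatible) {
  return candidates.filter(
    (candidate) =>
      satisfies(user.prefs, candidate) && // candidate meets the user's preferences
      satisfies(candidate.prefs, user) && // user doesn't violate the candidate's preferences
      isCompatible(user, candidate) // discard pairings the models reject
  );
}
```

In a real deployment the first two checks would presumably be pushed into the database query rather than filtered in application code.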
Legacy “Potential Match Generator”
Redesign


  Requirements for a new data store

     –  Centralized
     –  Scalable
     –  Automagical
     –  Easy to maintain
     –  Fast, multi-attribute searches
New “Potential Match Generator”
Why MongoDB?


  •  Scalability

  •  Built-in sharding and replication

  •  Autobalancing

  •  Rich, complex queries
Why MongoDB?




               MongoDB is web scale.
Wins


  •  Deploy new instances on demand.
       –  No need to load a local database.


  •  Adding replicas is easy and fast.

  •  Fast queries when isolated to a shard.

  •  Flexible schema
       –  No more reloading for minor data model changes.


  •  Built-in iterative fetching.
Losses


  •  No schema = larger footprint (every document stores its own field names).


  •  Traditional DBAs can’t help (without training).

  •  Aggregation queries are drastically different from SQL.

  •  Initial configuration can be a long, manual
     process.
Protips
Use Real Queries




Turn on the fire hose
    When testing or even evaluating, use production data and
    queries.	
  




                        photo by Official U.S. Navy Imagery on Flickr
Use Real Queries




Unleash the Chaos Monkey
    Kill your own mongod instances to ensure your cluster and
    applications continue to function normally.                         	
  




                    photo by dboy @ http://www.flickr.com/photos/dannyboyster/
Minimize


  Minify property names.
      –  Use a mapping library: Morphia in Java or Salat in Scala
           (both also handle queries, but we developed our own generic Query API).
      –  Use one or two characters per property name.


  Consider retrieving full objects from another
  collection or data store, storing only what you
  absolutely need for your queries in the search
  store.
      –  On a related note, cache full objects; cache query results only if
         your queried attributes are small in number.
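A minimal sketch of property-name minification, assuming a hand-maintained mapping. The names in `FIELD_MAP` are hypothetical; a mapping library such as Morphia or Salat would normally do this for you.

```javascript
// Hypothetical mapping from readable names to short stored names.
const FIELD_MAP = new Map([
  ["age", "a"],
  ["region", "r"],
  ["hasPhoto", "hp"],
]);

// Rename the top-level keys of a document or query using the mapping.
// Unmapped keys (including operators such as $or) pass through unchanged;
// nested values are not rewritten in this sketch.
function minify(doc, map = FIELD_MAP) {
  const out = {};
  for (const [key, value] of Object.entries(doc)) {
    out[map.has(key) ? map.get(key) : key] = value;
  }
  return out;
}
```

The same function can be applied to both stored documents and query documents so the application keeps using readable names.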
Indexes


  When performing large, variable, multi-attribute
  searches, have a decent number of indexes.
  Cover the major types of queries and the
  worst-performing outliers.

      –  What is present in every query?


      –  What are the best-performing attributes when present?

      –  What should my index look like when no high-performing
         attributes appear in the query?
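One way to answer the first question is to scan a sample of real queries. The sketch below is an assumption about how that could be done, not the tooling used at eHarmony; it only looks at top-level fields and ignores operators.

```javascript
// Return the fields that appear at the top level of every sampled query.
// Keys starting with "$" (logical operators) are ignored.
function fieldsInEveryQuery(queries) {
  if (queries.length === 0) return [];
  return Object.keys(queries[0]).filter(
    (field) =>
      !field.startsWith("$") &&
      queries.every((query) => Object.keys(query).includes(field))
  );
}
```

Fields returned here are natural candidates for the leading positions of an index.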
Indexes


  Omit ranges unless they are absolutely critical;
  if needed, put them at the end.
      –  Can I replace this with an $in clause?

      –  Can this be prioritized in its own index?

      –  Should there be versions of this index with and without this
         particular attribute?

      –  Will the appearance of this attribute in the index give me any
         speed advantage over inspecting the full object?
Indexes


  Ordering is very, very important.
      –  Attributes for which a user can only have a single value
         should appear towards the top of the index.

      –  Attributes that depend on the values of another attribute
         should appear in immediate succession.

      –  Again, put ranges at the bottom. If multiple ranges are
         necessary, ensure that they appear in order of their ability to
         reduce the working set.

      The order of fields in an index should be:
          First, fields on which you will query for exact values.
          Second, fields on which you will sort.
          Finally, fields on which you will query for a range of values.
                             Eric@MongoLab - http://blog.mongolab.com/2012/06/cardinal-ins/   	
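The quoted ordering can be sketched as a small helper that assembles an index key specification. The function and the field names in the usage note are hypothetical; only the ordering rule comes from the slide.

```javascript
// Build a compound index key spec in the quoted order:
// exact-value fields, then sort fields, then range fields.
// JavaScript objects keep insertion order for non-numeric keys,
// so the returned object preserves that ordering.
function indexSpec({ equality = [], sort = [], range = [] }) {
  const spec = {};
  for (const field of equality) spec[field] = 1;
  for (const [field, direction] of sort) spec[field] = direction; // 1 or -1
  for (const field of range) spec[field] = 1;
  return spec;
}
```

For example, `indexSpec({ equality: ["gender", "region"], sort: [["lastLogin", -1]], range: ["age"] })` yields a spec that could be passed to `createIndex` in the mongo shell.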
  
Indexes


  Analyze slow queries to find out what attributes
  you can capitalize on.

  When building a compound index, don’t include
  fields that appear only inside the $or clause of a
  multi-attribute query.
          db.toasters.find({
             slots: 4,
             canBagel: true,
             $or: [
               { material: "stainless-steel"},
               { price: {$lte: 50}},
             ]
          })
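For the toaster query above, that advice leaves `slots` and `canBagel` as index candidates. A minimal sketch, assuming a simplified rule that skips every top-level operator key (so it would also skip `$and`); `indexCandidates` is a hypothetical helper:

```javascript
// Collect the top-level fields of a query as compound-index candidates,
// skipping operator keys such as $or so their fields stay out of the index.
function indexCandidates(query) {
  const spec = {};
  for (const key of Object.keys(query)) {
    if (!key.startsWith("$")) spec[key] = 1;
  }
  return spec;
}

const toasterQuery = {
  slots: 4,
  canBagel: true,
  $or: [{ material: "stainless-steel" }, { price: { $lte: 50 } }],
};

const toasterIndex = indexCandidates(toasterQuery);
```

Here `toasterIndex` contains only `slots` and `canBagel`; `material` and `price` are left out.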
Queries – Ranges


  Translate "between" queries to $in clauses when
  dealing with discrete values.

      $and: [
         {a: { $gte: 0}},
         {a: { $lte: 5}}
      ]

      becomes


      a: { $in: [0,1,2,3,4,5]}
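The translation can be generated rather than written by hand. A minimal sketch for inclusive integer ranges; `betweenToIn` is a hypothetical helper name:

```javascript
// Expand an inclusive integer range into an equivalent $in clause.
function betweenToIn(field, low, high) {
  if (!Number.isInteger(low) || !Number.isInteger(high) || low > high) {
    throw new RangeError("expected integers with low <= high");
  }
  const values = [];
  for (let v = low; v <= high; v++) values.push(v);
  return { [field]: { $in: values } };
}
```

This only makes sense when the range is small; a very long `$in` list gives up the benefit.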
Attributes - Decrease Granularity




  birthdate => birthyear

  floats => ints

  number_of_items => has_items?
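A minimal sketch of these three reductions applied to one document before it goes into the search store. The `height` field and the `coarsen` helper are hypothetical; rounding is just one possible way to turn a float into an int.

```javascript
// Coarsen a hypothetical profile document for the search store.
function coarsen(profile) {
  return {
    birthyear: profile.birthdate.getUTCFullYear(), // birthdate => birthyear
    height: Math.round(profile.height), // floats => ints
    has_items: profile.number_of_items > 0, // number_of_items => has_items?
  };
}
```

Coarser values are also what makes replacing ranges with $in clauses practical.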
Sharding


  •  Try to isolate queries to a particular shard.

  •  Ensure that your data and indexes can fit
     entirely in memory.

  •  If certain attributes ALWAYS appear in the
     query and, in combination, give you a large
     number of well distributed data partitions,
     consider making them the shard key.
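Before committing to a shard key, it may help to check how a candidate combination of attributes partitions a sample of the data. The sketch below is one assumed way to do that in application code; the field names in the usage note are hypothetical.

```javascript
// Count how many documents fall into each combination of the
// candidate shard-key fields. Keys of the returned Map are the
// JSON-encoded value tuples, e.g. '["m","CA"]'.
function partitionCounts(docs, fields) {
  const counts = new Map();
  for (const doc of docs) {
    const key = JSON.stringify(fields.map((field) => doc[field]));
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}
```

Calling `partitionCounts(sampleDocs, ["g", "r"])` shows both the number of partitions (`counts.size`) and how evenly documents spread across them.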
We’re Hiring




               http://www.eharmony.com/about/careers
  
