3. big dating at scale
3B+ potential matches daily ~ 25+ TB of data
60M+ multi-attribute queries daily looking across 250+
212M+ photos ~ 15+ TB of data
4B+ relationship questionnaires ~ 25+ TB of data
4. the big win for product
Decreased the processing time to match by 95%,
from 2+ weeks to 12 hours
on 3B+ potential matches/day
30% increase in 2-way communications
50% increase in paid subs
60% increase in unique visitors
No schema = larger footprint
Aggregation queries are different
Initial configuration can be long, manual process
22. lessons learned
Turn on the Firehose
Unleash the Chaos Monkey
Engage MongoDB, Inc. early – dev to production
Try to isolate your queries to a shard
Run in shadow mode
23. what’s next
New matching use cases:
Globalization and Localization of eH site
Careers by eHarmony
Internet of Things “Compatible”
New use cases within eHarmony:
Real-time geo location based matching service
So here’s the agenda for today
First, I’ll talk about our compatibility matching system – the key to generating all those happy couples and satisfied marriages I talked about before.
Then I’ll talk to you about the old system, how it was architected, and where we ran into problems.
Then I’ll talk about the new system – our requirements, the technologies we evaluated and why we selected MongoDB.
Finally, I’ll discuss some of the lessons we learned during the MongoDB migration and the new potential use cases we’re considering MongoDB for.
eHarmony’s secret sauce is our Compatibility matching system
It consists of a sophisticated 3-tier process.
1. Compatibility Matching identifies potentially compatible matches based on users’ core compatibility across the 29 dimensions of psychology and personality traits, and also on user preferences.
2. Affinity Matching predicts the probability of communication between 2 people. That is, will these two people want to connect? Even if 2 people are very compatible because they have similar beliefs or interests, that doesn’t mean they will want to connect; there may be other reasons not to. For example, they could be in completely different age groups, live 3K miles apart, or not be attracted to each other.
3. Match Distribution helps to ensure we deliver the right matches to the right users at the right time, and to as many users as possible across our entire network.
For the purpose of this talk I’ll stay mostly on the Compatibility matching system, allowing us to focus more on the usage of MongoDB.
The Compatibility matching component is a two-step process
Traditional search is unidirectional
To understand that let’s take a look at Nikki as an example.
In this scenario, Nikki is in the market for toasters.
All that matters in a one-way search is returning the toasters that meet the criteria Nikki has specified. Whichever toaster she takes home, the poor toaster has no choice in the matter.
But dating is more complex than this, especially when we’re trying to create a meaningful, romantic connection between 2 people.
Dating is bidirectional – both people need to want to be with one another
At eHarmony we developed a bidirectional system to make sure that user preferences are met both ways, that is, mutually.
Take Nikki as an example again. This time she’s not looking for toasters on Amazon. She’s on eHarmony. We also have some other eHarmony users, for example Jeb, Jon and Nick.
First, we need to consider only those that meet Nikki’s criteria. In this case, that’s only Jeb and Jon.
For us to have a match, Nikki also needs to meet the criteria specified by Jon or Jeb.
In this case, that’s only Jon.
What are some of the criteria we’re talking about? These are simple things like age, distance, religion, ethnicity, and income/education.
This completes the first part of the matching system
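The two-way filter in the Nikki example can be sketched in a few lines of Python. This is only an illustration of the idea, not eHarmony's implementation: the `User` fields, the ages, the cities, and the preference ranges are all made-up assumptions, chosen so that the outcome mirrors the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    # All attributes and preference values here are hypothetical.
    name: str
    age: int
    city: str
    min_age: int            # youngest partner this user will accept
    max_age: int            # oldest partner this user will accept
    cities: frozenset       # cities this user is willing to match in


def accepts(seeker: User, candidate: User) -> bool:
    """True if the candidate satisfies the seeker's stated criteria."""
    return (seeker.min_age <= candidate.age <= seeker.max_age
            and candidate.city in seeker.cities)


def one_way_matches(seeker, candidates):
    """Unidirectional search: only the seeker's criteria matter."""
    return [c for c in candidates if accepts(seeker, c)]


def two_way_matches(seeker, candidates):
    """Bidirectional search: the criteria must hold in both directions."""
    return [c for c in one_way_matches(seeker, candidates) if accepts(c, seeker)]


nikki = User("Nikki", 30, "LA", 28, 40, frozenset({"LA", "SD"}))
others = [
    User("Jeb", 35, "LA", 33, 45, frozenset({"LA"})),         # Nikki is too young for Jeb
    User("Jon", 32, "SD", 25, 35, frozenset({"LA", "SD"})),   # mutual fit
    User("Nick", 45, "LA", 25, 35, frozenset({"LA"})),        # too old for Nikki
]

print([u.name for u in one_way_matches(nikki, others)])  # Jeb and Jon
print([u.name for u in two_way_matches(nikki, others)])  # only Jon
```

The second pass simply swaps the roles of seeker and candidate, which is what makes the search bidirectional.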
So why was MongoDB selected?
It provided the best of both worlds – it supported fast, multi-attribute searches and powerful indexing features together with a dynamic, flexible data model.
Supported Auto-Scaling – Any time we want to handle more load, we just add a shard to our sharded cluster, and if a shard is getting hot, we add another replica to its replica set.
Built-in sharding – so we can scale out our big data horizontally on top of commodity machines while still maintaining high-performance throughput.
Auto-balancing of data across the shards, automatically and seamlessly, so the client application doesn’t have to worry about the internals of how the data is stored and managed.
There were also other benefits including:
Ease of management – This is very important to us when we have a small Ops team managing 1K+ servers and 2K+ other devices in our primary datacenter
Open-Source + Commercial Entity – It’s open source with good community support, plus professional enterprise support from the MongoDB team.
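To picture what built-in sharding hides from the client, here is a toy router in Python. It is a deliberately simplified sketch: real MongoDB assigns documents to shards through chunk ranges managed by the cluster, not through the modulo hashing used here, and the `user_id` shard key and shard names are assumptions made for the illustration.

```python
import hashlib

# Three illustrative shards; the names are hypothetical.
SHARDS = ["shard0", "shard1", "shard2"]


def shard_for(user_id):
    """Map a shard-key value to one shard (toy stand-in for chunk routing)."""
    digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]


def shards_to_query(query):
    """Decide which shards a query has to touch.

    A query that includes the shard key can be sent to a single shard.
    Any other query has to be broadcast to every shard and the results merged.
    """
    if "user_id" in query:
        return [shard_for(query["user_id"])]
    return list(SHARDS)


print(shards_to_query({"user_id": 42}))          # one shard
print(shards_to_query({"age": {"$gte": 30}}))    # all shards
```

The application only ever calls the router; which shard holds which document stays an internal detail.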
In Q1 2013, we deployed 12 Mongos with 3 shards (3x4)
In Q2 2014, we increased to 18 Mongos with 3 shards (3x6)
Query caching solution to maximize the throughput and performance
So what are the tradeoffs when deploying MongoDB?
1. MongoDB is a schema-free datastore, so the data format or data representation is repeated in every document in the collection. It therefore requires a lot more storage space, which translates to a larger footprint.
3. Aggregation queries in MongoDB are very different from traditional SQL aggregation functions. That results in a paradigm shift from a DBA focus to an Engineering focus.
4. Lastly, the initial configuration/migration can be a long, manual process due to the lack of automated tooling on the MongoDB side. I was told that the new version of the MMS dashboard will include automated provisioning and configuration, software upgrades, and point-in-time recovery/backup. This is fantastic news for the Mongo community.
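The footprint point can be made concrete with a rough measurement. The sketch below uses JSON byte counts as a stand-in for MongoDB's BSON storage format, which is an assumption made to keep the example dependency-free; the absolute numbers will differ from real on-disk sizes, and the field names are invented. What carries over is that every document pays for its own field names.

```python
import json


def encoded_size(docs):
    """Total UTF-8 byte count of the documents as compact JSON (a proxy for BSON)."""
    return sum(len(json.dumps(d, separators=(",", ":")).encode("utf-8")) for d in docs)


# Hypothetical documents with descriptive field names.
verbose = [
    {"relationship_questionnaire_answer": i % 7, "compatibility_dimension_score": i % 29}
    for i in range(1000)
]

# The same values stored under one-letter field names.
compact = [
    {"a": d["relationship_questionnaire_answer"], "s": d["compatibility_dimension_score"]}
    for d in verbose
]

overhead = encoded_size(verbose) - encoded_size(compact)
print("verbose:", encoded_size(verbose), "bytes")
print("compact:", encoded_size(compact), "bytes")
print("bytes spent repeating longer field names:", overhead)
```

In a relational table the column names are stored once in the schema, so this per-row cost does not exist there.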
There were a few key lessons we learned during the MongoDB migration:
1. Turn on the Firehose
- When testing or even evaluating, use production data and queries to ensure that you have an apples-to-apples comparison in terms of performance and scalability metrics
2. Unleash the Chaos Monkey (LT)
During your load testing, kill some of your mongod instances to make sure your cluster and applications continue to function normally.
3. Involve the MongoDB team from the start even during the POC
They provide the best architectural guidance and support on data modeling, indexing strategy, and query optimization, and they can help you with your Mongos production topology, with proper monitoring (MMS), and with integration into your internal monitoring system.
4. Select a good shard key such that most of your queries can be isolated to a shard, so that mongos does not have to wait to collect results from all shards.
5. Run in shadow mode:
- The matching infrastructure is based on the event-driven Service-Oriented Architecture (SOA) model
- It’s easy for us to have 2 CMP clusters running from the same distributed messaging System.
- Basically, the messages from real production traffic were replicated to both clusters (Postgres and MongoDB), so we were able to optimize the MongoDB cluster (shard key, query indices) without affecting our production users. Once we had certified the solution, we simply switched to the MongoDB-based cluster.
Tuning - (Shard Key, Query Indices) / Enhance Code / Increase Mongo Cluster capacity
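The shadow-mode lesson can be sketched as follows. In the setup described above the duplication happened in the distributed messaging system, with two full clusters consuming the same traffic; the in-process `ShadowDispatcher` below is a simplified stand-in for that, and the two handler functions are hypothetical placeholders for the Postgres-backed and MongoDB-backed services.

```python
class ShadowDispatcher:
    """Send every message to both systems, but serve only the primary's answer."""

    def __init__(self, primary, shadow):
        self.primary = primary
        self.shadow = shadow
        self.mismatches = []  # evidence to review before switching over

    def handle(self, message):
        result = self.primary(message)
        try:
            shadow_result = self.shadow(message)
            if shadow_result != result:
                self.mismatches.append((message, result, shadow_result))
        except Exception as exc:
            # A shadow failure must never affect production users.
            self.mismatches.append((message, result, exc))
        return result


def postgres_backed(message):
    # Hypothetical existing service.
    return sorted(message["candidates"])


def mongo_backed(message):
    # Hypothetical new service with a behavioural difference (it de-duplicates).
    return sorted(set(message["candidates"]))


dispatcher = ShadowDispatcher(postgres_backed, mongo_backed)
first = dispatcher.handle({"user": "nikki", "candidates": ["jon", "jeb"]})
second = dispatcher.handle({"user": "nick", "candidates": ["ann", "ann"]})

print(first)                       # answer served from the primary
print(len(dispatcher.mismatches))  # differences found in shadow
```

Reviewing the recorded mismatches is what allows the new cluster to be tuned and certified against real traffic before the switch.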
What’s next to come from eHarmony:
Our core mission is to make people’s lives better and happier, whether by helping them find the love of their lives across multiple locales and languages or by helping them find the right job.
Our online dating sites in AU and UK have been very profitable, so we plan to expand that success to 20 other countries in the next few years, starting with English-speaking countries, followed by Spanish, French and other languages.
We’re also working on a new job compatibility vertical, which we call “Careers by eHarmony”, and we plan to launch the Beta site in December of this year. We have known for a long time that it’s really hard to make a marriage work if you’re not happy with your job. 65+% of people in America are not happy with the job they currently have, and they could be if they were matched with the right job based on the culture of the company and the personality of the person they will be reporting to, in addition to their skills.
Here are some potential use cases we may consider using MongoDB for:
1. Real-time geo-location-based matching services that leverage MongoDB spatial indexes and queries
2. We are also exploring MongoDB as one of the datastores for the new Jobs Compatibility vertical
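For the geo-location use case, a query against a MongoDB `2dsphere` index might look roughly like the sketch below. It only builds the index specification and the query document as plain Python data and does not connect to a database. The `location` field name, the coordinates, and the radius are assumptions for illustration, while `$nearSphere`, `$geometry` and `$maxDistance` (in meters for GeoJSON points) are standard MongoDB query operators.

```python
# Index specification in the (field, type) form accepted by MongoDB drivers.
# The "location" field name is hypothetical.
INDEX_SPEC = [("location", "2dsphere")]


def near_query(longitude, latitude, max_km):
    """Build a filter for documents within max_km of a point, nearest first."""
    return {
        "location": {
            "$nearSphere": {
                # GeoJSON points are ordered [longitude, latitude].
                "$geometry": {"type": "Point", "coordinates": [longitude, latitude]},
                # For GeoJSON points, $maxDistance is expressed in meters.
                "$maxDistance": max_km * 1000,
            }
        }
    }


# Example: users within 50 km of downtown Los Angeles.
query = near_query(-118.2437, 34.0522, 50)
print(query)
```

With a driver such as pymongo, a document like `query` would be passed as the filter to `find` on a collection that has the index above.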
Here are some of our major technology investments, which help us solve complex engineering problems and provide long-term maintainability, scalability and innovation at eHarmony.
1. We use a lot of Scala as the functional programming language to implement our CMS and Affinity matching models.
2. We heavily use Hadoop/Hive on top of Yarn for our data mining, massive data processing, and R (Revolution) as the programming language for data science and predictive analytics in our machine learning models.
3. We use Node.js/HTML5/Backbone to implement our public facing eHarmony web applications for both Mobile Web and Desktop.
We have lots of open positions right now, so if you’re interested in being part of a great cause and a great culture, and in working on the coolest and most cutting-edge technologies,
reach out to me directly on @LinkedIn or apply directly @jobs.
That would require too much detail given the short time that we have, so let’s connect afterward to discuss.