Making Reddit Search Relevant and Scalable - Anupama Joshi & Jerry Bao, Reddit

Making Reddit Search
Relevant and Scalable
Anupama Joshi
Senior Engineering Manager, Search
Jerry Bao
Senior Software Engineer, Search

Agenda
• What is Reddit?
• Search Architecture
• Improving our Relevance
• The History of Search @ Reddit
• Scaling our Infrastructure
• Q&A

What is Reddit?
Reddit is a network of communities where
individuals can find experiences built
around their interests, hobbies and
passions
It’s where people converse about the
things that are most important to them

Our mission
Bring community and
belonging to everyone

Reddit by the numbers
● Alexa Rank (US/World): 5th/18th
● MAU: 400M+
● Communities: 1M+
● Posts per day: 440K+
● Comments per day: 3.5M+
● Votes per day: 82M+
● Searches per day: 68M+

So, what are
we doing with
all that
power?

Dog getting love
51.2k points (95% upvoted)
Cat Fist Bumping
137.1k points (90% upvoted)
817.2k views

Wait, it’s not just
cat/dog pictures!

Community > Content > Individual
● Authenticity
● Creative freedom
● Empathy @ scale
● Belonging
● Being heard

r/assistance
Empathy and support at scale

Reddit’s Community of Support

None of that matters if you can’t
FIND the content! So let’s talk about
Search...

User Retention
● New users who
searched are 300%
more likely to come
back between D1 to
D14
● > 50% of all mobile
users search

Show and Tell: A better subreddit search
Challenge: Redditors are very creative in their subreddit naming (e.g. r/superbowl
is about superb owl pictures), which, whilst fun, poses a challenge for discovery.
Answer: faceted search on posts!

Result: A better subreddit search

Show and Tell: Better Post Search
● Post search with phrase matching of selftext
The challenge: What about images and link posts?
Answer - Comments
● Comments are important but which comments are most relevant to the post?
● How do we separate the signal from the noise?
Answer - HVT
● HVTs are the highest scoring tf-idf terms from comment sections.
● Index and match on these HVTs along with post selftexts and titles.
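The HVT idea above can be illustrated with a minimal sketch. It assumes each post's whole comment section is treated as one document and uses a plain tf-idf formula; the tokenizer, the `high_value_tokens` name, and the sample comments are hypothetical and are not taken from Reddit's pipeline.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def high_value_tokens(comments_by_post, top_k=3):
    """Return the top_k highest-scoring tf-idf terms per post's comment section."""
    # Assumption: one "document" per post, made of all of its comments.
    docs = {
        post_id: Counter(tok for comment in comments for tok in tokenize(comment))
        for post_id, comments in comments_by_post.items()
    }
    n_docs = len(docs)

    # Document frequency: in how many comment sections does each term appear?
    doc_freq = Counter()
    for counts in docs.values():
        doc_freq.update(counts.keys())

    result = {}
    for post_id, counts in docs.items():
        total = sum(counts.values())
        scores = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result[post_id] = ranked[:top_k]
    return result


# Hypothetical sample data.
comments_by_post = {
    "post_a": ["shabooya roll call", "shabooya sha sha shabooya"],
    "post_b": ["what a cute dog", "cute dog indeed"],
    "post_c": ["roll the dice", "what a roll"],
}
hvts = high_value_tokens(comments_by_post, top_k=2)
```

The resulting terms would then be indexed as an extra field next to the post title and selftext, so a query can match a post whose own text never contains the term.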

Result: Better Post Search
Qualitatively, we saw some users notice almost immediately when we first introduced HVTs.
For some queries, the difference is
quite stark. The following are
search results for the query
‘shabooya’. Note how ‘shabooya’
doesn’t appear anywhere in the title
or the body of the first three post
results, but you can see the phrase
show up in the comments.

Result: Better Post Search
● Post click-through rate (CTR) (+3.15%)
● Relevancy ranking for navigational searches (MRR) (+4.01%)
● Search experience improvements for navigational searches due to increased
recall on posts with poor title or body text

Take It to the Next Level: Improve Search Relevance
● Learn from users' click statistics to automatically generate a relevancy
model
● Rerank search results using aggregated click signal weights, so that results
users click more often for a given query rank higher
○ Stream user events into the Solr/Fusion cluster
○ Spark jobs aggregate the click data
○ Use output from the aggregated signal to boost the search results
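A small sketch of the aggregate-then-boost idea, in plain Python rather than Spark. The multiplicative boost formula, the `boost_weight` parameter, and the function names are assumptions made for illustration; the slides do not say how Fusion weights the signals.

```python
from collections import defaultdict


def aggregate_clicks(events):
    """Count clicks per (query, doc_id) from an iterable of (query, doc_id) events."""
    counts = defaultdict(lambda: defaultdict(int))
    for query, doc_id in events:
        counts[query.strip().lower()][doc_id] += 1
    return counts


def rerank(query, results, click_counts, boost_weight=1.0):
    """Reorder (doc_id, base_score) results by click-boosted score.

    Assumed formula: boosted = base * (1 + boost_weight * click_share),
    where click_share is the document's share of all clicks for the query.
    """
    clicks = click_counts.get(query.strip().lower(), {})
    total = sum(clicks.values())

    def boosted(item):
        doc_id, base_score = item
        share = clicks.get(doc_id, 0) / total if total else 0.0
        return base_score * (1.0 + boost_weight * share)

    return [doc_id for doc_id, _ in sorted(results, key=boosted, reverse=True)]


# Hypothetical click events and base relevance scores.
events = [("cats", "p2"), ("cats", "p2"), ("cats", "p1"), ("Cats ", "p2")]
click_counts = aggregate_clicks(events)
base_results = [("p1", 1.0), ("p2", 0.8), ("p3", 0.7)]
reranked = rerank("cats", base_results, click_counts)
unseen = rerank("dogs", base_results, click_counts)
```

Queries with no recorded clicks fall back to the base ranking, which keeps the boost from affecting tail queries that have no signal yet.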

Result: Post search relevance using signals
● 7.5% increase in CTR
● 12.5% increase in MRR

Result: Subreddit search relevance using signals

Head-Tail Analysis
● Spelling corrections.
● Tail Query Rewriting.
● Specific Dictionary based Rewriting

Head-Tail Analysis
A tail query like “lot of credit card debit” would be rewritten to produce more relevant results.

Trending Searches
● Reddit can attribute week-over-week DAU
growth to external events, like game
releases, movie releases, and cultural
events (reference).
● We see similar upticks in searches based on
these events (reference).
● We believe that we can increase search
engagement and time on site by leveraging
these signals to highlight trending queries
to users when they search on Reddit.

NSFW Categorization
● Develop NSFW classification criteria
● Query Time classification based content filtering.
● Results boosting/reordering based on classification (boost or filter results
based on knowing whether the query has NSFW intent)
● Look at the NSFW results in recall
● Look at the NSFW results people clicked
● Try open source TensorFlow libraries to automatically detect NSFW content that is not
marked NSFW

Related Searches
● Train a collaborative filtering matrix decomposition recommender using
SparkML's Alternating Least Squares (ALS) to batch compute query-query
similarities
● Related Searches backend based on Collaborative Filtering & Co Occurrence
Counting Algorithm via Temporal Proximity
● Collaborative filtering based recommender systems are a popular technique
applied for movie recommendations at Netflix, or product recommendations in
e-commerce sites like Amazon

Related Searches
● Dynamic temporal buckets as source of data.
● All pairs irrespective of number of distinct queries in Session
● Length & temporal distance metrics to help with boosting recommendation.
● Intuitive & easily explainable.
● Scales extremely well for building pluggable logic & adding more dimensions.
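The co-occurrence approach above could look roughly like the following sketch. The five-minute session gap, the `1 / (1 + minutes apart)` weighting, and the function names are assumptions chosen for illustration, not the values used in production.

```python
from collections import defaultdict
from itertools import combinations


def sessionize(events, max_gap_seconds=300):
    """Split one user's (timestamp_seconds, query) events into temporal buckets."""
    sessions, current = [], []
    for ts, query in sorted(events):
        # Start a new bucket when the gap to the previous query is too large.
        if current and ts - current[-1][0] > max_gap_seconds:
            sessions.append(current)
            current = []
        current.append((ts, query))
    if current:
        sessions.append(current)
    return sessions


def related_searches(user_events, max_gap_seconds=300):
    """Score every query pair in each session, weighting closer pairs higher."""
    scores = defaultdict(lambda: defaultdict(float))
    for events in user_events:
        for session in sessionize(events, max_gap_seconds):
            # All pairs, regardless of how many distinct queries the session has.
            for (t1, q1), (t2, q2) in combinations(session, 2):
                if q1 == q2:
                    continue
                weight = 1.0 / (1.0 + abs(t2 - t1) / 60.0)
                scores[q1][q2] += weight
                scores[q2][q1] += weight
    # Most strongly related first; ties broken alphabetically.
    return {q: sorted(rel, key=lambda r: (-rel[r], r)) for q, rel in scores.items()}


# Hypothetical query logs for two users.
user_events = [
    [(0, "keto"), (30, "keto recipes"), (120, "fasting"), (4000, "cats")],
    [(0, "keto"), (20, "keto recipes")],
]
related = related_searches(user_events)
```

Because the score is a sum of simple per-pair weights, each recommendation can be traced back to the sessions that produced it, and further dimensions can be added as extra factors in the weight.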

Related Searches
*Query* —> *Related Searches*
*learn* —> `learn programming`, `learn python`, `learn javascript`, `learn French`, `learn java`, `piano learn`
*cats* —> `cat`, `aww`, `dogs`, `r/cats`, `r/comics`, `kittens`, `funny`, `pets`
*dogs* —> `dog`, `dogs`, `aww`, `isle of dogs`, `isle of dogs discussion`, `cute dogs`, `pets`
*infinity war* —> `avengers`, `piracy`, `infinity war stream`, `infinity war hd`, `avengers infinity war`, `avengers infinity war stream`, `deadpool2`, `infinity war torrent`
*coming out* —> `gay`, `lgbt`
*makeup* —> `beauty`, `make up`, `makeupaddiction`, `skincare`, `foundation`, `eyeshadow`, `wedding`
*keto* —> `snacks`, `r/keto`, `r/progresspics`, `xxketo`, `keto recipes`, `keto diet`, `fasting`
*programming* —> `r/politics`, `r/programming`, `programming`, `python`, `coding`, `learnprogramming`, `r/golang`, `r/programming`
*Cohen* —> `sacha baron cohen`, `sasha baron cohen`, `who is america`, `trump`, `jason spencer`, `sacha cohen`, `sasha cohen`
*photography* —> `photo`, `r/Nikon`, `r/photography`, `camera`, `photos`, `art`, `r/bestof`, `instagram`
*blep* —> `mlem`

What’s next
● Contextual Query Understanding
○ how context informs query understanding
● Understanding User Intent
○ classifying the query by its interpretation. The interpretation of the query can then be used to
define intent
● Query rewriting and scoping
○ query rewriting technique that improves precision by matching each query segment to the right
attribute
○ query tagging (special case of named-entity recognition (NER))

Reddit Search has an
interesting history...
History of Reddit Search

History of Reddit Search
● 2005 - Steve Huffman, cofounder and now CEO, implements postgres tsearch.
● 2006 - Chris Slowe, founding engineer and now CTO, implements pylucene.
○ “we fixed a bug in the search results ordering” - Steve Huffman ‘06
○ “I made a quick fix to search that I hope helps until we get a chance to really fix it.” - Steve ‘07
● 2008 - David King, first employee and former search engineer, implements Solr.
○ “[David]’s been fixing search and hacking mystery projects in Erlang.” - Alexis Ohanian ‘08
○ “I’ve totally replaced the reddit search function.” - David King ‘08
● 2010 - David King replaces Solr with IndexTank.
○ “We launched a new search engine yesterday. Calm down. It’s okay. I know. You’ve been hurt
before.” - David King ‘10
● 2012 - u/kemitche implements CloudSearch after LinkedIn shut down IndexTank
“Q: Where do you see reddit in 10 years? A: Reddit search might work by then.” - Steve AMA ‘16

Redditors told us how
much they loved
Search...
“Reddit Search is great!” - said no redditor ever

“This image should honestly replace the 503 error (all servers busy) page.” - u/seven0feleven

“Ever since they moved away from scotch tape, I've been able to get irrelevant results in record time.” - u/El_Bandito_Blanquito

In 2017, we set out to
rebuild search from the
ground up!
Rebuilding Search

Our First Cluster
● Create an AMI with Solr and Fusion packages installed
● Spin up servers with custom AMI
● SSH into each server
○ Install Fusion and Solr
○ Edit configuration files
○ Increase file descriptor limit
● Configured in AWS US West

Our First Cluster
Our new cluster was up
and running well! We
immediately started work
on ingesting data and
relevance tuning.

But we ran into a
couple of key issues
when trying to scale
up...
Challenge #1

Issues with Scaling our Solr Cluster
● Adding capacity to our cluster or changing instance types took a lot of
effort
● Adding capacity to our cluster meant that we needed to rebalance it
so that our replicas were equally distributed across machines
○ Solr 7+ introduced some basic autoscaling features but lacked
policies to ensure a cluster was properly balanced
○ Rebalancing process was 100% manual
● Cross-region requests added unnecessary latency
● As a result, our team was very cautious in scaling our cluster until it
was absolutely needed, to reduce the number of times we scaled up

Terraform and Puppet
everything!
Automate all the things!

Terraform + Puppet
● Together they allow us to programmatically make changes to
infrastructure and server configuration quickly
● We can describe how we want servers to be set up
○ Install Java and Solr
○ Mount drives and add user groups/permissions
○ Set up Solr configuration files
● Modifications to servers and infra are reviewable and revertible
● Roll out changes across our fleet with ease
● “Can you add more servers Jerry??”
○ No problem! One line code change.

Equally Distribute by Availability Zone
subreddits
● shard 1: replica 1 on solr-01 (us-east-1a), replica 2 on solr-02 (us-east-1b), replica 3 on solr-03 (us-east-1c)
● shard 2: replica 1 on solr-01 (us-east-1a), replica 2 on solr-02 (us-east-1b), replica 3 on solr-03 (us-east-1c)
● shard 3: replica 1 on solr-01 (us-east-1a), replica 2 on solr-02 (us-east-1b), replica 3 on solr-03 (us-east-1c)

No More Than 1 Replica From Same Shard
subreddits
● shard 1: replica 1 on solr-01 (us-east-1a), replica 2 on solr-02 (us-east-1b), replica 3 on solr-03 (us-east-1c)
● shard 2: replica 1 on solr-01 (us-east-1a), replica 2 on solr-02 (us-east-1b), replica 3 on solr-03 (us-east-1c)
● shard 3: replica 1 on solr-01 (us-east-1a), replica 2 on solr-02 (us-east-1b), replica 3 on solr-03 (us-east-1c)

Equally Distribute Collection’s Replicas
[Diagram: a cluster of solr-01 (us-east-1a), solr-02 (us-east-1b) and solr-03 (us-east-1c) hosting replicas 1-3 of subreddits shard 1 and replicas 1-3 of posts shards 1, 2 and 3]
subreddits - 1 shard; posts - 2 shards; each shard has 3 replicas

Equally Distribute Cluster’s Replicas
[Diagram: a cluster whose machines host replicas 1-3 of subreddits shard 1 and replicas 1-3 of posts shards 1, 2 and 3]
subreddits - 1 shard; posts - 2 shards; each shard has 3 replicas

Solr Rebalancing Tool
● Applies balancing rules in order
○ Check each shard’s availability zone distribution and replica
distribution
○ Move replicas so that each collection’s replicas are spread across
as many machines as possible
○ Move replicas so that each machine has the fewest replicas
possible
● Outputs a list of operations to be performed and asks the user to
confirm each replica move
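As an illustration of the first rule only, the sketch below proposes moves so that no machine holds two replicas of the same shard, preferring a target in an availability zone the shard does not use yet and then the least-loaded machine. The data layout, the `plan_moves` name, and the tie-breaking order are assumptions; the real tool's logic is not shown in the slides, and this sketch only plans moves without executing or confirming them.

```python
from collections import Counter, defaultdict


def plan_moves(replicas, node_az):
    """Propose moves so that no node holds two replicas of the same shard.

    replicas: list of dicts with keys "collection", "shard", "replica", "node".
    node_az:  dict mapping node name to availability zone.
    Returns a list of (replica_name, from_node, to_node); nothing is moved.
    """
    placement = [dict(r) for r in replicas]  # copy so the input is not mutated
    load = Counter({node: 0 for node in node_az})
    shard_nodes = defaultdict(Counter)  # (collection, shard) -> node -> count
    for r in placement:
        load[r["node"]] += 1
        shard_nodes[(r["collection"], r["shard"])][r["node"]] += 1

    moves = []
    for r in placement:
        nodes = shard_nodes[(r["collection"], r["shard"])]
        if nodes[r["node"]] <= 1:
            continue  # this replica does not share its node with a sibling
        occupied = {n for n, count in nodes.items() if count > 0}
        used_azs = {node_az[n] for n in occupied}
        candidates = [n for n in node_az if n not in occupied]
        if not candidates:
            continue  # more replicas than machines; nothing better to offer
        # Prefer an unused AZ, then the least-loaded node, then name order.
        target = min(candidates, key=lambda n: (node_az[n] in used_azs, load[n], n))
        name = f'{r["collection"]}/shard{r["shard"]}/replica{r["replica"]}'
        moves.append((name, r["node"], target))
        for counter in (nodes, load):
            counter[r["node"]] -= 1
            counter[target] += 1
        r["node"] = target
    return moves


# Hypothetical cluster state: two replicas of posts shard 1 share solr-01.
node_az = {"solr-01": "us-east-1a", "solr-02": "us-east-1b", "solr-03": "us-east-1c"}
replicas = [
    {"collection": "posts", "shard": 1, "replica": 1, "node": "solr-01"},
    {"collection": "posts", "shard": 1, "replica": 2, "node": "solr-01"},
    {"collection": "posts", "shard": 1, "replica": 3, "node": "solr-02"},
]
moves = plan_moves(replicas, node_az)
for name, src, dst in moves:
    print(f"move {name}: {src} -> {dst}")
```

Returning a plan instead of acting on it matches the confirm-before-move behaviour described above: an operator can review each proposed operation before anything changes in the cluster.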

Cross-Region Latency Improvement
4x faster
queries!

Our cluster was now
scaling easily, but
reindexing all of our
data took many
weeks...
Challenge #2

Indexing Data for Search
● Backfills
○ Pulls data from our datasource
○ Transforms it into the schema we need for indexing
○ Used to add/remove/change field indexing
● Streaming
○ Captures real-time updates so up-to-date information can be
reflected in our indices
○ Transforms data the same way as backfills

Why are fast backfills important?
● Quickly iterate on document schemas
● Test new ways to analyze document fields
● Create multiple clusters of the same data for testing
● Fix data issues rapidly

Hive
● Pulled data from postgres with sqoop into Hive
● A series of transformations to
○ Join thing and data tables
○ Rotate the keys into columns
○ Store the final result as Parquet in S3
● Fusion/Spark fetched S3 files and indexed data into Solr
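The "join thing and data tables, rotate the keys into columns" step is a pivot from key/value rows to one wide record per thing. A minimal plain-Python sketch of that transformation follows; the column names and sample rows are assumptions for illustration, and the original pipeline ran this in Hive rather than Python.

```python
def pivot_things(thing_rows, data_rows):
    """Join thing rows with their key/value data rows into one record per thing.

    thing_rows: list of dicts, each with a "thing_id" plus fixed columns.
    data_rows:  list of (thing_id, key, value) tuples.
    """
    docs = {row["thing_id"]: dict(row) for row in thing_rows}
    for thing_id, key, value in data_rows:
        # Inner-join semantics: data rows without a matching thing are dropped.
        if thing_id in docs:
            docs[thing_id][key] = value  # the key becomes a column
    return list(docs.values())


# Hypothetical sample rows.
thing_rows = [{"thing_id": 1, "ups": 10}, {"thing_id": 2, "ups": 3}]
data_rows = [
    (1, "title", "Cat Fist Bumping"),
    (1, "subreddit", "aww"),
    (2, "title", "Dog getting love"),
    (3, "title", "orphaned row"),
]
documents = pivot_things(thing_rows, data_rows)
```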

Issues with v1
● Several weeks to transform data
○ Afraid of changing the schema
● Many stages of transformation, making it hard to debug and to tell
how far upstream a data transformation issue originated
○ Hard to ensure the end result was correct

Thing Service
● Search Service as the transformer and indexer of data
○ Fetches the latest data from the Thing Service
● Special logic in Thing Service made it easier to handle postgres data
○ Score of links, comments
○ Converting to actual data types (booleans, fullnames)
● Cut backfill time from multiple weeks to a single week with
parallelization

Issues with v2
● Reliant upon a shared production service for what should be an offline
job
○ We’ve pushed the thing service too hard with our backfills,
affecting other services that rely upon it
● Other initiatives highlighted how slow our ingestion could get
○ HVTs (augmenting links with high value tokens from comments)
○ Attempts to index comment data

Spark
● Running our own postgres replicas from wal-e backups in S3
● Spark pulls data directly from postgres and transforms the data
● Can horizontally scale ingestion to be faster
○ Postgres to speed up ingestion of data into Spark
○ Spark to speed up transformation and joining of data
● We can adjust ingestion parallelism by repartitioning at the end
● Cut backfill time significantly from multiple weeks to days

Random 100% CPU
spikes prevented us
from shipping new
search features...
Challenge #3

Redditors Issue Expensive Queries
● High Recall Queries
○ the, would, you, ifs, news, games
● Crazy Queries
○ (AFD+OR+CDU+OR+CSU+OR+FDP+OR+Grünen+OR+SPD+OR+"Die+Linke"+OR+Energiepolitik+OR+Gesetze~+OR+Kabinetts~+OR+Regierungs~+OR+Referentenentwurf)+(Energiehandel~+OR+Energiemanagement~+OR+Energiepreis~+OR+Energiesteuer~)
● These queries would take multiple seconds to complete, blocking a
significant number of CPU cores in the cluster

Cutting Queries Off
● Utilize timeAllowed in solrconfig.xml to prevent expensive queries
taking up all of your cluster’s resources
○ NOTE: timeAllowed is not a hard cutoff. From the Solr docs:
○ As this check is periodically performed, the actual time for which a
request can be processed before it is aborted would be marginally
greater than or equal to the value of timeAllowed. If the request
consumes more time in other stages, e.g., custom components,
etc., this parameter is not expected to abort the request.
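Besides a default in solrconfig.xml, `timeAllowed` can also be sent as a request parameter. The sketch below builds such a request URL and checks the `partialResults` flag that Solr sets in the response header when the limit was hit. The base URL, the collection name, the 500 ms value, and the helper names are assumptions for illustration.

```python
from urllib.parse import urlencode


def build_select_url(base_url, collection, query, time_allowed_ms=500):
    """Build a Solr /select URL that caps search time via timeAllowed (ms)."""
    params = {"q": query, "timeAllowed": time_allowed_ms, "wt": "json"}
    return f"{base_url}/{collection}/select?{urlencode(params)}"


def is_partial(response):
    """Return True if a parsed Solr JSON response reports partial results."""
    return bool(response.get("responseHeader", {}).get("partialResults", False))


# Hypothetical host and collection name.
url = build_select_url("http://localhost:8983/solr", "posts", "the")

# Hand-written example of a response for a query that hit the time limit.
cut_off_response = {
    "responseHeader": {"status": 0, "partialResults": True},
    "response": {"numFound": 0, "docs": []},
}
was_cut_off = is_partial(cut_off_response)
```

Checking the flag lets the caller decide whether to show the truncated results, retry with a simpler query, or return an error instead of silently serving an incomplete result set.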

Multi-Cluster Solr Environment
● One cluster per collection
● Hardware isolation: one collection’s issues won’t affect other
collections
● Scale each collection independently
● Balancing becomes really simple
○ Each machine has an equal number of replicas
○ Ensure AZ and shard awareness

Solr 7.5 Autoscaling
● Solr 7.5 includes new policies that allow us to equally distribute
replicas by
○ Arbitrary properties
○ Collection
○ Cluster
● Turn Solr Scaling into a one step process

Thank you!
Anupama Joshi
anupama@reddit.com
linkedin.com/in/anupamajoshi
Jerry Bao
jerry.bao@reddit.com
linkedin.com/in/thejerrybao
PS: We’re Hiring!
reddit.com/jobs
