Indexing all the things: Building your search engine in Python
1. Indexing all the things: Building your search engine in Python
Joe Cabrera
@greedoshotlast
2. Joe Cabrera
Hi, I’m
● Senior Backend Engineer at Jopwell
● Python programmer since 2009
● Building scalable search and backend systems for about 2 years
● Author of various open source Python projects
4. Trying to find Carmen Sandiego in SQL
● We could start by using LIKE with wildcards ~ 91 sec / 1M rec, low accuracy
SELECT * FROM profile
JOIN profile_location ON profile_location.profile_id = profile.id
JOIN location ON location.id = profile_location.location_id
WHERE first_name LIKE '%Carmen%' AND last_name LIKE '%Sandiego%'
● But wait, we could also use full-text search ~ 8 min / 1M rec, higher accuracy
SELECT * FROM profile
JOIN profile_location ON profile_location.profile_id = profile.id
JOIN location ON location.id = profile_location.location_id
WHERE to_tsvector(first_name || ' ' || last_name) @@ plainto_tsquery('Carmen Sandiego')
5. Great, but...
● MySQL has very limited support for full-text search
● Custom features may not be supported if you are using Postgres on RDS
● You start accumulating lots of long custom SQL queries
● We’re going to have to manage our own database sharding
6. Enter Elasticsearch
● Built on top of the Lucene search library
● Designed to be distributed
● Full-text indexing and search engine
● Features a common interface: JSON over HTTP
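Because the interface really is just JSON over HTTP, a plain requests call is enough to index and query documents. A minimal sketch, assuming a local node on the default port 9200, a "profiles" index, and an Elasticsearch version where the generic _doc type is available:

import requests

# Index one document under id 1 in a hypothetical "profiles" index
resp = requests.put(
    "http://localhost:9200/profiles/_doc/1",
    json={"first_name": "Carmen", "last_name": "Sandiego"},
)
print(resp.json())

# Run a simple full-text match query against the same index
resp = requests.get(
    "http://localhost:9200/profiles/_search",
    json={"query": {"match": {"last_name": "Sandiego"}}},
)
print(resp.json()["hits"]["hits"])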
8. Flattening our documents
def index_single_doc(field_names, profile):
    # Copy each requested model field into a plain dict for Elasticsearch
    index = {}
    for field_name in field_names:
        field_value = getattr(profile, field_name)
        index[field_name] = field_value
    return index
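A quick usage sketch, with Profile standing in for the Django model from the talk and the field list as an assumption:

profile = Profile.objects.get(pk=42)  # hypothetical profile id
doc = index_single_doc(["first_name", "last_name"], profile)
# doc is now a plain dict, e.g. {"first_name": "Carmen", "last_name": "Sandiego"},
# ready to be serialized as JSON and sent to Elasticsearch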
9. What about data in related tables?
location_names = []
location_ids = []
# Walk the related locations and flatten them into two parallel lists
for p in profile.locations.all():
    location_names.append(str(p))
    location_ids.append(p.id)
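These lists can then be folded into the flattened document next to the scalar fields. A short sketch reusing index_single_doc from the previous slide (field names are illustrative):

doc = index_single_doc(["first_name", "last_name"], profile)
# Related-table data becomes ordinary JSON arrays on the flat document
doc["location_names"] = location_names
doc["location_ids"] = location_ids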
11. Getting the data back out of Elasticsearch
● We’ll first need to perform our query against Elasticsearch
● Then grab the doc ids from the search results
● Use the doc ids to load the profiles from our database for the final search result response (see the sketch after this list)
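A minimal sketch of those first two steps, again assuming the plain HTTP interface and a "profiles" index (the query body is illustrative):

import requests

resp = requests.get(
    "http://localhost:9200/profiles/_search",
    json={"query": {"match": {"last_name": "Sandiego"}}},
)
# Each hit's _id is the primary key we indexed the profile under
raw_ids = [hit["_id"] for hit in resp.json()["hits"]["hits"]]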
14. Populating the search results
search_results = []
for _id in raw_ids:
    try:
        search_results.append(Profile.objects.get(pk=_id))
    except Profile.DoesNotExist:
        # Skip ids that were deleted since they were indexed
        continue
return search_results
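Looking up each id one at a time costs one database query per hit. A hedged alternative is Django's in_bulk, which fetches everything in a single query and lets you restore Elasticsearch's ranking order yourself:

# One query; returns a {pk: Profile} dict for the ids that still exist
profiles = Profile.objects.in_bulk([int(_id) for _id in raw_ids])
# Rebuild the list in the order Elasticsearch ranked the hits
search_results = [profiles[int(_id)] for _id in raw_ids if int(_id) in profiles]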
15. How do we make Elasticsearch production ready?
16. Using Celery to distribute the task of indexing
● Celery is a distributed task queuing system
● Since indexing is a memory-bound task, we don’t want it tying up server resources
● We’ll break the initial indexing of all of our documents into Elasticsearch into separate per-document tasks controlled by a larger master task
● New documents can be added incrementally to our existing index by firing off a separate task
17. from celery import group, task

@task
def index_all_docs():
    ...
    # Fan out one indexing task per profile and run the group
    group(process_doc.si(profile_id) for profile_id in profile_ids)()

@task
def process_doc(profile_id):
    ...
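A fleshed-out sketch of what those two tasks might look like; the Profile model, the index_single_doc helper from earlier, and the HTTP endpoint are all assumptions carried over from the previous slides (modern Celery spells the decorator shared_task):

import requests
from celery import group, shared_task

from myapp.models import Profile  # hypothetical Django model from the talk

@shared_task
def index_all_docs():
    # Master task: collect every profile id and fan out one child task each
    profile_ids = Profile.objects.values_list("pk", flat=True)
    group(process_doc.si(profile_id) for profile_id in profile_ids)()

@shared_task
def process_doc(profile_id):
    # Child task: flatten a single profile and index it over HTTP
    profile = Profile.objects.get(pk=profile_id)
    doc = index_single_doc(["first_name", "last_name"], profile)  # from slide 8
    requests.put("http://localhost:9200/profiles/_doc/%d" % profile_id, json=doc)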