Indexing all the things: Building your search engine in Python
Joe Cabrera
@greedoshotlast
Hi, I’m Joe Cabrera
● Senior Backend Engineer at Jopwell
● Python Programmer since 2009
● Building scalable search and backend
systems for about 2 years
● Author of various open source Python
projects
Our database setup
Trying to find Carmen Sandiego in SQL
● We could start by using LIKE with wildcards ~ 91 sec / 1M rec, low accuracy
SELECT * FROM profile JOIN profile_location JOIN location WHERE
first_name LIKE '%Carmen%' AND last_name LIKE '%Sandiego%'
● But wait, we could also use full-text search ~ 8 min / 1M rec, higher accuracy
SELECT * FROM profile JOIN profile_location JOIN location WHERE
first_name || ' ' || last_name @@ 'Carmen Sandiego'
Great, but...
● MySQL has very limited support for full-text search
● Custom features may not be supported if you are using Postgres RDS
● You start getting lots of long custom SQL queries
● We’re going to have to manage our own database sharding
Enter Elasticsearch
● Built on top of the Lucene search library
● Designed to be distributed
● Full-text indexing and search engine
● Features a common interface: JSON over HTTP
{
  "doc" : {
    "first_name": "Carmen",
    "last_name": "Sandiego",
    "locations": [
      "New York",
      "London",
      "Tangier"
    ],
    "location_id": [
      1,
      2,
      3
    ]
  }
}
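Because the interface is just JSON over HTTP, you don’t even need a dedicated client library. A minimal sketch using requests (the server URL, index name, doc type, and id are placeholders matching the later examples; note the index API takes the fields directly, while the outer "doc" wrapper above is the body format used for partial updates):

import requests

# Hypothetical endpoint: /<index>/<doc_type>/<id>
doc = {
    "first_name": "Carmen",
    "last_name": "Sandiego",
    "locations": ["New York", "London", "Tangier"],
    "location_id": [1, 2, 3],
}
resp = requests.put(
    "https://my_elasticsearchserver/my-index/db-text/1",
    json=doc,
)
print(resp.json())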
Flattening our documents
def index_single_doc(field_names, profile):
    # Copy each requested model field into a flat dict for indexing
    index = {}
    for field_name in field_names:
        field_value = getattr(profile, field_name)
        index[field_name] = field_value
    return index
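Usage is a one-liner; the field names and profile object here are hypothetical:

doc = index_single_doc(['first_name', 'last_name'], profile)
# {'first_name': 'Carmen', 'last_name': 'Sandiego'}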
What about data in related tables?
# Flatten the many-to-many location relation into parallel lists
location_names = []
location_ids = []
for p in profile.locations.all():
    location_names.append(str(p))
    location_ids.append(p.id)
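These flattened lists can then be merged into the document built by index_single_doc; a sketch, reusing the hypothetical field names from above:

doc = index_single_doc(['first_name', 'last_name'], profile)
doc['locations'] = location_names
doc['location_id'] = location_ids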
Indexing our document into Elasticsearch
from elasticsearch import Elasticsearch

def add_doc(self, data, doc_id):
    es_instance = Elasticsearch('https://my_elasticsearchserver')
    # refresh=True makes the document searchable immediately
    es_instance.index(index='my-index', doc_type='db-text', id=doc_id,
                      body=data, refresh=True)
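Assuming add_doc lives on a small indexer class alongside the other helpers in this talk, indexing a flattened profile might look like:

indexer.add_doc(doc, doc_id=profile.id)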
Getting the data back out of Elasticsearch
● We’ll first need to perform our query against Elasticsearch
● Then grab the doc ids from the search results
● Use the doc ids to load the profiles from our database for the final search result response
Performing our query
query_json = {'query': {'simple_query_string': {'query': 'Carmen Sandiego',
                                                'fields': ['first_name', 'last_name']}}}
es_results = es_instance.search(index=self.index,
                                body=query_json,
                                size=limit,
                                from_=offset)
{
  "took" : 63,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : null,
    "hits" : [ {
      "_index" : "my-new-index",
      "_type" : "db-text",
      "_id" : "1",
      "sort": [0],
      "_score" : null,
      "_source": {"first_name": "Carmen", "last_name": "Sandiego",
                  "locations": ["New York", "London", "Tangier"],
                  "location_id": [1, 2, 3]}
    } ]
  }
}
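The populate step on the next slide needs the database ids back out of this response; a minimal sketch, given the es_results dict returned by search():

# Collect the primary keys Elasticsearch handed back
raw_ids = [hit['_id'] for hit in es_results['hits']['hits']]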
Populating the search results
search_results = []
for _id in raw_ids:
    try:
        search_results.append(Profile.objects.get(pk=_id))
    except Profile.DoesNotExist:
        # Skip ids deleted from the database since they were indexed
        pass
return search_results
How do we make Elasticsearch production-ready?
Using Celery to distribute the task of indexing
● Celery is a distributed task queuing system
● Since indexing is a memory-bound task, we don’t want it tying up server resources
● We’ll break the initial indexing of all our documents into Elasticsearch into one task per document, controlled by a larger master task
● New documents can be added incrementally to our existing index by firing off a separate task
from celery import group, task

@task
def index_all_docs():
    ...
    group(process_doc.si(profile_id) for profile_id in profile_ids)()

@task
def process_doc(profile_id):
    ...
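The body of process_doc is elided on the slide; one plausible sketch composes the helpers from earlier (the field list and the indexer object are assumptions):

@task
def process_doc(profile_id):
    # Load, flatten, and index a single profile
    profile = Profile.objects.get(pk=profile_id)
    data = index_single_doc(['first_name', 'last_name'], profile)
    indexer.add_doc(data, doc_id=profile.id)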
How do we keep these datastores in sync?
Syncing data to Elasticsearch
import celery

def save(self, *args, **kwargs):
    super(Profile, self).save(*args, **kwargs)
    # Queue indexing asynchronously so the save itself stays fast
    celery.current_app.send_task('search_indexer.add_doc',
                                 (self.id,))
Great, but what about partial updates?
import json

from elasticsearch import Elasticsearch

def update_doc(self, doc_id, data):
    es_instance = Elasticsearch('https://my_elasticsearchserver')
    # Send only the changed fields wrapped in a "doc" envelope
    es_instance.update(index='my-new-index', doc_type='db-text', id=doc_id,
                       body={'doc': json.loads(data)}, refresh=True)
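A hypothetical call, pushing just the changed field:

indexer.update_doc(doc_id=1, data='{"locations": ["New York", "Paris"]}')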
Resources
● Code examples from today - http://bit.ly/python-search
● Elasticsearch-py - https://github.com/elastic/elasticsearch-py
● Elasticsearch official docs - https://www.elastic.co/guide/index.html
● Celery - https://github.com/celery/celery/
Thank you!
Watch @greedoshotlast for these slides
