Indexing all the things: Building your search engine in Python
Joe Cabrera
@greedoshotlast
Hi, I’m Joe Cabrera
● Senior Backend Engineer at Jopwell
● Python Programmer since 2009
● Building scalable search and backend
systems for about 2 years
● Author of various open source Python
projects
Our database setup
Trying to find Carmen Sandiego in SQL
● We could start by using LIKE with wildcards ~ 91 sec / 1M rec, low accuracy
SELECT * FROM profile JOIN profile_location JOIN location WHERE
first_name LIKE '%Carmen%' AND last_name LIKE '%Sandiego%'
● But wait, we could also use full-text search ~ 8 min / 1M rec, higher accuracy
SELECT * FROM profile JOIN profile_location JOIN location WHERE
first_name || ' ' || last_name @@ 'Carmen Sandiego'
Great, but...
● MySQL has very limited support for full-text search
● Custom features may not be supported if you are using Postgres RDS
● You start getting lots of long custom SQL queries
● We’re going to have to manage our own database sharding
Enter Elasticsearch
● Built on top of the Lucene search library
● Designed to be distributed
● Full-text indexing and search engine
● Features a common interface: JSON over HTTP
{
  "doc" : {
    "first_name": "Carmen",
    "last_name": "Sandiego",
    "locations": [
      "New York",
      "London",
      "Tangier"
    ],
    "location_id": [
      1,
      2,
      3
    ]
  }
}
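Because the interface is just JSON over HTTP, you don’t even need a dedicated client library. A minimal sketch using requests (the server URL, index name, doc type, and id are placeholders matching the later examples; note the index API takes the fields directly, while the outer "doc" wrapper above is the body format used for partial updates):

import requests

# Hypothetical endpoint: /<index>/<doc_type>/<id>
doc = {
    "first_name": "Carmen",
    "last_name": "Sandiego",
    "locations": ["New York", "London", "Tangier"],
    "location_id": [1, 2, 3],
}
resp = requests.put(
    "https://my_elasticsearchserver/my-index/db-text/1",
    json=doc,
)
print(resp.json())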
Flattening our documents
def index_single_doc(field_names, profile):
    # Copy each requested model field into a flat dict for indexing
    index = {}
    for field_name in field_names:
        field_value = getattr(profile, field_name)
        index[field_name] = field_value
    return index
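Usage is a one-liner; the field names and profile object here are hypothetical:

doc = index_single_doc(['first_name', 'last_name'], profile)
# {'first_name': 'Carmen', 'last_name': 'Sandiego'}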
What about data in related tables?
# Flatten the many-to-many location relation into parallel lists
location_names = []
location_ids = []
for p in profile.locations.all():
    location_names.append(str(p))
    location_ids.append(p.id)
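These flattened lists can then be merged into the document built by index_single_doc; a sketch, reusing the hypothetical field names from above:

doc = index_single_doc(['first_name', 'last_name'], profile)
doc['locations'] = location_names
doc['location_id'] = location_ids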
Indexing our document into Elasticsearch
from elasticsearch import Elasticsearch

def add_doc(self, data, doc_id):
    es_instance = Elasticsearch('https://my_elasticsearchserver')
    # refresh=True makes the document searchable immediately
    es_instance.index(index='my-index', doc_type='db-text', id=doc_id,
                      body=data, refresh=True)
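Assuming add_doc lives on a small indexer class alongside the other helpers in this talk, indexing a flattened profile might look like:

indexer.add_doc(doc, doc_id=profile.id)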
Getting the data back out of Elasticsearch
● We’ll first need to perform our query against Elasticsearch
● Then grab the doc ids from the search results
● Use the doc ids to load the profiles from our database for the final search result response
Performing our query
query_json = {'query': {'simple_query_string': {'query': 'Carmen Sandiego',
                                                'fields': ['first_name', 'last_name']}}}
es_results = es_instance.search(index=self.index,
                                body=query_json,
                                size=limit,
                                from_=offset)
{
  "took" : 63,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : null,
    "hits" : [ {
      "_index" : "my-new-index",
      "_type" : "db-text",
      "_id" : "1",
      "sort": [0],
      "_score" : null,
      "_source": {"first_name": "Carmen", "last_name": "Sandiego",
                  "locations": ["New York", "London", "Tangier"],
                  "location_id": [1, 2, 3]}
    } ]
  }
}
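The populate step on the next slide needs the database ids back out of this response; a minimal sketch, given the es_results dict returned by search():

# Collect the primary keys Elasticsearch handed back
raw_ids = [hit['_id'] for hit in es_results['hits']['hits']]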
Populating the search results
search_results = []
for _id in raw_ids:
    try:
        search_results.append(Profile.objects.get(pk=_id))
    except Profile.DoesNotExist:
        # Skip ids deleted from the database since they were indexed
        pass
return search_results
How do we make Elasticsearch production-ready?
Using Celery to distribute the task of indexing
● Celery is a distributed task queuing system
● Since indexing is a memory-bound task, we don’t want it tying up server resources
● We’ll break the initial indexing of all our documents into Elasticsearch into one task per document, controlled by a larger master task
● New documents can be added incrementally to our existing index by firing off a separate task
from celery import group, task

@task
def index_all_docs():
    ...
    group(process_doc.si(profile_id) for profile_id in profile_ids)()

@task
def process_doc(profile_id):
    ...
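The body of process_doc is elided on the slide; one plausible sketch composes the helpers from earlier (the field list and the indexer object are assumptions):

@task
def process_doc(profile_id):
    # Load, flatten, and index a single profile
    profile = Profile.objects.get(pk=profile_id)
    data = index_single_doc(['first_name', 'last_name'], profile)
    indexer.add_doc(data, doc_id=profile.id)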
How do we keep these datastores in sync?
Syncing data to Elasticsearch
import celery

def save(self, *args, **kwargs):
    super(Profile, self).save(*args, **kwargs)
    # Queue indexing asynchronously so the save itself stays fast
    celery.current_app.send_task('search_indexer.add_doc',
                                 (self.id,))
Great, but what about partial updates?
import json

from elasticsearch import Elasticsearch

def update_doc(self, doc_id, data):
    es_instance = Elasticsearch('https://my_elasticsearchserver')
    # Send only the changed fields wrapped in a "doc" envelope
    es_instance.update(index='my-new-index', doc_type='db-text', id=doc_id,
                       body={'doc': json.loads(data)}, refresh=True)
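A hypothetical call, pushing just the changed field:

indexer.update_doc(doc_id=1, data='{"locations": ["New York", "Paris"]}')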
Resources
● Code examples from today - http://bit.ly/python-search
● Elasticsearch-py - https://github.com/elastic/elasticsearch-py
● Elasticsearch official docs - https://www.elastic.co/guide/index.html
● Celery - https://github.com/celery/celery/
Thank you!
Watch @greedoshotlast for these slides
