
Indexing all the things: Building your search engine in Python

  1. Indexing all the things: Building your search engine in Python
     Joe Cabrera, @greedoshotlast
  2. Hi, I'm Joe Cabrera
     ● Senior Backend Engineer at Jopwell
     ● Python programmer since 2009
     ● Building scalable search and backend systems for about 2 years
     ● Author of various open source Python projects
  3. Our database setup
  4. Trying to find Carmen Sandiego in SQL
     ● We could start by using LIKE with wildcards (~91 sec / 1M rec, low accuracy):

       SELECT * FROM profile JOIN profile_location JOIN location
       WHERE first_name LIKE '%Carmen%' AND last_name LIKE '%Sandiego%'

     ● But wait, we could also use full-text search (~8 min / 1M rec, higher accuracy):

       SELECT * FROM profile JOIN profile_location JOIN location
       WHERE first_name || ' ' || last_name @@ 'Carmen Sandiego'
  5. Great, but...
     ● MySQL has very limited support for full-text search
     ● Custom features may not be supported if you are using Postgres on RDS
     ● You start accumulating lots of long custom SQL queries
     ● We're also going to have to manage our own database sharding
  6. Enter Elasticsearch
     ● Built on top of the Lucene search library
     ● Designed to be distributed
     ● Full-text indexing and search engine
     ● Features a common interface: JSON over HTTP
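     Since everything goes over HTTP, a connectivity check is only a few lines with
     the official elasticsearch-py client (a minimal sketch; the hostname is a
     placeholder, matching the one used in the later slides):

       from elasticsearch import Elasticsearch

       # Placeholder hostname; point this at your own cluster.
       es_instance = Elasticsearch('https://my_elasticsearchserver')
       print(es_instance.ping())  # True if the cluster answered over HTTP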
  7. {
       "doc": {
         "first_name": "Carmen",
         "last_name": "Sandiego",
         "locations": ["New York", "London", "Tangier"],
         "location_id": [1, 2, 3]
       }
     }
  8. Flattening our documents

       def index_single_doc(field_names, profile):
           # Copy the requested model fields into a flat dict for indexing.
           index = {}
           for field_name in field_names:
               field_value = getattr(profile, field_name)
               index[field_name] = field_value
           return index
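     In use, it returns exactly the flat shape shown on the previous slide (a usage
     sketch; the field list is an assumption based on the profile example):

       data = index_single_doc(['first_name', 'last_name'], profile)
       # {'first_name': 'Carmen', 'last_name': 'Sandiego'}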
  9. What about data in related tables?

       # Collect the names and ids of every related location row.
       location_names = []
       location_ids = []
       for p in profile.locations.all():
           location_names.append(str(p))
           location_ids.append(p.id)
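     Those lists then get attached to the flattened document so it matches the example
     document from slide 7 (a sketch reusing the names from the previous two slides):

       data = index_single_doc(['first_name', 'last_name'], profile)
       data['locations'] = location_names
       data['location_id'] = location_ids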
  10. Indexing our document into Elasticsearch

        def add_doc(self, data, doc_id):
            es_instance = Elasticsearch('https://my_elasticsearchserver')
            # refresh=True makes the document searchable immediately.
            es_instance.index(index='my-index', doc_type='db-text', id=doc_id,
                              body=data, refresh=True)
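      Elasticsearch will create the index on first write with dynamic mappings, but
      you can also create it explicitly up front (a sketch; this mapping is an
      assumption inferred from the example document, not something the talk shows):

        es_instance.indices.create(index='my-index', body={
            'mappings': {
                'db-text': {
                    'properties': {
                        'first_name': {'type': 'text'},
                        'last_name': {'type': 'text'},
                        'locations': {'type': 'text'},
                        'location_id': {'type': 'integer'},
                    }
                }
            }
        })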
  11. Getting the data back out of Elasticsearch
      ● We'll first need to send our query to Elasticsearch
      ● Then grab the doc ids from the search results
      ● Use the doc ids to load the profiles from our database for the final search
        result response
  12. Performing our query

        query_json = {
            'query': {
                'simple_query_string': {
                    'query': 'Carmen Sandiego',
                    'fields': ['first_name', 'last_name'],
                }
            }
        }
        es_results = es_instance.search(index=self.index, body=query_json,
                                        size=limit, from_=offset)
  13. {
        "took": 63,
        "timed_out": false,
        "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
        "hits": {
          "total": 1,
          "max_score": null,
          "hits": [
            {
              "_index": "my-index",
              "_type": "db-text",
              "_id": "1",
              "sort": [0],
              "_score": null,
              "_source": {
                "first_name": "Carmen",
                "last_name": "Sandiego",
                "locations": ["New York", "London", "Tangier"],
                "location_id": [1, 2, 3]
              }
            }
          ]
        }
      }
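      The doc ids the next slide relies on come straight out of the hits (a small
      sketch; raw_ids is the name the next slide uses):

        raw_ids = [hit['_id'] for hit in es_results['hits']['hits']]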
  14. Populating the search results

        search_results = []
        for _id in raw_ids:
            try:
                search_results.append(Profile.objects.get(pk=_id))
            except Profile.DoesNotExist:
                # The index can briefly reference rows already deleted from the DB.
                continue
        return search_results
  15. How do we make Elasticsearch production ready?
  16. Using celery to distribute the task of indexing
      ● Celery is a distributed task queuing system
      ● Since indexing is a memory-bound task, we don't want it tying up server
        resources
      ● For the initial load, we'll break up indexing so each new document becomes a
        separate task, all controlled by a larger master task (see the code on the
        next slide)
      ● New documents can then be added incrementally to our existing index by firing
        off a separate task
  17.   from celery import group, task

        @task
        def index_all_docs():
            ...  # elided on the slide: collect profile_ids for every profile to index
            # Fan out one immutable subtask per profile and fire them as a group.
            group(process_doc.si(profile_id) for profile_id in profile_ids)()

        @task
        def process_doc(profile_id):
            ...  # elided on the slide: load, flatten, and index this one profile
  18. How do we keep these datastores in sync?
  19. Syncing data to Elasticsearch

        def save(self, *args, **kwargs):
            super(Profile, self).save(*args, **kwargs)
            # Queue the indexing task by name so the web process never blocks on it.
            celery.current_app.send_task('search_indexer.add_doc', (self.id,))
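      On the worker side, a task registered under that name can reuse the helpers
      from the earlier slides (a sketch; the task body and field list are
      assumptions, and Profile, index_single_doc, and es_instance are the names
      introduced on previous slides):

        from celery import task

        @task(name='search_indexer.add_doc')
        def add_doc(profile_id):
            # Load the row, flatten it, and index it under its primary key.
            profile = Profile.objects.get(pk=profile_id)
            data = index_single_doc(['first_name', 'last_name'], profile)
            es_instance.index(index='my-index', doc_type='db-text',
                              id=profile.id, body=data, refresh=True)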
  20. Great, but what about partial updates?
  21.   import json

        def update_doc(self, doc_id, data):
            es_instance = Elasticsearch('https://my_elasticsearchserver')
            # The Update API merges only the fields inside the 'doc' wrapper.
            es_instance.update(index='my-index', doc_type='db-text', id=doc_id,
                               body={'doc': json.loads(data)}, refresh=True)
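      Calling it then ships only the changed fields (a usage sketch; indexer stands
      in for whatever object these methods live on, and the payload is hypothetical):

        indexer.update_doc(doc_id=1, data='{"locations": ["New York", "Paris"]}')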
  22. Resources
      ● Code examples from today - http://bit.ly/python-search
      ● Elasticsearch-py - https://github.com/elastic/elasticsearch-py
      ● Elasticsearch official docs - https://www.elastic.co/guide/index.html
      ● Celery - https://github.com/celery/celery/
  23. Thank you! Watch @greedoshotlast for these slides.