Indexing all the things:
Building your search
engine in Python
Joe Cabrera
@greedoshotlast
Hi, I’m Joe Cabrera
● Senior Backend Engineer at Jopwell
● Python Programmer since 2009
● Building scalable search and backend
systems for about 2 years
● Author of various open source Python
projects
Our database setup
Trying to find Carmen Sandiego in SQL
● We could start by using LIKE with wildcards ~ 91 sec / 1M rec, low accuracy
SELECT * FROM profile JOIN profile_location JOIN location WHERE
first_name LIKE '%Carmen%' AND last_name LIKE '%Sandiego%'
● But wait, we could also use full-text search ~ 8 min / 1M rec, higher accuracy
SELECT * FROM profile JOIN profile_location JOIN location WHERE
first_name || ' ' || last_name @@ 'Carmen Sandiego'
Great, but...
● MySQL has very limited support for full-text search
● Custom features may not be supported if you are using Postgres RDS
● You start getting lots of long custom SQL queries
● We’re going to have to manage our own database sharding
Enter Elasticsearch
● Built on top of the Lucene search library
● Designed to be distributed
● Full-text indexing and search engine
● Features a common interface: JSON over HTTP
{
  "doc" : {
    "first_name": "Carmen",
    "last_name": "Sandiego",
    "locations": [
      "New York",
      "London",
      "Tangier"
    ],
    "location_id": [
      1,
      2,
      3
    ]
  }
}
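To make the "JSON over HTTP" interface concrete, here is a minimal sketch that builds (but does not send) the kind of HTTP request that would index Carmen's fields. The `build_index_request` helper, the `localhost:9200` host, and the `/index/type/id` URL layout are assumptions for illustration; the exact path depends on your Elasticsearch version.

```python
import json
from urllib.request import Request


def build_index_request(base_url, index, doc_type, doc_id, document):
    """Build (without sending) an HTTP PUT that would index one JSON document."""
    url = f"{base_url}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return Request(url, data=body, method="PUT",
                   headers={"Content-Type": "application/json"})


document = {
    "first_name": "Carmen",
    "last_name": "Sandiego",
    "locations": ["New York", "London", "Tangier"],
    "location_id": [1, 2, 3],
}

# localhost:9200 is an assumed local node, not a value from the talk
request = build_index_request("http://localhost:9200", "my-new-index",
                              "db-text", 1, document)
```

Nothing here touches the network; passing `request` to `urllib.request.urlopen` would send it to a running cluster.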
Flattening our documents
def index_single_doc(field_names, profile):
    # Copy each requested attribute of the profile into a flat dict
    index = {}
    for field_name in field_names:
        field_value = getattr(profile, field_name)
        index[field_name] = field_value
    return index
What about data in related tables?
location_names = []
location_ids = []
# Collect the name and id of every location related to this profile
for p in profile.locations.all():
    location_names.append(str(p))
    location_ids.append(p.id)
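The two flattening steps can be combined into one function that produces the document shape shown earlier. This is only a sketch: the `Profile` and `Location` dataclasses are stand-ins for the real ORM models (so `locations` is a plain list rather than a related manager with `.all()`), and `build_search_doc` is a hypothetical name.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Location:
    # Stand-in for the real location model
    id: int
    name: str

    def __str__(self):
        return self.name


@dataclass
class Profile:
    # Stand-in for the real profile model
    first_name: str
    last_name: str
    locations: List[Location] = field(default_factory=list)


def build_search_doc(profile, field_names):
    """Flatten a profile and its related locations into one dict."""
    doc = {name: getattr(profile, name) for name in field_names}
    doc["locations"] = [str(loc) for loc in profile.locations]
    doc["location_id"] = [loc.id for loc in profile.locations]
    return doc


carmen = Profile("Carmen", "Sandiego", [
    Location(1, "New York"), Location(2, "London"), Location(3, "Tangier"),
])
doc = build_search_doc(carmen, ["first_name", "last_name"])
```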
Indexing our document into Elasticsearch
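from elasticsearch import Elasticsearch

def add_doc(self, data, doc_id):
    es_instance = Elasticsearch('https://my_elasticsearchserver')
    # refresh=True makes the document searchable as soon as the call returns
    es_instance.index(index='my-new-index', doc_type='db-text', id=doc_id, body=data, refresh=True)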
Getting the data back out of Elasticsearch
● We’ll first need to perform our query to Elasticsearch
● Then grab the doc ids from the search results
● Use the doc ids to load the profiles from our database for the final search
result response
Performing our query
query_json = {'query': {'simple_query_string': {'query': 'Carmen Sandiego',
                                                'fields': ['first_name', 'last_name']}}}
# size and from_ page through the results
es_results = es_instance.search(index=self.index,
                                body=query_json,
                                size=limit,
                                from_=offset)
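The `limit` and `offset` values are not defined on the slide. One plausible way to derive them from a page number is sketched below; `build_search_kwargs` and its 1-based `page` parameter are assumptions, not part of the talk.

```python
def build_search_kwargs(text, fields, page, page_size):
    """Turn a 1-based page number into the size/from_ arguments for search()."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    query_json = {
        "query": {
            "simple_query_string": {"query": text, "fields": list(fields)}
        }
    }
    return {
        "body": query_json,
        "size": page_size,                  # how many hits to return
        "from_": (page - 1) * page_size,    # how many hits to skip
    }


kwargs = build_search_kwargs("Carmen Sandiego",
                             ["first_name", "last_name"],
                             page=3, page_size=20)
```

The resulting dict could then be unpacked into the search call, for example `es_instance.search(index=..., **kwargs)`.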
{
  "took" : 63,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : null,
    "hits" : [
      {
        "_index" : "my-new-index",
        "_type" : "db-text",
        "_id" : "1",
        "sort": [0],
        "_score" : null,
        "_source": {"first_name": "Carmen", "last_name": "Sandiego",
                    "locations": ["New York", "London", "Tangier"],
                    "location_id": [1, 2, 3]}
      }
    ]
  }
}
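Grabbing the doc ids from a response shaped like the one above comes down to walking `hits.hits`. A minimal sketch, where `extract_doc_ids` is a hypothetical helper name and the response is a trimmed copy of the example:

```python
def extract_doc_ids(es_results):
    """Return the _id of every hit, in the order Elasticsearch ranked them."""
    hits = es_results.get("hits", {}).get("hits", [])
    return [hit["_id"] for hit in hits]


# Trimmed copy of the example response
es_results = {
    "took": 63,
    "timed_out": False,
    "hits": {
        "total": 1,
        "max_score": None,
        "hits": [
            {"_index": "my-new-index", "_type": "db-text", "_id": "1",
             "_source": {"first_name": "Carmen", "last_name": "Sandiego"}},
        ],
    },
}

raw_ids = extract_doc_ids(es_results)
```

Note that the ids come back as strings, so they may need converting before being used as integer primary keys.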
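Populating the search results
# Function name is illustrative; the slide shows only the body
def load_profiles(raw_ids):
    search_results = []
    for _id in raw_ids:
        try:
            search_results.append(Profile.objects.get(pk=_id))
        except Profile.DoesNotExist:
            # The index can reference a profile that no longer exists in the database
            continue
    return search_results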
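Loading profiles one at a time costs one database query per hit. An alternative, sketched here under assumptions, is to fetch all matching rows in one query and then restore the Elasticsearch ranking. `order_profiles_by_rank` is a hypothetical helper, it assumes integer primary keys, and the plain dict stands in for whatever bulk lookup the ORM provides (in Django, `Profile.objects.in_bulk(ids)` returns such a mapping).

```python
def order_profiles_by_rank(raw_ids, profiles_by_id):
    """Return profiles in search-rank order, skipping ids missing from the database."""
    ordered = []
    for raw_id in raw_ids:
        # Elasticsearch returns ids as strings; integer primary keys are assumed here
        profile = profiles_by_id.get(int(raw_id))
        if profile is not None:
            ordered.append(profile)
    return ordered


# Stand-in for the result of a single bulk database lookup
profiles_by_id = {1: "Carmen Sandiego", 3: "Waldo"}
search_results = order_profiles_by_rank(["3", "2", "1"], profiles_by_id)
```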
How do we make Elasticsearch production ready?
Using Celery to distribute the task of indexing
● Celery is a distributed task queuing system
● Since indexing is a memory-bound task we don’t want it tying up server
resources
● We’ll split the initial indexing of all our documents into one task per
document, controlled by a larger master task
● New documents can be added incrementally to our existing index by firing off a
separate task
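from celery import group, shared_task

@shared_task
def index_all_docs():
    ...
    # Fan out one immutable process_doc task per profile and run them as a group
    group(process_doc.si(profile_id) for profile_id in profile_ids)()

@shared_task
def process_doc(profile_id):
    ...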
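The master-task pattern can be tried locally without a broker. The sketch below swaps Celery for a standard-library thread pool purely as a stand-in, so it shows the shape of the fan-out (one unit of work per profile id) rather than Celery's actual behaviour. The body of `process_doc` is a placeholder, since the talk does not show it.

```python
from concurrent.futures import ThreadPoolExecutor


def process_doc(profile_id):
    # Placeholder body: the real task would load the profile and index it
    return {"profile_id": profile_id, "indexed": True}


def index_all_docs(profile_ids, max_workers=4):
    """Run one process_doc call per profile id, like the group of subtasks."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as the input ids
        return list(pool.map(process_doc, profile_ids))


results = index_all_docs([1, 2, 3])
```

With real Celery workers the subtasks run in separate processes or machines, which is what keeps indexing from tying up the web servers.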
How do we keep these datastores in sync?
Syncing data to Elasticsearch
import celery

def save(self, *args, **kwargs):
    super(Profile, self).save(*args, **kwargs)
    # Queue re-indexing only after the database save has succeeded
    celery.current_app.send_task('search_indexer.add_doc', (self.id,))
Great, but what about partial updates?
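import json
from elasticsearch import Elasticsearch

def update_doc(self, doc_id, data):
    es_instance = Elasticsearch('https://my_elasticsearchserver')
    # A partial update sends only the changed fields, wrapped in a 'doc' key
    es_instance.update(index='my-new-index', doc_type='db-text', id=doc_id,
                       body={'doc': json.loads(data)}, refresh=True)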
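One way to decide what goes into a partial update is to compare the previously indexed document with the freshly flattened one and send only the differences. This is a sketch under assumptions: `build_partial_update` is a hypothetical helper, and it does not handle fields that were removed, because a partial update merges fields rather than deleting them.

```python
def build_partial_update(old_doc, new_doc):
    """Return an update body containing only fields that are new or changed."""
    changed = {
        key: value
        for key, value in new_doc.items()
        if key not in old_doc or old_doc[key] != value
    }
    return {"doc": changed}


old_doc = {"first_name": "Carmen", "last_name": "Sandiego",
           "locations": ["New York"]}
new_doc = {"first_name": "Carmen", "last_name": "Sandiego",
           "locations": ["New York", "London"]}

update_body = build_partial_update(old_doc, new_doc)
```

If nothing changed, the `doc` dict is empty and the update call can be skipped.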
Resources
● Code examples from today - http://bit.ly/python-search
● Elasticsearch-py - https://github.com/elastic/elasticsearch-py
● Elasticsearch official docs - https://www.elastic.co/guide/index.html
● Celery - https://github.com/celery/celery/
Thank you!
Watch @greedoshotlast for these slides
