Indexing all the things: Building your search engine in python

Joe Cabrera, Software Engineer at Hearst
@greedoshotlast
Hi, I’m Joe Cabrera
● Senior Backend Engineer at Jopwell
● Python Programmer since 2009
● Building scalable search and backend
systems for about 2 years
● Author of various open source Python
projects
Our database setup
Trying to find Carmen Sandiego in SQL
● We could start with LIKE and wildcards: ~91 sec per 1M records, low accuracy
SELECT * FROM profile JOIN profile_location JOIN location
WHERE first_name LIKE '%Carmen%' AND last_name LIKE '%Sandiego%'
● But wait, we could also use full-text search: ~8 min per 1M records, higher accuracy
SELECT * FROM profiles JOIN profile_location JOIN location
WHERE (first_name || ' ' || last_name) @@ 'Carmen Sandiego'
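To make the "low accuracy" point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The single-table schema and the sample rows are assumptions made up for illustration; they are not the database from the talk.

```python
import sqlite3

# Hypothetical single-table schema, only meant to illustrate wildcard matching.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE profile (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)
conn.executemany(
    "INSERT INTO profile (first_name, last_name) VALUES (?, ?)",
    [("Carmen", "Sandiego"), ("Carmencita", "Sandiegos"), ("Waldo", "Odlaw")],
)


def find_by_name(connection, first, last):
    """Substring search with LIKE; wildcards on both sides match anywhere in the value."""
    rows = connection.execute(
        "SELECT first_name, last_name FROM profile "
        "WHERE first_name LIKE ? AND last_name LIKE ? ORDER BY id",
        (f"%{first}%", f"%{last}%"),
    )
    return rows.fetchall()


matches = find_by_name(conn, "Carmen", "Sandiego")
# Both the exact profile and the near-miss "Carmencita Sandiegos" come back.
```

The query cannot tell a whole-word match from a substring match, and it has no notion of ranking, which is the gap a search engine fills.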
Great, but...
● MySQL has very limited support for full-text search
● Custom features may not be supported if you are using Postgres RDS
● You start getting lots of long custom SQL queries
● We’re going to have to manage our own database sharding
Enter Elasticsearch
● Built on top of the Lucene search library
● Designed to be distributed
● Full-text indexing and search engine
● Features a common interface: JSON over HTTP
{
  "doc" : {
    "first_name": "Carmen",
    "last_name": "Sandiego",
    "locations": [
      "New York",
      "London",
      "Tangier"
    ],
    "location_id": [
      1,
      2,
      3
    ]
  }
}
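Since the interface is just JSON over HTTP, a request can be assembled with the standard library alone. The sketch below only builds the request and does not send it. The helper name build_index_request is hypothetical, and the index/type/id URL layout is an assumption that fits older Elasticsearch versions that still use mapping types (as the doc_type argument elsewhere in these slides suggests).

```python
import json
import urllib.request


def build_index_request(base_url, index, doc_type, doc_id, document):
    """Build (but do not send) a PUT request carrying one JSON document."""
    # Assumed URL layout: <host>/<index>/<type>/<id>
    url = f"{base_url}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_index_request(
    "https://my_elasticsearchserver",
    "my-new-index",
    "db-text",
    1,
    {
        "first_name": "Carmen",
        "last_name": "Sandiego",
        "locations": ["New York", "London", "Tangier"],
        "location_id": [1, 2, 3],
    },
)
```

In practice the elasticsearch-py client wraps this plumbing, but seeing the raw request shows there is nothing exotic underneath.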
Flattening our documents
def index_single_doc(field_names, profile):
    # Copy the selected model attributes into a flat dict for indexing
    index = {}
    for field_name in field_names:
        field_value = getattr(profile, field_name)
        index[field_name] = field_value
    return index
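A quick usage sketch of the flattening step. SimpleNamespace stands in for the Django model instance, and the password_hash attribute is a made-up example of a field we would not want in the index; both are assumptions for illustration.

```python
from types import SimpleNamespace


def index_single_doc(field_names, profile):
    """Copy only the listed attributes of profile into a flat dict."""
    return {name: getattr(profile, name) for name in field_names}


# Stand-in for a model instance (hypothetical attributes).
profile = SimpleNamespace(
    first_name="Carmen",
    last_name="Sandiego",
    password_hash="not-for-search",
)

doc = index_single_doc(["first_name", "last_name"], profile)
```

Passing an explicit list of field names keeps the index limited to what search actually needs.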
What about data in related tables?
# Collect the names and ids of the profile's related locations
location_names = []
location_ids = []
for p in profile.locations.all():
    location_names.append(str(p))
    location_ids.append(p.id)
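Putting the two pieces together, related rows can be folded into the same flat document. This is a sketch under assumptions: the Location, RelatedManager and Profile classes below are plain-Python stand-ins for the Django models and their related manager, and build_search_doc is a hypothetical helper name.

```python
from dataclasses import dataclass


@dataclass
class Location:
    id: int
    name: str

    def __str__(self):
        return self.name


class RelatedManager:
    """Stand-in for a Django related manager; only all() is mimicked."""

    def __init__(self, items):
        self._items = list(items)

    def all(self):
        return list(self._items)


@dataclass
class Profile:
    first_name: str
    last_name: str
    locations: RelatedManager


def build_search_doc(profile):
    """Flatten a profile and its related locations into one document."""
    related = profile.locations.all()
    return {
        "first_name": profile.first_name,
        "last_name": profile.last_name,
        "locations": [str(loc) for loc in related],
        "location_id": [loc.id for loc in related],
    }


carmen = Profile(
    "Carmen",
    "Sandiego",
    RelatedManager(
        [Location(1, "New York"), Location(2, "London"), Location(3, "Tangier")]
    ),
)
search_doc = build_search_doc(carmen)
```

The result has the same shape as the JSON document shown earlier: parallel lists of location names and ids instead of a join.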
Indexing our document into Elasticsearch
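from elasticsearch import Elasticsearch
def add_doc(self, data, doc_id):
    # Index (create or overwrite) one document; refresh=True makes it searchable right away
    es_instance = Elasticsearch('https://my_elasticsearchserver')
    es_instance.index(index='my-new-index', doc_type='db-text', id=doc_id,
                      body=data, refresh=True)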
Getting the data back out of Elasticsearch
● First we run our query against Elasticsearch
● Then we grab the doc ids from the search results
● Finally we use the doc ids to load the profiles from our database for the final
search result response
Performing our query
# Match the search text against both name fields
query_json = {'query': {'simple_query_string': {'query': 'Carmen Sandiego',
                                                'fields': ['first_name', 'last_name']}}}
# size and from_ page through the results
es_results = es_instance.search(index=self.index,
                                body=query_json,
                                size=limit,
                                from_=offset)
{
  "took" : 63,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : null,
    "hits" : [
      {
        "_index" : "my-new-index",
        "_type" : "db-text",
        "_id" : "1",
        "sort" : [0],
        "_score" : null,
        "_source" : {
          "first_name": "Carmen",
          "last_name": "Sandiego",
          "locations": ["New York", "London", "Tangier"],
          "location_id": [1, 2, 3]
        }
      }
    ]
  }
}
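The step of grabbing the doc ids out of a response like this one can be sketched as follows. The helper name extract_doc_ids is hypothetical, and the sample response is a trimmed copy of the one above.

```python
def extract_doc_ids(es_results):
    """Return the _id of every hit, in the order Elasticsearch ranked them."""
    hits = es_results.get("hits", {}).get("hits", [])
    return [hit["_id"] for hit in hits]


# Trimmed copy of the response shown above.
sample_response = {
    "took": 63,
    "timed_out": False,
    "hits": {
        "total": 1,
        "max_score": None,
        "hits": [
            {
                "_index": "my-new-index",
                "_type": "db-text",
                "_id": "1",
                "_score": None,
            }
        ],
    },
}

raw_ids = extract_doc_ids(sample_response)
```

Note that the ids come back as strings, and that an empty or malformed response simply yields an empty list.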
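Populating the search results
# raw_ids holds the "_id" values taken from the Elasticsearch hits
search_results = []
for _id in raw_ids:
    try:
        search_results.append(Profile.objects.get(pk=_id))
    except Profile.DoesNotExist:
        # The index can lag behind the database; skip ids that no longer exist
        continue
return search_results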
How do we make Elasticsearch production ready?
Using Celery to distribute the task of indexing
● Celery is a distributed task queuing system
● Since indexing is a memory-bound task, we don’t want it tying up server
resources
● We’ll split the initial indexing of all our documents into one task per
document, coordinated by a larger master task
● New documents can be added incrementally to our existing index by firing off a
separate task
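from celery import group, shared_task

@shared_task
def index_all_docs():
    ...
    # Fan out: one immutable process_doc signature per profile, run as a group
    group(process_doc.si(profile_id) for profile_id in profile_ids)()

@shared_task
def process_doc(profile_id):
    ...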
How do we keep these datastores in sync?
Syncing data to Elasticsearch
def save(self, *args, **kwargs):
    super(Profile, self).save(*args, **kwargs)
    # Queue a re-index of this profile once the database write has gone through
    celery.current_app.send_task('search_indexer.add_doc', (self.id,))
Great, but what about partial updates?
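import json
from elasticsearch import Elasticsearch
def update_doc(self, doc_id, data):
    # data is a JSON string with only the changed fields; they are merged into the stored document
    es_instance = Elasticsearch('https://my_elasticsearchserver')
    es_instance.update(index='my-new-index', doc_type='db-text', id=doc_id,
                       body={'doc': json.loads(data)}, refresh=True)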
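One way to produce the data for a partial update is to diff the stored fields against the current ones and send only what changed. This is a sketch, not the approach from the talk: build_partial_update and the sample dicts are hypothetical, and nothing here talks to Elasticsearch.

```python
import json

_MISSING = object()  # sentinel so a new field set to None still counts as a change


def build_partial_update(old_fields, new_fields):
    """Return an update body containing only the fields whose values changed."""
    changed = {
        name: value
        for name, value in new_fields.items()
        if old_fields.get(name, _MISSING) != value
    }
    return {"doc": changed}


stored = {
    "first_name": "Carmen",
    "last_name": "Sandiego",
    "locations": ["New York", "London"],
}
current = {
    "first_name": "Carmen",
    "last_name": "Sandiego",
    "locations": ["New York", "London", "Tangier"],
}

update_body = build_partial_update(stored, current)
# JSON string in the form update_doc above expects as its data argument
payload = json.dumps(update_body["doc"])
```

Only the locations field ends up in the payload, so the unchanged name fields are never re-sent.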
Resources
● Code examples from today - http://bit.ly/python-search
● Elasticsearch-py - https://github.com/elastic/elasticsearch-py
● Elasticsearch official docs - https://www.elastic.co/guide/index.html
● Celery - https://github.com/celery/celery/
Thank you!
watch @greedoshotlast for these slides