Dive into full-text search using Python - Andrii Soldatenko, Wargaming.NET
1. Dive into full text search with Python
Andrii Soldatenko
18-19 September 2015
@a_soldatenko
2. About me:
• Lead QA Automation Engineer at
• Backend Python Developer at
• Speaker at PyCon Ukraine 2014
• Speaker at PyCon Belarus 2015
• @a_soldatenko
8. Simple sentences
1. The quick brown fox jumped over the lazy dog
2. Quick brown foxes leap over lazy dogs in summer
9. Inverted index
Term      Doc_1  Doc_2
-------------------------
Quick   |       |  X
The     |   X   |
brown   |   X   |  X
dog     |   X   |
dogs    |       |  X
fox     |   X   |
foxes   |       |  X
in      |       |  X
jumped  |   X   |
lazy    |   X   |  X
leap    |       |  X
over    |   X   |  X
quick   |   X   |
summer  |       |  X
the     |   X   |
-------------------------
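A minimal sketch of how such a table could be built in Python, assuming plain whitespace tokenization with no lowercasing or stemming. The names `DOCS` and `build_inverted_index` are illustrative and do not come from the talk.

```python
from collections import defaultdict

# The two sentences from the "Simple sentences" slide, keyed by document id.
DOCS = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Split on whitespace only, so "Quick" and "quick" stay distinct terms.
        for term in text.split():
            index[term].add(doc_id)
    return dict(index)


index = build_inverted_index(DOCS)
for term in sorted(index):
    print(term, sorted(index[term]))
```

Each printed row corresponds to one row of the table: the term followed by the documents in which it occurs.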
10. Inverted index
Term      Doc_1  Doc_2
-------------------------
brown   |   X   |  X
quick   |   X   |
-------------------------
Total   |   2   |  1
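The totals row can be reproduced by counting, for each document, how many query terms appear in its postings. The sketch below assumes the query is `quick brown` (the two terms shown in the table); `count_matches` is a hypothetical helper name, and real engines use more elaborate relevance scoring than a raw count.

```python
from collections import Counter, defaultdict

DOCS = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return dict(index)


def count_matches(index, query):
    """Count, per document, how many of the query terms it contains."""
    totals = Counter()
    for term in query.split():
        for doc_id in index.get(term, ()):
            totals[doc_id] += 1
    return totals


totals = count_matches(build_inverted_index(DOCS), "quick brown")
# Documents ordered from most to fewest matching terms.
ranked = totals.most_common()
print(ranked)
```

Doc_1 contains both terms and Doc_2 only `brown`, so Doc_1 ranks first under this naive measure.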
11. Inverted index: normalization
Term      Doc_1  Doc_2
-------------------------
brown   |   X   |  X
dog     |   X   |  X
fox     |   X   |  X
in      |       |  X
jump    |   X   |  X
lazy    |   X   |  X
over    |   X   |  X
quick   |   X   |  X
summer  |       |  X
the     |   X   |  X
-------------------------
Term      Doc_1  Doc_2
-------------------------
Quick   |       |  X
The     |   X   |
brown   |   X   |  X
dog     |   X   |
dogs    |       |  X
fox     |   X   |
foxes   |       |  X
in      |       |  X
jumped  |   X   |
lazy    |   X   |  X
leap    |       |  X
over    |   X   |  X
quick   |   X   |
summer  |       |  X
the     |   X   |
-------------------------
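A rough sketch of the normalization step, assuming lowercasing plus a small hand-written mapping (`NORMAL_FORMS`, hypothetical) that stands in for a real stemmer and synonym dictionary. The same normalization has to be applied to the query, otherwise `Quick` would still miss documents indexed under `quick`.

```python
from collections import defaultdict

DOCS = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

# Hypothetical lookup table: plural -> singular, past tense -> root, synonym -> root.
NORMAL_FORMS = {"foxes": "fox", "dogs": "dog", "jumped": "jump", "leap": "jump"}


def normalize(token):
    """Lowercase a token and reduce it to its normal form if one is known."""
    token = token.lower()
    return NORMAL_FORMS.get(token, token)


def build_index(docs):
    """Build an inverted index over normalized terms."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return dict(index)


def search_all(index, query):
    """Return ids of documents that contain every normalized query term."""
    postings = [index.get(normalize(t), set()) for t in query.split()]
    return set.intersection(*postings) if postings else set()


index = build_index(DOCS)
print(search_all(index, "Quick fox"))
```

With normalization in place, a query for `Quick fox` matches both documents, whereas the unnormalized index would match neither (`Quick` and `fox` never occur in the same document).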
16. Full text search in PostgreSQL
1. Creating tokens
2. Creating lexemes (normalization)
3. Storing preprocessed documents
4. Relevance ranking
17. Full text search in PostgreSQL
27 built-in configurations for 10 languages
Support of user-defined FTS configurations
Pluggable dictionaries, parsers
Inverted indexes
18. functions to convert normal text to tsvector
EXPLAIN SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat & rat'::tsquery;
QUERY PLAN
------------------------------------------
Result (cost=0.00..0.01 rows=1 width=0)
(1 row)
EXPLAIN SELECT 'fat & cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector; -- false
QUERY PLAN
------------------------------------------
Result (cost=0.00..0.01 rows=1 width=0)
(1 row)
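The `@@` match for AND-only queries can be imitated in a few lines of Python. This is only a sketch of the semantics, not how PostgreSQL implements it: it assumes the text is cast directly to a set of whitespace-separated lexemes (a plain `::tsvector` cast does not normalize words) and that the query contains nothing but `&`. The helper name `matches_all` is made up for illustration.

```python
def matches_all(document, query):
    """Imitate `'<document>'::tsvector @@ '<query>'::tsquery` for AND-only queries."""
    # A tsvector holds distinct lexemes, so a set is a reasonable stand-in.
    lexemes = set(document.split())
    required = [term.strip() for term in query.split("&")]
    return all(term in lexemes for term in required)


TEXT = "a fat cat sat on a mat and ate a fat rat"

cat_and_rat = matches_all(TEXT, "cat & rat")
fat_and_cow = matches_all(TEXT, "fat & cow")
print(cat_and_rat, fat_and_cow)
```

The first query matches because both lexemes are present; the second fails because `cow` is missing, in line with the `-- false` note on the slide.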
19. PostgreSQL: index management
CREATE FUNCTION notes_vector_update() RETURNS TRIGGER AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        NEW.search_index = to_tsvector('pg_catalog.english', COALESCE(NEW.name, ''));
    END IF;
    IF TG_OP = 'UPDATE' THEN
        -- IS DISTINCT FROM also catches changes to or from NULL, which <> would miss
        IF NEW.name IS DISTINCT FROM OLD.name THEN
            NEW.search_index = to_tsvector('pg_catalog.english', COALESCE(NEW.name, ''));
        END IF;
    END IF;
    RETURN NEW;
END
$$ LANGUAGE 'plpgsql';
20. PostgreSQL: stopwords
SELECT to_tsvector('english', 'in the list of stop words');
to_tsvector
----------------------------
'list':3 'stop':5 'word':6
/usr/pgsql-9.3/share/tsearch_data/english.stop
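The output above shows two effects: stop words disappear, but the positions of the remaining lexemes still count them. A rough Python imitation, assuming a tiny hand-picked stop list (`STOP_WORDS`, not the real contents of `english.stop`) and a naive plural-stripping function in place of PostgreSQL's stemmer:

```python
import re

# Illustrative subset only; the real list is read from english.stop.
STOP_WORDS = {"in", "the", "of", "a", "and", "on"}


def naive_stem(word):
    """Strip a trailing plural 's'; a crude stand-in for a real stemmer."""
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


def to_tsvector_sketch(text):
    """Return {lexeme: [positions]} with stop words removed."""
    vector = {}
    tokens = re.findall(r"\w+", text.lower())
    for position, token in enumerate(tokens, start=1):
        if token in STOP_WORDS:
            continue  # dropped, but the position counter still advances
        vector.setdefault(naive_stem(token), []).append(position)
    return vector


vector = to_tsvector_sketch("in the list of stop words")
print(vector)
```

For this sentence the sketch yields the same lexemes and positions as the slide: `list` at 3, `stop` at 5, `word` at 6.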
22. PostgreSQL full-text search integration with Django ORM
https://github.com/linuxlewis/djorm-ext-pgfulltext
from djorm_pgfulltext.models import SearchManager
from djorm_pgfulltext.fields import VectorField
from django.db import models

class Page(models.Model):
    name = models.CharField(max_length=200)
    description = models.TextField()
    search_index = VectorField()
    objects = SearchManager(
        fields=('name', 'description'),
        config='pg_catalog.english',  # this is default
        search_field='search_index',  # this is default
        auto_update_search_field=True
    )
23. To search, just use the search method of the manager
https://github.com/linuxlewis/djorm-ext-pgfulltext
>>> Page.objects.search("documentation & about")
[<Page: Page: Home page>]
>>> Page.objects.search("about | documentation | django | home", raw=True)
[<Page: Page: Home page>, <Page: Page: About>, <Page: Page: Navigation>]
24. Second way
class Page(models.Model):
    name = models.CharField(max_length=200)
    description = models.TextField()
    objects = SearchManager(fields=None, search_field=None)

>>> Page.objects.search("documentation & about", fields=('name', 'description'))
[<Page: Page: Home page>]
>>> Page.objects.search("about | documentation | django | home", raw=True, fields=('name', 'description'))
[<Page: Page: Home page>, <Page: Page: About>, <Page: Page: Navigation>]
25. Pros and Cons
Pros:
• Quick implementation
• No dependency
Cons:
• Indexes must be managed manually
• Not as flexible as dedicated search engines
• Not as fast as Elasticsearch
• Tied to PostgreSQL
• No analytics data
• No query DSL, only `&` and `|` queries
• Difficult to manage stop words
33. Adding search functionality to Simple Model
$ cat myapp/models.py
from django.db import models
from django.contrib.auth.models import User

class Page(models.Model):
    user = models.ForeignKey(User)
    name = models.CharField(max_length=200)
    description = models.TextField()

    def __unicode__(self):
        return self.name
36. Haystack: Creating SearchIndexes
$ cat myapp/search_indexes.py
import datetime
from haystack import indexes
from myapp.models import Page

# Assumes Page also has a pub_date field, which the model on slide 33 does not show.
class PageIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    author = indexes.CharField(model_attr='user')
    pub_date = indexes.DateTimeField(model_attr='pub_date')

    def get_model(self):
        return Page

    def index_queryset(self, using=None):
        """Used when the entire index for model is updated."""
        return self.get_model().objects.filter(pub_date__lte=datetime.datetime.now())
37. Haystack: SearchQuerySet API
from haystack.query import SearchQuerySet
from haystack.inputs import Raw

all_results = SearchQuerySet().all()
hello_results = SearchQuerySet().filter(content='hello')
unfriendly_results = SearchQuerySet().exclude(content='hello').filter(content='world')
# To send unescaped data:
sqs = SearchQuerySet().filter(title=Raw(trusted_query))
38. Keeping data in sync
# Update everything.
./manage.py update_index --settings=settings.prod
# Update everything with lots of information about what's going on.
./manage.py update_index --settings=settings.prod --verbosity=2
# Update everything, cleaning up after deleted models.
./manage.py update_index --remove --settings=settings.prod
# Update everything changed in the last 2 hours.
./manage.py update_index --age=2 --settings=settings.prod
# Update everything between Dec. 1, 2011 & Dec 31, 2011
./manage.py update_index --start='2011-12-01T00:00:00' --end='2011-12-31T23:59:59' --settings=settings.prod
39. Signals
from django.db import models
from haystack.signals import BaseSignalProcessor

class RealtimeSignalProcessor(BaseSignalProcessor):
    """
    Allows for observing when saves/deletes fire & automatically updates the
    search engine appropriately.
    """
    def setup(self):
        # Naive (listen to all model saves).
        models.signals.post_save.connect(self.handle_save)
        models.signals.post_delete.connect(self.handle_delete)
        # Efficient would be going through all backends & collecting all models
        # being used, then hooking up signals only for those.

    def teardown(self):
        # Naive (listen to all model saves).
        models.signals.post_save.disconnect(self.handle_save)
        models.signals.post_delete.disconnect(self.handle_delete)
        # Efficient would be going through all backends & collecting all models
        # being used, then disconnecting signals only for those.
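The idea behind signal-driven syncing can be shown without Django at all. The sketch below is a toy model under stated assumptions: `Signal` is a minimal hand-rolled stand-in (not Django's signal API), and `ToyIndex` is a hypothetical in-memory index. It only illustrates the setup/teardown pattern of hooking save and delete events to index updates.

```python
class Signal:
    """Minimal stand-in for a signal: a list of callbacks to notify."""

    def __init__(self):
        self._receivers = []

    def connect(self, receiver):
        self._receivers.append(receiver)

    def disconnect(self, receiver):
        self._receivers.remove(receiver)

    def send(self, instance):
        # Iterate over a copy so receivers may disconnect while being notified.
        for receiver in list(self._receivers):
            receiver(instance)


post_save = Signal()
post_delete = Signal()


class ToyIndex:
    """Hypothetical in-memory search index keyed by object id."""

    def __init__(self):
        self.documents = {}

    def setup(self):
        post_save.connect(self.handle_save)
        post_delete.connect(self.handle_delete)

    def teardown(self):
        post_save.disconnect(self.handle_save)
        post_delete.disconnect(self.handle_delete)

    def handle_save(self, instance):
        self.documents[instance["id"]] = instance["name"]

    def handle_delete(self, instance):
        self.documents.pop(instance["id"], None)


index = ToyIndex()
index.setup()
page = {"id": 1, "name": "Home page"}
post_save.send(page)  # the index now holds the saved page
print(index.documents)
```

The trade-off is the same as on the slide: every save triggers an index update immediately, which keeps search results fresh at the cost of extra work on each write.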
40. Haystack: Pros and Cons
Pros:
• Easy to set up
• Looks like the Django ORM, but for searches
• Search engine independent
• Supports 4 engines (Elasticsearch, Solr, Xapian, Whoosh)
Cons:
• Poor SearchQuerySet API
• Difficult to manage stop words
• Loses performance because of the extra layer
• Model-based
41. Future FTS and Roadmap Django 1.9
• PostgreSQL Full Text Search (Marc Tamlyn)
https://github.com/django/django/pull/4726
• Custom indexes (Marc Tamlyn)
• etc.