The document discusses full text search in Python. It begins with an introduction to the speaker and covers information explosion and text search tools like grep. It then explains search indexes and inverted indexes using examples. The document discusses normalization in indexes and search in databases like PostgreSQL. It describes operators for textual data types in PostgreSQL for matching strings and regular expressions.
What is the best full text search engine for Python? — Andrii Soldatenko
Nowadays we can see lots of benchmarks and performance tests of different web frameworks and Python tools. When it comes to search engines, it's difficult to find useful information, especially benchmarks or comparisons between different engines. It's hard to decide which search engine you should select: Elasticsearch, Postgres full text search, or perhaps Sphinx or Whoosh. You face a difficult choice, which is why I am pleased to share my acquired experience and benchmarks, focusing on how to compare full text search engines for Python.
Social applications are everywhere: we use Facebook, Twitter, and Instagram every day, and many such apps are based on a social graph and graph theory. I would like to share my knowledge of how to work with graphs and build a large social graph as the engine for a social network using Python and graph databases. We'll compare SQL and NoSQL approaches to friend relationships.
There are a number of players that provide full text search, from embedded search to dedicated search servers (Solr, Sphinx, Elasticsearch, etc.), but setting up and configuring them is a time-consuming process and requires considerable knowledge of the tools.
What if we could get comparable search results using the full text search capabilities of Postgres? Developers already have working knowledge of the database, so this should come naturally. In addition, it is one less tool to manage.
Code: https://github.com/Syerram/postgres_search
In this slide, we introduce the mechanism of Solr used in Search Engine Back End API Solution for Fast Prototyping (LDSP). You will learn how to create a new core, update schema, query and sort in Solr.
Understanding Graph Databases with Neo4j and Cypher — Ruhaim Izmeth
Introduction to graph database concepts, explained by comparison with the widely popular relational databases and the SQL query language. Neo4j and Cypher are used to describe how graph databases work in real life.
Node collaboration - Exported Resources and PuppetDB — m_richardson
Node collaboration: how can your servers share information with each other? Exploring Exported Resources, PuppetDB and other methods.
This talk was given at Sydney Puppet Users Meetup on 14/08/2014.
Webscale PostgreSQL - JSONB and Horizontal Scaling Strategies — Jonathan Katz
All data is relational and can be represented through relational algebra, right? Perhaps, but there are other ways to represent data, and the PostgreSQL team continues to work on making it easier and more efficient to do so!
With the upcoming 9.4 release, PostgreSQL is introducing the "JSONB" data type, which allows for fast, compressed storage of JSON-formatted data and for quick retrieval. And JSONB comes with all the benefits of PostgreSQL, like its data durability, MVCC, and of course access to all the other data types and features in PostgreSQL.
How fast is JSONB? How do we access data stored with this type? What can it do with the rest of PostgreSQL? What can't it do? How can we leverage this new data type and make PostgreSQL scale horizontally? Follow along with our presentation as we try to answer these questions.
An overview of how a web search engine is organized is provided. A key component of the AltaVista search engine: its indexing library, is described in more depth. The library manages a set of inverted files, and provides mechanisms to construct and optimize complex queries on those inverted files. The design goals were to enable efficient queries on bodies of text up to a few hundred gigabytes in size (e.g. AltaVista) without sacrificing too much generality, and without giving up on small applications (e.g. mail directories).
PuppetDB: A Single Source for Storing Your Puppet Data - PUG NY — Puppet
James Sweeney presents on "PuppetDB: A Single Source for Storing Your Puppet Data" at Puppet User Group NYC.
Video: http://www.youtube.com/watch?v=HTr4b02aU7A
Puppet NYC: http://www.meetup.com/puppetnyc-meetings/
Doing Horrible Things with DNS - Web Directions South — Tom Croucher
How can we make use of DNS to improve the performance of web sites? A simple introduction to DNS and a neat trick to improve web site performance using DNS.
This talk gives an introduction to RediSearch and explains how and when to use RediSearch in different scenarios.
YouTube: https://www.youtube.com/watch?v=RlY-tprKzxg
An updated talk about how to use Solr for logs and other time-series data, like metrics and social media. In 2016, Solr, its ecosystem, and the operating systems it runs on have evolved quite a lot, so we can now show new techniques to scale and new knobs to tune.
We'll start by looking at how to scale SolrCloud through a hybrid approach using a combination of time- and size-based indices, and also how to divide the cluster in tiers in order to handle the potentially spiky load in real-time. Then, we'll look at tuning individual nodes. We'll cover everything from commits, buffers, merge policies and doc values to OS settings like disk scheduler, SSD caching, and huge pages.
Finally, we'll take a look at the pipeline of getting the logs to Solr and how to make it fast and reliable: where should buffers live, which protocols to use, where should the heavy processing be done (like parsing unstructured data), and which tools from the ecosystem can help.
Accelerating Local Search with PostgreSQL (KNN-Search) — Jonathan Katz
KNN-GiST indexes were added in PostgreSQL 9.1 and greatly accelerate some common queries in the geospatial and textual search realms. This presentation will demonstrate the power of KNN-GiST indexes on geospatial and text searching queries, but also their present limitations, through some of my experimentations. I will also discuss some of the theory behind KNN (k-nearest neighbor) as well as some of the applications this feature can be applied to.
To see a version of the talk given at PostgresOpen 2011, please visit http://www.youtube.com/watch?v=N-MD08QqGEM
Practical continuous quality gates for development process — Andrii Soldatenko
There are a lot of books and publications about continuous integration. But in my experience it's difficult to find information about how to open quality gates between automated tests and continuous integration practice in your current project. After reading several articles and even a couple of books you will understand how to work with it. But what next? I will share practical tips and tricks on how to lift the iron curtain from your automated tests towards a continuous quality practice today. It is for this reason that I am pleased to share my acquired experience in this presentation.
“Time is at once the most valuable and the most perishable of all our possessions.” Accordingly, we must know how to improve the quality of a project within limited timeframes. The goal of my presentation is to improve the execution time of automated functional tests based on Selenium WebDriver, for instance through parallel execution, scaling by distributing tests across several machines, and a strategy for generating big sets of test data for a typical project. I am pleased to share my acquired experience in this field.
We live in a changeable world, and our applications are also very inconstant. As a result we have to know how to improve project quality. My presentation covers modern approaches to designing and implementing automated functional tests: using design patterns, improving test execution time through parallel execution, scaling by distributing tests across several machines, creating a strategy for generating big sets of test data, and setting up a skeleton for organizing tests for a typical Django project. I am pleased to share my acquired experience in this field.
Full text search | Speech by Matteo Durighetto | PGDay.IT 2013 — Miriade Spa
Slides from Matteo Durighetto's talk at PGDay.IT 2013, Prato, 25 October 2013.
Full text search arises from the need to find words or their derivatives inside a document. The problem cannot always be solved with regular expressions: think of irregular plurals (where matching requires a dictionary) or of computing the similarity between words (for example, to find the most relevant topic and rank results).
In this talk we will explore PostgreSQL's features and its capabilities in this area.
Scaling search to a million pages with Solr, Python, and Django — tow21
A talk given to DJUGL on 26th July 2010, describing and introducing Solr, and discussing how we use it at Timetric to drive navigation across over a million data series.
Full text search in PostgreSQL is a flexible and powerful facility for searching a collection of documents using natural language queries. We will discuss several new improvements to FTS in the PostgreSQL 9.6 release, such as phrase search, better dictionaries support and tsvector editing functions. We will also present new features currently in development: RUM index support, which accelerates some important kinds of full text queries, a new and better ranking function for relevance search, loading dictionaries into shared memory, and support for searching multilingual content.
A comparison of different solutions for full-text search in web applications using PostgreSQL and other technology. Presented at the PostgreSQL Conference West, in Seattle, October 2009.
Unit tests are not limited to application code; tests can also be run against the data and schemas of databases.
Talk given at the PostgreSQL meetup on 22 June 2016 in Nantes.
Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary data — Anne Nicolas
GNU poke is a new interactive editor for binary data. Not limited to editing basic entities such as bits and bytes, it provides a full-fledged procedural, interactive programming language designed to describe data structures and to operate on them. Once a user has defined a structure for binary data (usually matching some file format) she can search, inspect, create, shuffle and modify abstract entities such as ELF relocations, MP3 tags, DWARF expressions, partition table entries, and so on, with primitives resembling simple editing of bits and bytes. The program comes with a library of already written descriptions (or "pickles" in poke parlance) for many binary formats.
GNU poke is useful in many domains. It is very well suited to aid in the development of programs that operate on binary files, such as assemblers and linkers. This was in fact the primary inspiration that brought me to write it: easily injecting flaws into ELF files in order to reproduce toolchain bugs. Also, due to its flexibility, poke is also very useful for reverse engineering, where the real structure of the data being edited is discovered by experiment, interactively. It is also good for the fast development of prototypes for programs like linkers, compressors or filters, and it provides a convenient foundation to write other utilities such as diff and patch tools for binary files.
This talk (unlike Gaul) is divided into four parts. First I will introduce the program and show what it does: from simple bits/bytes editing to user-defined structures. Then I will show some of the internals, and how poke is implemented. The third block will cover the way of using Poke to describe user data, which is to say the art of writing “pickles”. The presentation ends with a status of the project, a call for hackers, and a hint at future works.
Jose E. Marchesi
Source http://www.slideshare.net/SignisVavere
Signis Vāvere - senior database analyst at the second biggest bank in Latvia
Topic: Oracle DBA utilities - standard and non-standard solutions using the shell
Language: Latvian
Description: How to get a shell, the shell as such, the most frequently used constructs, the most essential commands, server information, monitoring, troubleshooting. A few real-life examples of how shell knowledge eases and speeds up daily DBA work.
Andreas Zeller's keynote at the 1st Intl Fuzzing workshop 2022 at NDSS: https://fuzzingworkshop.github.io/program.html
Do you fuzz your own program, or do you fuzz someone else's program? The answer to this question has vast consequences on your view on fuzzing. Fuzzing someone else's program is the typical adverse "security tester" perspective, where you want your fuzzer to be as automatic and versatile as possible. Fuzzing your own code, however, is more like a traditional tester perspective, where you may assume some knowledge about the program and its context, but may also want to _exploit_ this knowledge - say, to direct the fuzzer to critical locations.
In this talk, I detail these differences in perspectives and assumptions, and highlight their consequences for fuzzer design and research. I also highlight cultural differences in the research communities, and what happens if you submit a paper to the wrong community. I close with an outlook into our newest frameworks, set to reconcile these perspectives by giving users unprecedented control over fuzzing, yet staying fully automatic if need be.
Bio: Andreas Zeller is faculty at the CISPA Helmholtz Center for Information Security and a professor for Software Engineering at Saarland University, both in Saarbrücken, Germany. His research on automated debugging, mining software archives, specification mining, and security testing has won several awards for its impact in academia and industry. Zeller is an ACM Fellow, an IFIP Fellow, an ERC Advanced Grant Awardee, and holds an ACM SIGSOFT Outstanding Research Award.
pg_proctab: Accessing System Stats in PostgreSQL — Mark Wong
pg_proctab is a collection of PostgreSQL stored functions that provide access to the operating system process table using SQL. We'll show you which functions are available and where they collect the data, and give examples of their use to collect processor and I/O statistics on SQL queries. These stored functions currently only work on Linux-based systems.
Recently the interest in concurrent programming has grown dramatically. Unfortunately, parallel programs do not always have reproducible behavior; even when they are run with the same inputs, their results can be radically different. In this talk I'll show how to debug concurrent programs in Go.
I'll start by showing how you can debug your goroutines using the delve and gdb debuggers. Then I'll visualize goroutines in different scenarios; sometimes this helps to better understand how things work. The next part of the talk is about dumping a goroutine stack trace of your application while it's running and inspecting what each goroutine is doing. I'll also demonstrate how to debug leaking goroutines by tracing how the scheduler runs goroutines on logical processors, which are bound to a physical processor via the attached operating system thread.
As a bonus, I'll cover debugging tips on how to find deadlocks and how to avoid race conditions in your application.
Serverless is a new trend in software development, and it confuses many developers around the world. In this talk I'll explain how to build not just an image-cropping function or a DynamoDB lookup, but a real application: what kinds of trouble to expect, how to decide whether your task fits a serverless architecture in Python or whether a more general approach is better, how fast serverless applications written in Python are, and, more importantly, how to scale them.
PyCon Russia 2015 - Dive into full text search with Python.
1. Dive into full text search with Python
Andrii Soldatenko
18-19 September 2015
@a_soldatenko
2. About me:
• Lead QA Automation Engineer at
• Backend Python Developer at
• Speaker at PyCon Ukraine 2014
• Speaker at PyCon Belarus 2015
• @a_soldatenko
8. Simple sentences
1. The quick brown fox jumped over the lazy dog
2. Quick brown foxes leap over lazy dogs in summer
9. Inverted index

Term    | Doc_1 | Doc_2
--------+-------+------
Quick   |       |   X
The     |   X   |
brown   |   X   |   X
dog     |   X   |
dogs    |       |   X
fox     |   X   |
foxes   |       |   X
in      |       |   X
jumped  |   X   |
lazy    |   X   |   X
leap    |       |   X
over    |   X   |   X
quick   |   X   |
summer  |       |   X
the     |   X   |
10. Inverted index

Term    | Doc_1 | Doc_2
--------+-------+------
brown   |   X   |   X
quick   |   X   |
--------+-------+------
Total   |   2   |   1
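The two tables above can be reproduced in a few lines of plain Python (a minimal sketch to make the idea concrete, not code from the talk):

```python
# Build an inverted index over the two example sentences, then answer the
# query "quick brown" by counting matching terms per document.
docs = {
    "Doc_1": "The quick brown fox jumped over the lazy dog",
    "Doc_2": "Quick brown foxes leap over lazy dogs in summer",
}

# term -> set of documents containing it (no normalization yet)
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def score(query):
    """Count how many query terms each document contains."""
    totals = {doc_id: 0 for doc_id in docs}
    for term in query.split():
        for doc_id in index.get(term, ()):
            totals[doc_id] += 1
    return totals

print(score("quick brown"))  # Doc_1 matches both terms, Doc_2 only "brown"
```

Note that without normalization the capitalized "Quick" in Doc_2 is a different term from "quick", which is exactly the problem the next slide addresses.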
11. Inverted index: normalization

After normalization:

Term    | Doc_1 | Doc_2
--------+-------+------
brown   |   X   |   X
dog     |   X   |   X
fox     |   X   |   X
in      |       |   X
jump    |   X   |   X
lazy    |   X   |   X
over    |   X   |   X
quick   |   X   |   X
summer  |       |   X
the     |   X   |
--------+-------+------

Before normalization (for comparison):

Term    | Doc_1 | Doc_2
--------+-------+------
Quick   |       |   X
The     |   X   |
brown   |   X   |   X
dog     |   X   |
dogs    |       |   X
fox     |   X   |
foxes   |       |   X
in      |       |   X
jumped  |   X   |
lazy    |   X   |   X
leap    |       |   X
over    |   X   |   X
quick   |   X   |
summer  |       |   X
the     |   X   |
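The normalization step above (lowercasing, stemming, and folding synonyms such as "leap" into "jump") can be sketched with a toy normalizer; real engines use proper dictionaries and stemmers, so treat this only as an illustration:

```python
# Toy normalization: lowercase, strip a few common English suffixes, and
# fold known synonyms. PostgreSQL and Elasticsearch do this with real
# dictionaries/stemmers; this sketch only illustrates the idea.
SYNONYMS = {"leap": "jump"}

def normalize(token):
    token = token.lower()
    for suffix in ("es", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            token = token[: -len(suffix)]
            break
    return SYNONYMS.get(token, token)

doc_1 = "The quick brown fox jumped over the lazy dog"
doc_2 = "Quick brown foxes leap over lazy dogs in summer"

terms_1 = {normalize(t) for t in doc_1.split()}
terms_2 = {normalize(t) for t in doc_2.split()}

# After normalization both documents share quick/brown/fox/jump/over/lazy/dog,
# so a search for "Quick" or "jumping dogs" can now match both of them.
shared = terms_1 & terms_2
```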
16. Full text search in PostgreSQL
1. Creating tokens
2. Converting tokens into lexemes
3. Storing preprocessed documents

17. Full text search in PostgreSQL
• 27 built-in configurations for 10 languages
• Support of user-defined FTS configurations
• Pluggable dictionaries, parsers
• Inverted indexes
18. Functions to convert normal text to tsvector

explain SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector
        @@ 'cat & rat'::tsquery;
                QUERY PLAN
------------------------------------------
 Result  (cost=0.00..0.01 rows=1 width=0)
(1 row)

explain SELECT 'fat & cow'::tsquery
        @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
-- false
                QUERY PLAN
------------------------------------------
 Result  (cost=0.00..0.01 rows=1 width=0)
(1 row)
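The `@@` match operator in the queries above can be mimicked in Python to make its semantics concrete. This is a toy model covering only `&`-joined lexemes; real tsquery syntax is much richer (`|`, `!`, `<->`, prefix matching):

```python
# A toy model of PostgreSQL's `tsvector @@ tsquery` match for queries
# that only use the & (AND) operator.
def to_tsvector(text):
    """Crude stand-in: the set of whitespace-separated tokens."""
    return set(text.split())

def matches(tsvector, tsquery):
    """True if every &-separated lexeme in tsquery is present."""
    lexemes = [part.strip() for part in tsquery.split("&")]
    return all(lexeme in tsvector for lexeme in lexemes)

doc = to_tsvector("a fat cat sat on a mat and ate a fat rat")
print(matches(doc, "cat & rat"))  # True, as in the first query above
print(matches(doc, "fat & cow"))  # False, as in the second query
```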
19. PostgreSQL: index management

CREATE FUNCTION notes_vector_update() RETURNS TRIGGER AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        new.search_index =
            to_tsvector('pg_catalog.english', COALESCE(NEW.name, ''));
    END IF;
    IF TG_OP = 'UPDATE' THEN
        IF NEW.name <> OLD.name THEN
            new.search_index =
                to_tsvector('pg_catalog.english', COALESCE(NEW.name, ''));
        END IF;
    END IF;
    RETURN NEW;
END
$$ LANGUAGE 'plpgsql';
20. PostgreSQL: stopwords

SELECT to_tsvector('english', 'in the list of stop words');
        to_tsvector
----------------------------
 'list':3 'stop':5 'word':6

/usr/pgsql-9.3/share/tsearch_data/english.stop
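The output above shows two things at once: stop words are dropped, but each surviving lexeme keeps its position in the original token stream. A toy imitation in Python (naive plural stripping stands in for the real dictionary):

```python
# Imitate to_tsvector('english', ...): drop stop words but keep each
# surviving lexeme's position in the ORIGINAL token stream, and apply a
# naive plural strip ("words" -> "word"). Toy code, not Postgres itself.
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the"}

def toy_tsvector(text):
    vector = {}
    for position, token in enumerate(text.lower().split(), start=1):
        if token in STOP_WORDS:
            continue  # dropped, but the position counter keeps advancing
        lexeme = token[:-1] if token.endswith("s") and len(token) > 3 else token
        vector.setdefault(lexeme, []).append(position)
    return vector

print(toy_tsvector("in the list of stop words"))
# -> {'list': [3], 'stop': [5], 'word': [6]}, matching the SQL output above
```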
22. Malcolm Tredinnick's Advice on Writing SQL in Django:

“If you need to write advanced SQL you should write it. I would balance that by cautioning against overuse of the raw() and extra() methods.”
23. PostgreSQL full-text search integration with the Django ORM
https://github.com/linuxlewis/djorm-ext-pgfulltext

from djorm_pgfulltext.models import SearchManager
from djorm_pgfulltext.fields import VectorField
from django.db import models

class Page(models.Model):
    name = models.CharField(max_length=200)
    description = models.TextField()
    search_index = VectorField()

    objects = SearchManager(
        fields=('name', 'description'),
        config='pg_catalog.english',   # this is default
        search_field='search_index',   # this is default
        auto_update_search_field=True
    )
24. For search, just use the search method of the manager
https://github.com/linuxlewis/djorm-ext-pgfulltext

>>> Page.objects.search("documentation & about")
[<Page: Page: Home page>]
>>> Page.objects.search("about | documentation | django | home", raw=True)
[<Page: Page: Home page>, <Page: Page: About>, <Page: Page: Navigation>]
25. Second way

class Page(models.Model):
    name = models.CharField(max_length=200)
    description = models.TextField()

    objects = SearchManager(fields=None, search_field=None)

>>> Page.objects.search("documentation & about",
...                     fields=('name', 'description'))
[<Page: Page: Home page>]
>>> Page.objects.search("about | documentation | django | home", raw=True,
...                     fields=('name', 'description'))
[<Page: Page: Home page>, <Page: Page: About>, <Page: Page: Navigation>]
26. Pros and Cons
Pros:
• Quick implementation
• No extra dependencies
Cons:
• Indexes need to be managed manually
• Not as flexible as dedicated search engines
• Tied to PostgreSQL
• No analytics data
• No DSL, only `&` and `|` queries
• Difficult to manage stop words
35. Adding search functionality to a simple model

$ cat myapp/models.py
from django.db import models
from django.contrib.auth.models import User

class Page(models.Model):
    user = models.ForeignKey(User)
    name = models.CharField(max_length=200)
    description = models.TextField()

    def __unicode__(self):
        return self.name
38. Haystack: creating SearchIndexes

$ cat myapp/search_indexes.py
import datetime
from haystack import indexes
from myapp.models import Note

class PageIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    author = indexes.CharField(model_attr='user')
    pub_date = indexes.DateTimeField(model_attr='pub_date')

    def get_model(self):
        return Note

    def index_queryset(self, using=None):
        """Used when the entire index for the model is updated."""
        return self.get_model().objects.filter(
            pub_date__lte=datetime.datetime.now())
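The `document=True, use_template=True` field tells Haystack to render the indexed text from a data template, which by Haystack's convention lives at `search/indexes/<app_label>/<model_name>_text.txt`. A sketch for the `Note` model above might look like this (the field names are assumptions, since the model's fields aren't shown on the slides):

{# templates/search/indexes/myapp/note_text.txt #}
{{ object.title }}
{{ object.user.get_full_name }}
{{ object.body }}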
39. Haystack: SearchQuerySet API

from haystack.query import SearchQuerySet
from haystack.inputs import Raw

all_results = SearchQuerySet().all()
hello_results = SearchQuerySet().filter(content='hello')
unfriendly_results = SearchQuerySet() \
    .exclude(content='hello') \
    .filter(content='world')

# To send unescaped data:
sqs = SearchQuerySet().filter(title=Raw(trusted_query))
40. Keeping data in sync

# Update everything.
./manage.py update_index --settings=settings.prod

# Update everything with lots of information about what's going on.
./manage.py update_index --settings=settings.prod --verbosity=2

# Update everything, cleaning up after deleted models.
./manage.py update_index --remove --settings=settings.prod

# Update everything changed in the last 2 hours.
./manage.py update_index --age=2 --settings=settings.prod

# Update everything between Dec. 1, 2011 & Dec. 31, 2011.
./manage.py update_index --start='2011-12-01T00:00:00' \
    --end='2011-12-31T23:59:59' --settings=settings.prod
41. Signals

class RealtimeSignalProcessor(BaseSignalProcessor):
    """
    Allows for observing when saves/deletes fire & automatically updates
    the search engine appropriately.
    """
    def setup(self):
        # Naive (listen to all model saves).
        models.signals.post_save.connect(self.handle_save)
        models.signals.post_delete.connect(self.handle_delete)
        # Efficient would be going through all backends & collecting all
        # models being used, then hooking up signals only for those.

    def teardown(self):
        # Naive (listen to all model saves).
        models.signals.post_save.disconnect(self.handle_save)
        models.signals.post_delete.disconnect(self.handle_delete)
        # Efficient would be going through all backends & collecting all
        # models being used, then disconnecting signals only for those.
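To activate a signal processor like the one above, Haystack is pointed at it from Django settings via `HAYSTACK_SIGNAL_PROCESSOR` (a one-line config fragment):

```python
# settings.py — tell Haystack to update indexes in real time on save/delete.
HAYSTACK_SIGNAL_PROCESSOR = 'haystack.signals.RealtimeSignalProcessor'
```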
42. Haystack: Pros and Cons
Pros:
• easy to set up
• looks like the Django ORM, but for search
• search engine independent
• supports 4 engines (Elasticsearch, Solr, Xapian, Whoosh)
Cons:
• poor SearchQuerySet API
• difficult to manage stop words
• some performance is lost because of the extra layer
• model-based
43. Future FTS and Roadmap: Django 1.9
• PostgreSQL Full Text Search (Marc Tamlyn)
https://github.com/django/django/pull/4726
• Custom indexes (Marc Tamlyn)
• etc.