The document discusses full text search in Python. It begins with an introduction to the speaker and covers information explosion and text search tools like grep. It then explains search indexes and inverted indexes using examples. The document discusses normalization in indexes and search in databases like PostgreSQL. It describes operators for textual data types in PostgreSQL for matching strings and regular expressions.
What is the best full text search engine for Python? — Andrii Soldatenko
Nowadays we can see lots of benchmarks and performance tests of different web frameworks and Python tools. When it comes to search engines, it's difficult to find useful information, especially benchmarks or comparisons between different engines. It's hard to decide which search engine you should select: Elasticsearch, Postgres full text search, or perhaps Sphinx or Whoosh. You face a difficult choice, which is why I am pleased to share my acquired experience and benchmarks, focusing on how to compare full text search engines for Python.
Social applications are everywhere: we use Facebook, Twitter, and Instagram every day, and many such apps are based on a social graph and graph theory. I would like to share my knowledge of how to work with graphs and build a large social graph as the engine for a social network using Python and graph databases. We'll compare SQL and NoSQL approaches to friend relationships.
There are a number of players that provide full text search, from embedded search to dedicated search servers (Solr, Sphinx, Elasticsearch, etc.), but setting up and configuring them is a time-consuming process and requires considerable knowledge of the tools.
What if we could get comparable search results using the full text search capabilities of Postgres? Developers already have working knowledge of the database, so this should come naturally. In addition, it is one less tool to manage.
Code: https://github.com/Syerram/postgres_search
In this slide, we introduce the mechanism of Solr used in Search Engine Back End API Solution for Fast Prototyping (LDSP). You will learn how to create a new core, update schema, query and sort in Solr.
Understanding Graph Databases with Neo4j and Cypher — Ruhaim Izmeth
Introduction to graph database concepts, explained by comparison with the widely popular relational databases and the SQL query language. Neo4j and Cypher are used to describe how graph databases work in real life.
Node collaboration - Exported Resources and PuppetDB — m_richardson
Node collaboration: how can your servers share information with each other? Exploring Exported Resources, PuppetDB and other methods.
This talk was given at Sydney Puppet Users Meetup on 14/08/2014.
Webscale PostgreSQL - JSONB and Horizontal Scaling Strategies — Jonathan Katz
All data is relational and can be represented through relational algebra, right? Perhaps, but there are other ways to represent data, and the PostgreSQL team continues to work on making it easier and more efficient to do so!
With the upcoming 9.4 release, PostgreSQL is introducing the "JSONB" data type, which allows for fast, compressed storage of JSON-formatted data and for quick retrieval. And JSONB comes with all the benefits of PostgreSQL, like its data durability, MVCC, and of course access to all the other data types and features in PostgreSQL.
How fast is JSONB? How do we access data stored with this type? What can it do with the rest of PostgreSQL? What can't it do? How can we leverage this new data type and make PostgreSQL scale horizontally? Follow along with our presentation as we try to answer these questions.
An overview of how a web search engine is organized is provided. A key component of the AltaVista search engine: its indexing library, is described in more depth. The library manages a set of inverted files, and provides mechanisms to construct and optimize complex queries on those inverted files. The design goals were to enable efficient queries on bodies of text up to a few hundred gigabytes in size (e.g. AltaVista) without sacrificing too much generality, and without giving up on small applications (e.g. mail directories).
PuppetDB: A Single Source for Storing Your Puppet Data - PUG NY — Puppet
James Sweeney presents on "PuppetDB: A Single Source for Storing Your Puppet Data" at Puppet User Group NYC.
Video: http://www.youtube.com/watch?v=HTr4b02aU7A
Puppet NYC: http://www.meetup.com/puppetnyc-meetings/
Doing Horrible Things with DNS - Web Directions South — Tom Croucher
How can we make use of DNS to improve the performance of web sites? A simple introduction to DNS and a neat trick to improve web site performance using DNS.
This talk gives an introduction to RediSearch and explains how and when to use RediSearch in different scenarios.
YouTube: https://www.youtube.com/watch?v=RlY-tprKzxg
An updated talk about how to use Solr for logs and other time-series data, like metrics and social media. In 2016, Solr, its ecosystem, and the operating systems it runs on have evolved quite a lot, so we can now show new techniques to scale and new knobs to tune.
We'll start by looking at how to scale SolrCloud through a hybrid approach using a combination of time- and size-based indices, and also how to divide the cluster in tiers in order to handle the potentially spiky load in real-time. Then, we'll look at tuning individual nodes. We'll cover everything from commits, buffers, merge policies and doc values to OS settings like disk scheduler, SSD caching, and huge pages.
Finally, we'll take a look at the pipeline of getting the logs to Solr and how to make it fast and reliable: where should buffers live, which protocols to use, where should the heavy processing be done (like parsing unstructured data), and which tools from the ecosystem can help.
Accelerating Local Search with PostgreSQL (KNN-Search) — Jonathan Katz
KNN-GiST indexes were added in PostgreSQL 9.1 and greatly accelerate some common queries in the geospatial and textual search realms. This presentation will demonstrate the power of KNN-GiST indexes on geospatial and text searching queries, but also their present limitations, through some of my experimentations. I will also discuss some of the theory behind KNN (k-nearest neighbor) as well as some of the applications this feature can be applied to.
To see a version of the talk given at PostgresOpen 2011, please visit http://www.youtube.com/watch?v=N-MD08QqGEM
Practical continuous quality gates for development process — Andrii Soldatenko
There are a lot of books and publications about continuous integration. But in my experience it's difficult to find information about how to open quality gates between automated tests and continuous integration practice in your current project. After reading several articles and even a couple of books you will understand how to work with it. But what next? I will share practical tips and tricks on how to lift the iron curtain from your automated tests towards a continuous quality practice today. It is for this reason that I am pleased to share my acquired experience in this presentation.
“Time is at once the most valuable and the most perishable of all our possessions.” Accordingly, we must know how to improve the quality of a project within limited timeframes. The goal of my presentation is to improve the execution time of automated functional tests based on Selenium WebDriver, for instance through parallel execution, scaling by distributing tests across several machines, and a strategy for generating big sets of test data for a typical project. I am pleased to share my acquired experience in this field.
We live in a changeable world, and our applications are also very inconstant. As a result we have to know how to improve project quality. My presentation covers modern approaches to designing and implementing automated functional tests: using design patterns, improving test execution time through parallel execution, scaling by distributing tests across several machines, creating a strategy for generating big sets of test data, and setting up a skeleton for organizing tests for a typical Django project. I am pleased to share my acquired experience in this field.
Full text search | Speech by Matteo Durighetto | PGDay.IT 2013 — Miriade Spa
Slides from Matteo Durighetto's talk at PGDay.IT 2013, Prato, 25 October 2013.
Full text search arises from the need to find words or their derivatives inside a document. The problem cannot always be solved with regular expressions: think of irregular plurals (where matching requires a dictionary) or of computing the similarity between words (for example, to find the most relevant topic and rank results).
In this talk we will explore PostgreSQL's features and its capabilities in this area.
Scaling search to a million pages with Solr, Python, and Django — tow21
A talk given to DJUGL on 26th July 2010, describing and introducing Solr, and discussing how we use it at Timetric to drive navigation across over a million data series.
Full text search in PostgreSQL is a flexible and powerful facility for searching a collection of documents using natural language queries. We will discuss several new improvements to FTS in the PostgreSQL 9.6 release, such as phrase search, better dictionaries support and tsvector editing functions. We will also present new features currently in development: RUM index support, which accelerates some important kinds of full text queries, a new and better ranking function for relevance search, loading dictionaries into shared memory, and support for searching multilingual content.
A comparison of different solutions for full-text search in web applications using PostgreSQL and other technology. Presented at the PostgreSQL Conference West, in Seattle, October 2009.
Unit tests are not limited to application code; tests can also be run against the data and schemas of databases.
Talk given at the PostgreSQL meetup on 22 June 2016 in Nantes.
Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary data — Anne Nicolas
GNU poke is a new interactive editor for binary data. Not limited to editing basic entities such as bits and bytes, it provides a full-fledged procedural, interactive programming language designed to describe data structures and to operate on them. Once a user has defined a structure for binary data (usually matching some file format) she can search, inspect, create, shuffle and modify abstract entities such as ELF relocations, MP3 tags, DWARF expressions, partition table entries, and so on, with primitives resembling simple editing of bits and bytes. The program comes with a library of already written descriptions (or "pickles" in poke parlance) for many binary formats.
GNU poke is useful in many domains. It is very well suited to aid in the development of programs that operate on binary files, such as assemblers and linkers. This was in fact the primary inspiration that brought me to write it: easily injecting flaws into ELF files in order to reproduce toolchain bugs. Also, due to its flexibility, poke is also very useful for reverse engineering, where the real structure of the data being edited is discovered by experiment, interactively. It is also good for the fast development of prototypes for programs like linkers, compressors or filters, and it provides a convenient foundation to write other utilities such as diff and patch tools for binary files.
This talk (unlike Gaul) is divided into four parts. First I will introduce the program and show what it does: from simple bits/bytes editing to user-defined structures. Then I will show some of the internals, and how poke is implemented. The third block will cover the way of using Poke to describe user data, which is to say the art of writing “pickles”. The presentation ends with a status of the project, a call for hackers, and a hint at future works.
Jose E. Marchesi
Source http://www.slideshare.net/SignisVavere
Signis Vāvere - senior database analyst at the second biggest bank in Latvia
Topic: Oracle DBA utilities - standard and non-standard solutions using the shell
Language: Latvian
Description: How to get a shell, the shell as such, the most frequently used constructs, the most essential commands, server information, monitoring, troubleshooting. A few real-life examples of how shell knowledge eases and speeds up daily DBA work.
Andreas Zeller's keynote at the 1st Intl Fuzzing workshop 2022 at NDSS: https://fuzzingworkshop.github.io/program.html
Do you fuzz your own program, or do you fuzz someone else's program? The answer to this question has vast consequences on your view on fuzzing. Fuzzing someone else's program is the typical adverse "security tester" perspective, where you want your fuzzer to be as automatic and versatile as possible. Fuzzing your own code, however, is more like a traditional tester perspective, where you may assume some knowledge about the program and its context, but may also want to _exploit_ this knowledge - say, to direct the fuzzer to critical locations.
In this talk, I detail these differences in perspectives and assumptions, and highlight their consequences for fuzzer design and research. I also highlight cultural differences in the research communities, and what happens if you submit a paper to the wrong community. I close with an outlook into our newest frameworks, set to reconcile these perspectives by giving users unprecedented control over fuzzing, yet staying fully automatic if need be.
Bio: Andreas Zeller is faculty at the CISPA Helmholtz Center for Information Security and a professor for Software Engineering at Saarland University, both in Saarbrücken, Germany. His research on automated debugging, mining software archives, specification mining, and security testing has won several awards for its impact in academia and industry. Zeller is an ACM Fellow, an IFIP Fellow, an ERC Advanced Grant Awardee, and holds an ACM SIGSOFT Outstanding Research Award.
pg_proctab: Accessing System Stats in PostgreSQL — Mark Wong
pg_proctab is a collection of PostgreSQL stored functions that provide access to the operating system process table using SQL. We'll show you which functions are available and where they collect the data, and give examples of their use to collect processor and I/O statistics on SQL queries. These stored functions currently only work on Linux-based systems.
Recently the interest in concurrent programming has grown dramatically. Unfortunately, parallel programs do not always have reproducible behavior; even when they are run with the same inputs, their results can be radically different. In this talk I'll show how to debug concurrent programs in Go.
I'll start by showing how you can debug your goroutines using the delve and gdb debuggers. Then I'll visualize goroutines in different scenarios; sometimes this helps to better understand how things work. The next part of the talk is about dumping a goroutine stack trace of your application while it's running and inspecting what each goroutine is doing. I'll also demonstrate how to debug leaking goroutines by tracing how the scheduler runs goroutines on logical processors, which are bound to a physical processor via the attached operating system thread.
As a bonus, I'll cover debugging tips on how to find deadlocks and how to avoid race conditions in your application.
Serverless is a new trend in software development, and it confuses many developers around the world. In this talk I'll explain how to build not just an image-cropping function or a DynamoDB lookup, but a real application: what kinds of trouble to expect, how to decide whether your task fits a serverless architecture in Python or whether a more general approach is better, how fast serverless applications written in Python are, and, more importantly, how to scale them.
PyCon Russia 2015 - Dive into full text search with Python.
1. Dive into full text search with Python
Andrii Soldatenko
18-19 September 2015
@a_soldatenko
2. About me:
• Lead QA Automation Engineer at
• Backend Python Developer at
• Speaker at PyCon Ukraine 2014
• Speaker at PyCon Belarus 2015
• @a_soldatenko
8. Simple sentences
1. The quick brown fox jumped over the lazy dog
2. Quick brown foxes leap over lazy dogs in summer
9. Inverted index

Term    | Doc_1 | Doc_2
--------+-------+------
Quick   |       |   X
The     |   X   |
brown   |   X   |   X
dog     |   X   |
dogs    |       |   X
fox     |   X   |
foxes   |       |   X
in      |       |   X
jumped  |   X   |
lazy    |   X   |   X
leap    |       |   X
over    |   X   |   X
quick   |   X   |
summer  |       |   X
the     |   X   |
10. Inverted index

Term    | Doc_1 | Doc_2
--------+-------+------
brown   |   X   |   X
quick   |   X   |
--------+-------+------
Total   |   2   |   1
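The two tables above can be reproduced in a few lines of plain Python (a minimal sketch to make the idea concrete, not code from the talk):

```python
# Build an inverted index over the two example sentences, then answer the
# query "quick brown" by counting matching terms per document.
docs = {
    "Doc_1": "The quick brown fox jumped over the lazy dog",
    "Doc_2": "Quick brown foxes leap over lazy dogs in summer",
}

# term -> set of documents containing it (no normalization yet)
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def score(query):
    """Count how many query terms each document contains."""
    totals = {doc_id: 0 for doc_id in docs}
    for term in query.split():
        for doc_id in index.get(term, ()):
            totals[doc_id] += 1
    return totals

print(score("quick brown"))  # Doc_1 matches both terms, Doc_2 only "brown"
```

Note that without normalization the capitalized "Quick" in Doc_2 is a different term from "quick", which is exactly the problem the next slide addresses.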
11. Inverted index: normalization

After normalization:

Term    | Doc_1 | Doc_2
--------+-------+------
brown   |   X   |   X
dog     |   X   |   X
fox     |   X   |   X
in      |       |   X
jump    |   X   |   X
lazy    |   X   |   X
over    |   X   |   X
quick   |   X   |   X
summer  |       |   X
the     |   X   |
--------+-------+------

Before normalization (for comparison):

Term    | Doc_1 | Doc_2
--------+-------+------
Quick   |       |   X
The     |   X   |
brown   |   X   |   X
dog     |   X   |
dogs    |       |   X
fox     |   X   |
foxes   |       |   X
in      |       |   X
jumped  |   X   |
lazy    |   X   |   X
leap    |       |   X
over    |   X   |   X
quick   |   X   |
summer  |       |   X
the     |   X   |
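The normalization step above (lowercasing, stemming, and folding synonyms such as "leap" into "jump") can be sketched with a toy normalizer; real engines use proper dictionaries and stemmers, so treat this only as an illustration:

```python
# Toy normalization: lowercase, strip a few common English suffixes, and
# fold known synonyms. PostgreSQL and Elasticsearch do this with real
# dictionaries/stemmers; this sketch only illustrates the idea.
SYNONYMS = {"leap": "jump"}

def normalize(token):
    token = token.lower()
    for suffix in ("es", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            token = token[: -len(suffix)]
            break
    return SYNONYMS.get(token, token)

doc_1 = "The quick brown fox jumped over the lazy dog"
doc_2 = "Quick brown foxes leap over lazy dogs in summer"

terms_1 = {normalize(t) for t in doc_1.split()}
terms_2 = {normalize(t) for t in doc_2.split()}

# After normalization both documents share quick/brown/fox/jump/over/lazy/dog,
# so a search for "Quick" or "jumping dogs" can now match both of them.
shared = terms_1 & terms_2
```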
16. Full text search in PostgreSQL
1. Creating tokens
2. Converting tokens into lexemes
3. Storing preprocessed documents

17. Full text search in PostgreSQL
• 27 built-in configurations for 10 languages
• Support of user-defined FTS configurations
• Pluggable dictionaries, parsers
• Inverted indexes
18. Functions to convert normal text to tsvector

explain SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector
        @@ 'cat & rat'::tsquery;
                QUERY PLAN
------------------------------------------
 Result  (cost=0.00..0.01 rows=1 width=0)
(1 row)

explain SELECT 'fat & cow'::tsquery
        @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
-- false
                QUERY PLAN
------------------------------------------
 Result  (cost=0.00..0.01 rows=1 width=0)
(1 row)
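The `@@` match operator in the queries above can be mimicked in Python to make its semantics concrete. This is a toy model covering only `&`-joined lexemes; real tsquery syntax is much richer (`|`, `!`, `<->`, prefix matching):

```python
# A toy model of PostgreSQL's `tsvector @@ tsquery` match for queries
# that only use the & (AND) operator.
def to_tsvector(text):
    """Crude stand-in: the set of whitespace-separated tokens."""
    return set(text.split())

def matches(tsvector, tsquery):
    """True if every &-separated lexeme in tsquery is present."""
    lexemes = [part.strip() for part in tsquery.split("&")]
    return all(lexeme in tsvector for lexeme in lexemes)

doc = to_tsvector("a fat cat sat on a mat and ate a fat rat")
print(matches(doc, "cat & rat"))  # True, as in the first query above
print(matches(doc, "fat & cow"))  # False, as in the second query
```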
19. PostgreSQL: index management

CREATE FUNCTION notes_vector_update() RETURNS TRIGGER AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        new.search_index =
            to_tsvector('pg_catalog.english', COALESCE(NEW.name, ''));
    END IF;
    IF TG_OP = 'UPDATE' THEN
        IF NEW.name <> OLD.name THEN
            new.search_index =
                to_tsvector('pg_catalog.english', COALESCE(NEW.name, ''));
        END IF;
    END IF;
    RETURN NEW;
END
$$ LANGUAGE 'plpgsql';
20. PostgreSQL: stopwords

SELECT to_tsvector('english', 'in the list of stop words');
        to_tsvector
----------------------------
 'list':3 'stop':5 'word':6

/usr/pgsql-9.3/share/tsearch_data/english.stop
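The output above shows two things at once: stop words are dropped, but each surviving lexeme keeps its position in the original token stream. A toy imitation in Python (naive plural stripping stands in for the real dictionary):

```python
# Imitate to_tsvector('english', ...): drop stop words but keep each
# surviving lexeme's position in the ORIGINAL token stream, and apply a
# naive plural strip ("words" -> "word"). Toy code, not Postgres itself.
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the"}

def toy_tsvector(text):
    vector = {}
    for position, token in enumerate(text.lower().split(), start=1):
        if token in STOP_WORDS:
            continue  # dropped, but the position counter keeps advancing
        lexeme = token[:-1] if token.endswith("s") and len(token) > 3 else token
        vector.setdefault(lexeme, []).append(position)
    return vector

print(toy_tsvector("in the list of stop words"))
# -> {'list': [3], 'stop': [5], 'word': [6]}, matching the SQL output above
```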
22. Malcolm Tredinnick's Advice on Writing SQL in Django:

“If you need to write advanced SQL you should write it. I would balance that by cautioning against overuse of the raw() and extra() methods.”
23. PostgreSQL full-text search integration with the Django ORM
https://github.com/linuxlewis/djorm-ext-pgfulltext

from djorm_pgfulltext.models import SearchManager
from djorm_pgfulltext.fields import VectorField
from django.db import models

class Page(models.Model):
    name = models.CharField(max_length=200)
    description = models.TextField()
    search_index = VectorField()

    objects = SearchManager(
        fields=('name', 'description'),
        config='pg_catalog.english',   # this is default
        search_field='search_index',   # this is default
        auto_update_search_field=True
    )
24. For search, just use the search method of the manager
https://github.com/linuxlewis/djorm-ext-pgfulltext

>>> Page.objects.search("documentation & about")
[<Page: Page: Home page>]
>>> Page.objects.search("about | documentation | django | home", raw=True)
[<Page: Page: Home page>, <Page: Page: About>, <Page: Page: Navigation>]
25. Second way

class Page(models.Model):
    name = models.CharField(max_length=200)
    description = models.TextField()

    objects = SearchManager(fields=None, search_field=None)

>>> Page.objects.search("documentation & about",
...                     fields=('name', 'description'))
[<Page: Page: Home page>]
>>> Page.objects.search("about | documentation | django | home", raw=True,
...                     fields=('name', 'description'))
[<Page: Page: Home page>, <Page: Page: About>, <Page: Page: Navigation>]
26. Pros and Cons
Pros:
• Quick implementation
• No extra dependencies
Cons:
• Indexes need to be managed manually
• Not as flexible as dedicated search engines
• Tied to PostgreSQL
• No analytics data
• No DSL, only `&` and `|` queries
• Difficult to manage stop words
35. Adding search functionality to a simple model

$ cat myapp/models.py
from django.db import models
from django.contrib.auth.models import User

class Page(models.Model):
    user = models.ForeignKey(User)
    name = models.CharField(max_length=200)
    description = models.TextField()

    def __unicode__(self):
        return self.name
38. Haystack: creating SearchIndexes

$ cat myapp/search_indexes.py
import datetime
from haystack import indexes
from myapp.models import Note

class PageIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    author = indexes.CharField(model_attr='user')
    pub_date = indexes.DateTimeField(model_attr='pub_date')

    def get_model(self):
        return Note

    def index_queryset(self, using=None):
        """Used when the entire index for the model is updated."""
        return self.get_model().objects.filter(
            pub_date__lte=datetime.datetime.now())
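The `document=True, use_template=True` field tells Haystack to render the indexed text from a data template, which by Haystack's convention lives at `search/indexes/<app_label>/<model_name>_text.txt`. A sketch for the `Note` model above might look like this (the field names are assumptions, since the model's fields aren't shown on the slides):

{# templates/search/indexes/myapp/note_text.txt #}
{{ object.title }}
{{ object.user.get_full_name }}
{{ object.body }}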
39. Haystack: SearchQuerySet API

from haystack.query import SearchQuerySet
from haystack.inputs import Raw

all_results = SearchQuerySet().all()
hello_results = SearchQuerySet().filter(content='hello')
unfriendly_results = SearchQuerySet() \
    .exclude(content='hello') \
    .filter(content='world')

# To send unescaped data:
sqs = SearchQuerySet().filter(title=Raw(trusted_query))
40. Keeping data in sync

# Update everything.
./manage.py update_index --settings=settings.prod

# Update everything with lots of information about what's going on.
./manage.py update_index --settings=settings.prod --verbosity=2

# Update everything, cleaning up after deleted models.
./manage.py update_index --remove --settings=settings.prod

# Update everything changed in the last 2 hours.
./manage.py update_index --age=2 --settings=settings.prod

# Update everything between Dec. 1, 2011 & Dec. 31, 2011.
./manage.py update_index --start='2011-12-01T00:00:00' \
    --end='2011-12-31T23:59:59' --settings=settings.prod
41. Signals

class RealtimeSignalProcessor(BaseSignalProcessor):
    """
    Allows for observing when saves/deletes fire & automatically updates
    the search engine appropriately.
    """
    def setup(self):
        # Naive (listen to all model saves).
        models.signals.post_save.connect(self.handle_save)
        models.signals.post_delete.connect(self.handle_delete)
        # Efficient would be going through all backends & collecting all
        # models being used, then hooking up signals only for those.

    def teardown(self):
        # Naive (listen to all model saves).
        models.signals.post_save.disconnect(self.handle_save)
        models.signals.post_delete.disconnect(self.handle_delete)
        # Efficient would be going through all backends & collecting all
        # models being used, then disconnecting signals only for those.
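To activate a signal processor like the one above, Haystack is pointed at it from Django settings via `HAYSTACK_SIGNAL_PROCESSOR` (a one-line config fragment):

```python
# settings.py — tell Haystack to update indexes in real time on save/delete.
HAYSTACK_SIGNAL_PROCESSOR = 'haystack.signals.RealtimeSignalProcessor'
```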
42. Haystack: Pros and Cons
Pros:
• easy to set up
• looks like the Django ORM, but for search
• search engine independent
• supports 4 engines (Elasticsearch, Solr, Xapian, Whoosh)
Cons:
• poor SearchQuerySet API
• difficult to manage stop words
• some performance is lost because of the extra layer
• model-based
43. Future FTS and Roadmap: Django 1.9
• PostgreSQL Full Text Search (Marc Tamlyn)
https://github.com/django/django/pull/4726
• Custom indexes (Marc Tamlyn)
• etc.