PyCon India 2012: Rapid development of website search in Python
1. Rapid development of
website search in Python
PyCon India,
Bangalore, Sept ’12
Chetan Giridhar
2. For whom!
If you’re
an experienced developer who has
implemented search solutions
currently getting your hands dirty
prototyping website search for your startup
dreading learning Java
just curious...
3. Think web development
Core functionality
Design patterns
Web Interface
Usability
Scalability
Performance
…?
4. Search
Often considered – ‘good to have’
Enhances user experience
Focused information
Relevance
Interaction
Ranked searching
5. Typical Search Engine
Designing a schema
Convert your data into documents and store them in the index
Document is a set of fields
Field is a name=value pair
{title = “python”, content = “computer”,
tag = “language”}
Analyzers
"parse" each field of your data into index-able
"tokens" or keywords.
For "Welcome to Pycon", an analyzer produces the list
["welcome", "to", "pycon"]
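The document and analyzer ideas above can be sketched in plain Python. This is a minimal illustration, not the API of Pylucene or Whoosh: the `analyze` function and the `tokens_by_field` name are assumptions made for the example, and the analyzer simply lowercases and splits on word characters.

```python
import re

def analyze(text):
    """Toy analyzer (an assumption): lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

# A document is a set of name=value fields (values taken from the slide).
document = {"title": "python", "content": "computer", "tag": "language"}

# Analyze every field into index-able tokens.
tokens_by_field = {name: analyze(value) for name, value in document.items()}

print(analyze("Welcome to Pycon"))  # ['welcome', 'to', 'pycon']
```

Real analyzers usually add further steps such as stop-word removal or stemming; those are left out here.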
6. Typical Search Engine
Indexing
Adding documents to the index
Query and query parsers
Prepare query
Parse
Analyze
Searching
Lookup index
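The indexing, query parsing and lookup steps can be sketched with an inverted index. This is a simplified illustration under stated assumptions: `build_index`, `search` and the sample documents are hypothetical, and the query semantics (every query token must match) are one choice among several.

```python
import re
from collections import defaultdict

def analyze(text):
    """Toy analyzer: lowercase and split into word tokens."""
    return re.findall(r"\w+", text.lower())

def build_index(documents):
    """Indexing: map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Parse and analyze the query, then look up the index (AND semantics)."""
    tokens = analyze(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

# Sample documents invented for the example.
documents = {
    1: "Python is a programming language",
    2: "Lucene is a search engine library",
    3: "Whoosh is a Python search library",
}
index = build_index(documents)
print(sorted(search(index, "python search")))  # [3]
```

The query goes through the same analyzer as the documents, which is why "Python" in a query matches "python" in the index.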
7. Indexing & Committing
[Diagram: input files; schema-based document with Field1, Field2, Field3; analyzer; index writer; in-memory index; committed index]
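The split between an in-memory index and a committed index can be sketched as follows. The `Index` and `IndexWriter` classes here are toy stand-ins written for this example, not the classes of Pylucene or Whoosh; the point is only that added documents stay in a buffer until a commit makes them visible.

```python
class Index:
    """Toy index: holds only the documents that have been committed."""

    def __init__(self):
        self.committed = []

class IndexWriter:
    """Toy writer: buffers documents in memory until commit() is called."""

    def __init__(self, index):
        self.index = index
        self.pending = []  # in-memory buffer, invisible to searches

    def add_document(self, **fields):
        self.pending.append(dict(fields))

    def commit(self):
        # Move buffered documents into the committed index and clear the buffer.
        self.index.committed.extend(self.pending)
        self.pending = []

index = Index()
writer = IndexWriter(index)
writer.add_document(title="python", content="computer", tag="language")
visible_before = len(index.committed)
writer.commit()
visible_after = len(index.committed)
print(visible_before, visible_after)  # 0 1
```

Separating the two steps is why indexing time and commit time can be measured independently.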
9. Development: Considerations
Sourcing input data set
Handling input queries
How to search
Search engines
How to display results
Customization
11. Talking Pylucene & Whoosh
Pythonic APIs
Deployment
Large scale and medium sized web sites
Rapid
Minimal installation
Clear documentation
Quick setup
Ease of integration
12. Pylucene
Pylucene: Python wrappers for Lucene
Lucene: an open source, pure Java search engine library
The de facto standard for search engine libraries
Embeds a Java VM with Lucene into a Python process
13. Pylucene
Simple API
High performance indexing
Scalable to millions of documents
Efficient and feature rich search algorithms
Cross platform
14. Whoosh
Whoosh is a search engine library
Fast indexing and search
One of the fastest Python search engines
100% Python code
Extensible code
No external dependency
Active development and support
15. Whoosh
Easy to set up
Neutral to web frameworks
Powerful query language
Feature rich
Intuitive APIs
20. Comparing Engines
Basis of comparison
Indexing, Committing and Searching
Dataset
1 GB data
~5000 files
File sizes ranging from 1 KB to 50 MB
Setup
Intel® Core™2 Duo CPU P8600 @ 2.40GHz × 2
3 GB RAM
Ubuntu Release 12.04 (precise) 32-bit
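A comparison on this basis needs the three phases timed separately. The harness below is a sketch of one way to do that, not the script used for the talk: `timed`, `benchmark` and the stand-in engine callables are all hypothetical, and in practice each callable would wrap the library being measured.

```python
import time

def timed(func, *args):
    """Run func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

def benchmark(index_fn, commit_fn, search_fn, files, query):
    """Time indexing, committing and searching as three separate phases."""
    timings = {}
    _, timings["index"] = timed(index_fn, files)
    _, timings["commit"] = timed(commit_fn)
    hits, timings["search"] = timed(search_fn, query)
    return hits, timings

# Hypothetical stand-in engine; replace these three callables with
# thin wrappers around the library being measured.
pending, committed = [], []

def index_fn(files):
    pending.extend(files)

def commit_fn():
    committed.extend(pending)
    pending.clear()

def search_fn(query):
    return [name for name in committed if query in name]

hits, timings = benchmark(index_fn, commit_fn, search_fn, ["a.txt", "b.txt"], "a")
for phase, seconds in timings.items():
    print(f"{phase}: {seconds:.6f} s")
```

Running the same harness with the same files and queries against each engine keeps the comparison like for like.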
21. Indexing
[Bar chart "Time to Index": time (s) for pylucene and whoosh, axis from 0 to 500]
22. Committing
[Bar chart "Time to Commit": time (s) for pylucene and whoosh, axis from 0 to 300]
23. Searching
[Bar chart "Time to Search": time (s) for pylucene and whoosh, axis from 0 to 0.01]
24. Recommendations
Search Engine Library
No one solution fits all problems
Search engine abstraction is the key
Scalability is critical
Rapid to set up, develop and tweak
Understand and use
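The recommendation to abstract the search engine can be sketched as a small interface that the web application codes against. Everything here is an assumption made for illustration: `SearchBackend`, `InMemoryBackend` and `site_search` are hypothetical names, and a Whoosh or Pylucene adapter would implement the same three methods.

```python
from abc import ABC, abstractmethod

class SearchBackend(ABC):
    """Hypothetical interface the web application codes against."""

    @abstractmethod
    def add_document(self, doc_id, text):
        """Queue a document for indexing."""

    @abstractmethod
    def commit(self):
        """Make queued documents searchable."""

    @abstractmethod
    def search(self, query):
        """Return a list of matching document ids."""

class InMemoryBackend(SearchBackend):
    """Toy backend used here in place of a real engine adapter."""

    def __init__(self):
        self.pending = {}
        self.documents = {}

    def add_document(self, doc_id, text):
        self.pending[doc_id] = text

    def commit(self):
        self.documents.update(self.pending)
        self.pending = {}

    def search(self, query):
        words = query.lower().split()
        return sorted(
            doc_id
            for doc_id, text in self.documents.items()
            if all(word in text.lower().split() for word in words)
        )

def site_search(backend, query):
    """Application code depends only on the SearchBackend interface."""
    return backend.search(query)

backend = InMemoryBackend()
backend.add_document("talk", "Rapid development of website search in Python")
backend.add_document("faq", "Frequently asked questions")
backend.commit()
print(site_search(backend, "python search"))  # ['talk']
```

With this shape, swapping engines as the site grows means writing a new adapter rather than touching the application code.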
25. Web development in Python
Getting rapid and easier by the day
Web frameworks
Django, Pylons
HTTP Servers
Tornado, Gunicorn
Support for SQL/NoSQL databases
MySQL-python, pymongo
Template Engines
Cheetah, jinja2
Search
Pylucene, Whoosh
30. Whoosh v/s Haystack v/s Xapian
• Whoosh is suitable for a small project; limited scalability for search and indexing
– A good beginning
• Haystack is appropriate with Django
• Xapian is ultra fast, but is not as feature rich as Solr
• Lucene is not distributed; has external dependencies
31. Lucene v/s Database search
• There are a number of query types that RDBMSs in general do not support without vendor extensions:
– Fuzzy queries, in which "fuzzy" and "wuzzy" are considered matches
– Word stemming queries, which consider "take," "took," and "taken" to be identical
– Sound-like queries, which consider "cat" and "kat" to be identical
– Synonym queries, which consider "jump," "hop," and "leap" to be identical
– Queries on binary BLOB data types, such as PDF documents, Microsoft Word or Excel documents, or HTML and XML documents
• More disappointingly, SQL search results are not ranked by match-relevance scores. The SQL standard is simply not intended for full-text querying.
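To make the fuzzy-query case concrete, here is a minimal sketch based on edit distance. It is an illustration only and not how Lucene implements fuzzy matching internally: `edit_distance`, `fuzzy_match`, the `max_distance` threshold and the sample vocabulary are assumptions for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

def fuzzy_match(term, vocabulary, max_distance=1):
    """Return the indexed words within max_distance edits of the query term."""
    return [word for word in vocabulary if edit_distance(term, word) <= max_distance]

print(fuzzy_match("wuzzy", ["fuzzy", "python", "search"]))  # ['fuzzy']
```

A plain SQL equality or `LIKE` comparison has no equivalent of this tolerance without vendor extensions.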
32. Typical search engine
• Indexing
– Convert files to a format for quick lookup
– Fast random access to stored words
• Searching
– Specify keywords
• Displaying
– Look up documents that are relevant
– Ranking
– Different types of queries
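The ranking step can be sketched with a very simple score: count how often the query terms occur in each document and sort by that count. This is an assumption chosen for brevity; real engines use more refined scoring such as TF-IDF or BM25. The `rank` function and the sample documents are hypothetical.

```python
import re
from collections import Counter

def analyze(text):
    """Toy analyzer: lowercase and split into word tokens."""
    return re.findall(r"\w+", text.lower())

def rank(documents, query):
    """Return (doc_id, score) pairs, best match first.

    The score is the number of occurrences of query terms in the document.
    """
    query_tokens = analyze(query)
    scored = []
    for doc_id, text in documents.items():
        counts = Counter(analyze(text))
        score = sum(counts[token] for token in query_tokens)
        if score > 0:
            scored.append((doc_id, score))
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

# Sample documents invented for the example.
documents = {
    "a.txt": "python search with python tools",
    "b.txt": "search engines in java",
    "c.txt": "cooking recipes",
}
print(rank(documents, "python search"))  # [('a.txt', 3), ('b.txt', 1)]
```

Documents with no matching term are dropped, so the displayed list contains only relevant results in ranked order.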