Rapid development of website search in Python
PyCon India, Bangalore, Sept’ 12
Chetan Giridhar

PyCon India 2012: Rapid development of website search in python

  1. Rapid development of website search in Python. PyCon India, Bangalore, Sept’ 12. Chetan Giridhar
  2. For whom! If you're:
     • an experienced developer who has implemented search solutions
     • currently dirtying your hands prototyping website search for your startup
     • dreading to learn Java
     • just curious
  3. Think web development
     • Core functionality
     • Design patterns
     • Web interface
     • Usability
     • Scalability
     • Performance
     • …?
  4. Search
     • Often considered "good to have"
     • Enhances user experience
     • Focused information
     • Relevance
     • Interaction
     • Ranked searching
  5. Typical Search Engine
     • Designing a schema
       – Convert your data into Documents and store them in the index
       – A Document is a set of fields
       – A Field is a name=value pair, e.g. {title = "python", content = "computer", tag = "language"}
     • Analyzers
       – "Parse" each field of your data into indexable "tokens" or keywords
       – "Welcome to Pycon" produces the list ["welcome", "to", "Pycon"]
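The document/field/analyzer vocabulary from slide 5 can be illustrated with a few lines of plain Python. This is only a minimal sketch, not the analyzer of PyLucene or Whoosh: the name `simple_analyzer` is hypothetical, and it lowercases every token, so the last token comes out as `pycon` rather than the slide's `Pycon`.

```python
import re


def simple_analyzer(text):
    """Toy analyzer (hypothetical): split text into lowercase word tokens."""
    return [token.lower() for token in re.findall(r"\w+", text)]


# A document is a set of name=value fields, as on the slide.
document = {"title": "python", "content": "computer", "tag": "language"}

# Run every field through the analyzer to get its indexable tokens.
tokens_by_field = {name: simple_analyzer(value) for name, value in document.items()}

print(simple_analyzer("Welcome to Pycon"))  # ['welcome', 'to', 'pycon']
```

Real analyzers usually chain further steps (stop-word removal, stemming) after tokenizing; those are left out here.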
  6. Typical Search Engine
     • Indexing
       – Adding documents to the index
     • Query and query parsers
       – Prepare query
       – Parse
       – Analyze
     • Searching
       – Lookup index
  7. Indexing & Committing [Diagram labels: Input files; Schema based document (Field1, Field2, Field3); Analyzer; Index Writer; In-memory Index; Committed]
  8. Searching [Diagram labels: Input query; Query Parser; Analyzer; Index Searcher; Index; Results]
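The indexing, committing and searching stages on slides 6 to 8 can be sketched as a toy inverted index. Everything here is an assumption made for illustration: `TinyIndex` and its methods are hypothetical names, the analyzer is a plain lowercase split, and a query matches only documents containing all of its tokens. Neither PyLucene nor Whoosh is claimed to work this way internally.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index (hypothetical): token -> set of document ids."""

    def __init__(self):
        self._committed = defaultdict(set)  # postings visible to searches
        self._pending = []                  # buffered (doc_id, text) pairs

    @staticmethod
    def analyze(text):
        # Stand-in analyzer: lowercase and split on whitespace.
        return text.lower().split()

    def add_document(self, doc_id, text):
        # Indexing: buffer the document; it is not searchable yet.
        self._pending.append((doc_id, text))

    def commit(self):
        # Committing: fold buffered documents into the searchable postings.
        for doc_id, text in self._pending:
            for token in self.analyze(text):
                self._committed[token].add(doc_id)
        self._pending.clear()

    def search(self, query):
        # Searching: analyze the query, then intersect the posting sets.
        tokens = self.analyze(query)
        if not tokens:
            return set()
        postings = [self._committed.get(token, set()) for token in tokens]
        return set.intersection(*postings)


index = TinyIndex()
index.add_document("a.txt", "Python search engine")
index.add_document("b.txt", "Java search library")
before = index.search("search")        # empty: nothing committed yet
index.commit()
after = index.search("search")         # both documents
both = index.search("python search")   # only a.txt
```

Separating `add_document` from `commit` mirrors the in-memory and committed stages in the diagram: documents become searchable only after the commit.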
  9. Development: Considerations
     • Sourcing input data set
     • Handling input queries
     • How to search
     • Search engines
     • How to display results
     • Customization
  10. Development: Options
      • Apache Solr: Sunburnt
      • Haystack
      • Xapian: Xappy
      • Elastic Search
      • Whoosh
      • Lucene: Pylucene
  11. Talking Pylucene & Whoosh
      • Pythonic APIs
      • Deployment: large scale and medium sized web sites
      • Rapid: minimal installation, clear documentation, quick setup, ease of integration
  12. Pylucene
      • Pylucene: Python wrappers to Lucene
      • Lucene: an open source, pure Java search engine library
      • The de facto standard among search engine libraries
      • Embeds a Java VM with Lucene into a Python process
  13. Pylucene
      • Simple API
      • High performance indexing
      • Scalable to millions of documents
      • Efficient and feature rich search algorithms
      • Cross platform
  14. Whoosh
      • Whoosh is a search engine library
      • Fast indexing and search
      • One of the fastest Python search engines
      • 100% Python code
      • Extensible code
      • No external dependency
      • Active development and support
  15. Whoosh
      • Easy to set up
      • Neutral to web frameworks
      • Powerful query language
      • Feature rich
      • Intuitive APIs
  16. PyLucene and Whoosh
      • PyLucene: Document, Field, IndexWriter, QueryParser, Analyzer, IndexSearcher
      • Whoosh: fields.Schema, index.Index, qparser.QueryParser, analysis.Analyzer, searching.Searcher
  17. Designing search in websites
      • Search design should be:
        – An independent component
        – Pluggable
        – Platform independent
        – Assume minimal external dependency
        – Easily extendible
        – Seamless integration
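One way to read "independent component" and "pluggable" on slide 17 is to hide the engine behind a small interface, so the website code never imports the engine directly. The sketch below is an assumption about what such a layer might look like, not the talk's actual `Search.py`: `SearchBackend`, `InMemoryBackend` and `SiteSearch` are hypothetical names, and the in-memory backend does naive substring matching purely as a stand-in for a real engine.

```python
from abc import ABC, abstractmethod


class SearchBackend(ABC):
    """Hypothetical interface that any engine adapter would implement."""

    @abstractmethod
    def index(self, doc_id, text):
        """Make one document searchable."""

    @abstractmethod
    def search(self, query):
        """Return the ids of matching documents."""


class InMemoryBackend(SearchBackend):
    """Stand-in backend: case-insensitive substring match over stored text."""

    def __init__(self):
        self._docs = {}

    def index(self, doc_id, text):
        self._docs[doc_id] = text.lower()

    def search(self, query):
        needle = query.lower()
        return sorted(doc_id for doc_id, text in self._docs.items() if needle in text)


class SiteSearch:
    """Website-facing component; it depends only on the interface."""

    def __init__(self, backend):
        self._backend = backend

    def add_page(self, url, body):
        self._backend.index(url, body)

    def find(self, query):
        return self._backend.search(query)


site = SiteSearch(InMemoryBackend())
site.add_page("/about", "About our Python team")
site.add_page("/blog", "Search in websites")
results = site.find("python")
```

Swapping engines then means writing another `SearchBackend` subclass and passing it to `SiteSearch`, with no change to the website code.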
  18. Search.py, fsMgr
  19. Demo
  20. Comparing Engines
      • Basis of comparison
        – Indexing, Committing and Searching
      • Dataset
        – 1 GB data
        – ~5000 files
        – File sizes ranging from 1 KB to 50 MB
      • Setup
        – Intel® Core™2 Duo CPU P8600 @ 2.40GHz × 2
        – 3 GB RAM
        – Ubuntu Release 12.04 (precise) 32-bit
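The slides do not show how the phases were timed. A minimal sketch of one way to time them separately, assuming wall-clock measurement with `time.perf_counter`, is given below. The helper `time_phase` and the stand-in functions `build_index` and `lookup` are hypothetical; a real comparison would call the PyLucene or Whoosh indexing, commit and search steps in their place.

```python
import time


def time_phase(func, *args):
    """Run func(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def build_index(docs):
    # Stand-in for an engine's indexing step: token -> set of document ids.
    index = {}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index.setdefault(token, set()).add(doc_id)
    return index


def lookup(index, term):
    # Stand-in for an engine's search step.
    return index.get(term.lower(), set())


docs = {"a.txt": "rapid website search", "b.txt": "search in python"}
index, index_seconds = time_phase(build_index, docs)
hits, search_seconds = time_phase(lookup, index, "search")
print(f"index: {index_seconds:.6f}s  search: {search_seconds:.6f}s  hits: {sorted(hits)}")
```

Timing each phase on its own keeps indexing cost, commit cost and query latency from being folded into a single number.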
  21. Indexing [Chart: Time to Index, pylucene vs. whoosh; y-axis: time (s), 0 to 500]
  22. Committing [Chart: Time to Commit, pylucene vs. whoosh; y-axis: time (s), 0 to 300]
  23. Searching [Chart: Time to Search, pylucene vs. whoosh; y-axis: time (s), 0 to 0.01]
  24. Recommendations
      • Search Engine Library
        – No one solution fits all problems
        – Search engine abstraction is the key
        – Scalability is critical
        – Rapid to set up, develop and tweak
        – Understand and use
  25. Web development in Python
      • Getting rapid and easier by the day
      • Web frameworks: Django, Pylons
      • HTTP servers: Tornado, Gunicorn
      • Support for SQL/NoSQL databases: MySQL-python, pymongo
      • Template engines: Cheetah, jinja2
      • Search: Pylucene, Whoosh
  26. References
      • Whoosh: https://bitbucket.org/mchaput/whoosh/wiki/Home
      • Pylucene: http://lucene.apache.org/pylucene/ and http://lucene.apache.org/core/3_6_1/api/all/index.html
      • Xappy: http://code.google.com/p/xappy/
      • ElasticSearch: http://www.elasticsearch.org/guide/reference/api/
  27. References
      • Chetan’s tech space: http://technobeans.com
      • Vishal’s technical blog: http://freethreads.net
  28. Q and A
  29. Backup
  30. Whoosh v/s Haystack v/s Xapian
      • Whoosh is suitable for a small project. Limited scalability for search and indexing
        – A good beginning
      • Haystack is appropriate with Django
      • Xapian is ultra fast, but is not as feature rich as Solr
      • Lucene is not distributed; has external dependency
  31. Lucene v/s Database search
      • There are a number of query types that RDBMSs in general do not support without vendor extensions:
        – Fuzzy queries, in which "fuzzy" and "wuzzy" are considered matches
        – Word stemming queries, which consider "take," "took," and "taken" to be identical
        – Sound-like queries, which consider "cat" and "kat" to be identical
        – Synonym queries, which consider "jump," "hop," and "leap" to be identical
        – Queries on binary BLOB data types, such as PDF documents, Microsoft Word or Excel documents, or HTML and XML documents
      • More disappointingly, SQL search results are not ranked by match-relevance scores. The SQL standard is simply not intended for full-text querying.
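The fuzzy-query idea from slide 31 ("fuzzy" and "wuzzy" count as a match) can be approximated with Python's standard `difflib` module. This is a stand-in to show the behaviour, not how Lucene implements fuzzy queries; the `vocabulary` list and the `fuzzy_lookup` helper are hypothetical.

```python
import difflib

# Hypothetical list of indexed terms.
vocabulary = ["fuzzy", "python", "search", "index"]


def fuzzy_lookup(term, words, cutoff=0.6):
    """Return up to three indexed terms whose similarity ratio to term is >= cutoff."""
    return difflib.get_close_matches(term, words, n=3, cutoff=cutoff)


matches = fuzzy_lookup("wuzzy", vocabulary)
print(matches)  # ['fuzzy']
```

A plain SQL equality or `LIKE` comparison would miss this match, which is the gap the slide points at.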
  32. Typical search engine
      • Indexing
        – Convert files to a format for quick look up
        – Fast random access to stored words
      • Searching
        – Specify keywords
      • Displaying
        – Lookup documents that are relevant
        – Ranking
        – Different types of queries
  33. Advanced Searching
      • Morelikethis
      • didyoumean
