Rapid development of website search in Python
PyCon India, Bangalore, Sept ’12
Chetan Giridhar
PyCon India 2012: Rapid development of website search in python

Published in: Entertainment & Humor
  • Whoosh? If you love Python more than learning Java.
  • Transcript of "PyCon India 2012: Rapid development of website search in python"

    1. Rapid development of website search in Python. PyCon India, Bangalore, Sept ’12. Chetan Giridhar
    2. For whom?
       • If you are:
         – an experienced developer who has implemented search solutions
         – currently getting your hands dirty prototyping website search for your startup
         – dreading having to learn Java
       • Or just curious
    3. Think web development
       • Core functionality
       • Design patterns
       • Web interface
       • Usability
       • Scalability
       • Performance
       • …?
    4. Search
       • Often considered a 'good to have'
       • Enhances user experience
       • Focused information
       • Relevance
       • Interaction
       • Ranked searching
    5. Typical Search Engine
       • Designing a schema
         – Convert your data into Documents and store them in the index
         – A Document is a set of fields
         – A Field is a name=value pair, e.g. {title = "python", content = "computer", tag = "language"}
       • Analyzers
         – "Parse" each field of your data into index-able "tokens" or keywords
         – "Welcome to Pycon" produces the list ["welcome", "to", "pycon"]
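The document, field, and analyzer ideas on slide 5 can be sketched in a few lines of standard-library Python. This is only an illustration: `analyze` is a hypothetical toy analyzer (lowercase, then split on word characters), not the analyzer that PyLucene or Whoosh actually ships.

```python
import re


def analyze(text):
    """Toy analyzer (an assumption): lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


# A document is a set of fields; each field is a name=value pair.
document = {"title": "python", "content": "computer", "tag": "language"}

# Run every field value through the analyzer to get index-able tokens.
tokens_by_field = {name: analyze(value) for name, value in document.items()}

print(analyze("Welcome to Pycon"))  # ['welcome', 'to', 'pycon']
```

Real analyzers usually do more than this, for example dropping stop words or stemming, so the exact token list depends on the analyzer chosen.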
    6. Typical Search Engine
       • Indexing
         – Adding documents to the index
       • Query and query parsers
         – Prepare query
         – Parse
         – Analyze
       • Searching
         – Lookup index
    7. Indexing & Committing
       [Diagram labels: Input files; Schema based document (Field1, Field2, Field3); Analyzer; Index Writer; In-memory Index; Committed]
    8. Searching
       [Diagram labels: Input query; Query Parser; Analyzer; Index Searcher; Index; Results]
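The indexing, committing, and searching flow on slides 6 to 8 can be sketched as a tiny in-memory inverted index. Everything here (`TinyIndex`, `add_document`, `commit`, `search`) is a hypothetical stand-in written for illustration; it is not the PyLucene or Whoosh API, and the query handling assumes simple AND semantics over analyzed tokens.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer (an assumption): lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


class TinyIndex:
    """Hypothetical in-memory inverted index with an explicit commit step."""

    def __init__(self):
        self._pending = []                 # added but not yet committed
        self._docs = []                    # committed documents; position = doc id
        self._postings = defaultdict(set)  # token -> ids of documents containing it

    def add_document(self, **fields):
        """Queue a document (a set of name=value fields) for indexing."""
        self._pending.append(fields)

    def commit(self):
        """Analyze queued documents and make them searchable."""
        for fields in self._pending:
            doc_id = len(self._docs)
            self._docs.append(fields)
            for value in fields.values():
                for token in analyze(value):
                    self._postings[token].add(doc_id)
        self._pending.clear()

    def search(self, query):
        """Parse and analyze the query, then look up documents matching all tokens."""
        tokens = analyze(query)
        if not tokens:
            return []
        ids = set.intersection(*(self._postings.get(t, set()) for t in tokens))
        return [self._docs[i] for i in sorted(ids)]


index = TinyIndex()
index.add_document(title="python", content="computer", tag="language")
index.add_document(title="java", content="computer", tag="language")
before_commit = index.search("python")   # nothing is searchable until commit
index.commit()
hits = index.search("computer language")  # both documents match
```

The split between `add_document` and `commit` mirrors the in-memory versus committed index on slide 7; production libraries also persist the index and rank the results, which this sketch leaves out.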
    9. Development: Considerations
       • Sourcing input data set
       • Handling input queries
       • How to search
       • Search engines
       • How to display results
       • Customization
    10. Development: Options
       • Apache Solr: Sunburnt
       • Haystack
       • Xapian: Xappy
       • ElasticSearch
       • Whoosh
       • Lucene: PyLucene
    11. Talking PyLucene & Whoosh
       • Pythonic APIs
       • Deployment
         – Large scale and medium sized web sites
       • Rapid
         – Minimal installation
         – Clear documentation
         – Quick setup
         – Ease of integration
    12. PyLucene
       • PyLucene: Python wrappers to Lucene
       • The de facto standard for search engine libraries
       • Lucene: an open source, pure Java search engine library
       • Embeds a Java VM with Lucene into a Python process
    13. PyLucene
       • Simple API
       • High performance indexing
       • Scalable to millions of documents
       • Efficient and feature-rich search algorithms
       • Cross platform
    14. Whoosh
       • Whoosh is a search engine library
       • Fast indexing and search
       • One of the fastest Python search engines
       • 100% Python code
       • Extensible code
       • No external dependency
       • Active development and support
    15. Whoosh
       • Easy to set up
       • Neutral to web frameworks
       • Powerful query language
       • Feature rich
       • Intuitive APIs
    16. PyLucene / Whoosh
       • PyLucene: Document, Field, IndexWriter, QueryParser, Analyzer, IndexSearcher
       • Whoosh: fields.Schema, index.Index, qparser.QueryParser, analysis.Analyzer, searching.Searcher
    17. Designing search in websites
       • Search design should be:
         – An independent component
         – Pluggable
         – Platform independent
         – Assume minimal external dependency
         – Easily extendible
         – Seamless integration
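One way to read the "independent, pluggable component" goal on slide 17 is to hide the search engine behind a small interface, so the website code never talks to a particular library directly. The sketch below is an assumption about how that could look: `SearchBackend`, `SubstringBackend`, and `SiteSearch` are hypothetical names, and the substring matcher only stands in for a real engine such as PyLucene or Whoosh.

```python
from abc import ABC, abstractmethod


class SearchBackend(ABC):
    """Hypothetical interface that every pluggable engine adapter would implement."""

    @abstractmethod
    def index(self, doc_id, text):
        """Make the text searchable under doc_id."""

    @abstractmethod
    def search(self, query):
        """Return the ids of matching documents."""


class SubstringBackend(SearchBackend):
    """Stand-in engine: case-insensitive substring matching, no external dependency."""

    def __init__(self):
        self._docs = {}

    def index(self, doc_id, text):
        self._docs[doc_id] = text.lower()

    def search(self, query):
        needle = query.lower()
        return sorted(doc_id for doc_id, text in self._docs.items() if needle in text)


class SiteSearch:
    """Website-facing component; it only knows the SearchBackend interface."""

    def __init__(self, backend):
        self._backend = backend

    def add_page(self, url, body):
        self._backend.index(url, body)

    def find(self, query):
        return self._backend.search(query)


site = SiteSearch(SubstringBackend())
site.add_page("/about", "About PyCon India")
site.add_page("/talks", "Talks on Python search")
results = site.find("python")
```

Swapping engines then means writing another `SearchBackend` subclass and passing it to `SiteSearch`, with no change to the pages that call `find`.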
    18. Search.py, fsMgr
    19. Demo
    20. Comparing Engines
       • Basis of comparison
         – Indexing, committing and searching
       • Dataset
         – 1 GB data
         – ~5000 files
         – File sizes ranging from 1 KB to 50 MB
       • Setup
         – Intel® Core™2 Duo CPU P8600 @ 2.40GHz × 2
         – 3 GB RAM
         – Ubuntu Release 12.04 (precise) 32-bit
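A comparison along the three axes on slide 20 (indexing, committing, searching) could be timed with a small harness like the one below. This is a sketch under assumptions, not the benchmark used in the talk: `time_call` and `benchmark` are hypothetical helpers, and the list-based "engine" at the bottom exists only so the example runs without PyLucene or Whoosh installed.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def benchmark(add_all, commit, search, query):
    """Time the three phases separately for one engine."""
    timings = {}
    _, timings["index"] = time_call(add_all)
    _, timings["commit"] = time_call(commit)
    hits, timings["search"] = time_call(search, query)
    return hits, timings


# Stand-in "engine" (an assumption) so the harness can be exercised.
pending, committed = [], []


def add_all():
    pending.extend(["welcome to pycon", "python search"])


def commit():
    committed.extend(pending)
    pending.clear()


def search(query):
    return [doc for doc in committed if query in doc]


hits, timings = benchmark(add_all, commit, search, "python")
```

To compare real engines, each would supply its own `add_all`, `commit`, and `search` callables over the same dataset, and the run would be repeated several times to smooth out noise.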
    21. Indexing
       [Chart: Time to Index; y-axis: time (s), scale 0 to 500; series: pylucene, whoosh]
    22. Committing
       [Chart: Time to Commit; y-axis: time (s), scale 0 to 300; series: pylucene, whoosh]
    23. Searching
       [Chart: Time to Search; y-axis: time (s), scale 0 to 0.01; series: pylucene, whoosh]
    24. Recommendations
       • Search engine library
         – No one solution fits all problems
         – Search engine abstraction is the key
         – Scalability is critical
         – Rapid to set up, develop and tweak
         – Understand and use
    25. Web development in Python
       • Getting faster and easier by the day
       • Web frameworks
         – Django, Pylons
       • HTTP servers
         – Tornado, Gunicorn
       • Support for SQL/NoSQL databases
         – MySQL-python, pymongo
       • Template engines
         – Cheetah, Jinja2
       • Search
         – PyLucene, Whoosh
    26. References
       • Whoosh
         – https://bitbucket.org/mchaput/whoosh/wiki/Home
       • PyLucene
         – http://lucene.apache.org/pylucene/
         – http://lucene.apache.org/core/3_6_1/api/all/index.html
       • Xappy
         – http://code.google.com/p/xappy/
       • ElasticSearch
         – http://www.elasticsearch.org/guide/reference/api/
    27. References
       • Chetan’s tech space
         – http://technobeans.com
       • Vishal’s technical blog
         – http://freethreads.net
    28. Q and A
    29. Backup
    30. Whoosh vs Haystack vs Xapian
       • Whoosh is suitable for a small project. Limited scalability for search and indexing
         – A good beginning
       • Haystack is appropriate with Django
       • Xapian is ultra fast, but is not as feature rich as Solr
       • Lucene is not distributed; has external dependency
    31. Lucene vs Database search
       • There are a number of query types that RDBMSs in general do not support without vendor extensions:
         – Fuzzy queries, in which "fuzzy" and "wuzzy" are considered matches
         – Word stemming queries, which consider "take," "took," and "taken" to be identical
         – Sound-like queries, which consider "cat" and "kat" to be identical
         – Synonym queries, which consider "jump," "hop," and "leap" to be identical
         – Queries on binary BLOB data types, such as PDF documents, Microsoft Word or Excel documents, or HTML and XML documents
       • More disappointingly, SQL search results are not ranked by match-relevance scores. The SQL standard is simply not intended for full-text querying.
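The fuzzy-query idea on slide 31 can be approximated with Python's standard `difflib` module. This is a sketch only: `fuzzy_lookup` and the sample vocabulary are hypothetical, and `difflib`'s similarity ratio is used here as a stand-in rather than the fuzzy matching a search library implements.

```python
import difflib


def fuzzy_lookup(term, vocabulary, cutoff=0.6):
    """Return indexed terms that are similar to the query term.

    cutoff is difflib's similarity threshold (0.0 to 1.0); 0.6 is its default.
    """
    return difflib.get_close_matches(term.lower(), vocabulary, n=5, cutoff=cutoff)


# Hypothetical vocabulary of terms already present in an index.
vocabulary = ["wuzzy", "python", "search", "index"]

matches = fuzzy_lookup("fuzzy", vocabulary)
print(matches)  # ['wuzzy']
```

The same lookup over an index vocabulary is one simple way to approach a "did you mean" suggestion; raising `cutoff` makes the matching stricter.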
    32. Typical search engine
       • Indexing
         – Convert files to a format for quick look up
         – Fast random access to stored words
       • Searching
         – Specify keywords
       • Displaying
         – Lookup documents that are relevant
         – Ranking
         – Different types of queries
    33. Advanced Searching
       • More like this
       • Did you mean