PyCon India 2012: Rapid development of website search in python

  • Whoosh? If you love Python more than learning Java.

Transcript

  • 1. Rapid development of website search in Python. PyCon India, Bangalore, Sept '12. Chetan Giridhar
  • 2. For whom! If you are: • an experienced developer who has implemented search solutions • currently dirtying your hands prototyping website search for your startup • dreading to learn Java • just curious
  • 3. Think web development • Core functionality • Design patterns • Web interface • Usability • Scalability • Performance • …?
  • 4. Search • Often considered 'good to have' • Enhances user experience • Focused information • Relevance • Interaction • Ranked searching
  • 5. Typical Search Engine • Designing a schema – Convert your data into documents and store them in the index – A document is a set of fields – A field is a name=value pair, e.g. {title = "python", content = "computer", tag = "language"} • Analyzers – Parse each field of your data into indexable "tokens" or keywords – Given "Welcome to Pycon", an analyzer produces the list ["welcome", "to", "Pycon"]
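The schema and analyzer ideas on this slide can be sketched in plain Python. This is a toy illustration, not the Whoosh or PyLucene API: the `analyze` function and the `tokens_by_field` name are hypothetical, and lowercasing every token is an assumed normalization (the slide's example output keeps the capital P in "Pycon").

```python
import re


def analyze(text):
    """Toy analyzer: split on non-word characters and lowercase each token."""
    return [token.lower() for token in re.findall(r"\w+", text)]


# A document is a set of name=value fields, as on the slide.
document = {"title": "python", "content": "computer", "tag": "language"}

# Run the analyzer over every field to get index-able tokens per field.
tokens_by_field = {name: analyze(value) for name, value in document.items()}

print(analyze("Welcome to Pycon"))  # ['welcome', 'to', 'pycon']
```

Real analyzers usually chain more steps (stop-word removal, stemming); this sketch only shows the tokenizing step.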
  • 6. Typical Search Engine • Indexing – Adding documents to the index • Query and query parsers – Prepare query – Parse – Analyze • Searching – Look up the index
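As a rough sketch of the indexing, query analysis and lookup steps listed on this slide, the toy inverted index below maps each token to the documents that contain it. `ToyIndex` and its methods are hypothetical names for illustration, and treating a multi-word query as an AND of its tokens is an assumption, not something the slides specify.

```python
import re
from collections import defaultdict


def analyze(text):
    """Split text into lowercase word tokens."""
    return [token.lower() for token in re.findall(r"\w+", text)]


class ToyIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # token -> ids of documents containing it

    def add_document(self, doc_id, **fields):
        # Indexing: analyze every field value and record the document under each token.
        for value in fields.values():
            for token in analyze(value):
                self.postings[token].add(doc_id)

    def search(self, query):
        # Query handling: analyze the query, then intersect the posting sets.
        tokens = analyze(query)
        if not tokens:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        return sorted(hits)


index = ToyIndex()
index.add_document(1, title="python", content="computer language")
index.add_document(2, title="java", content="computer language on the JVM")
print(index.search("computer language"))  # [1, 2]
print(index.search("Python"))             # [1]
```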
  • 7. Indexing & Committing [diagram labels: Input files, Schema based document (Field1, Field2, Field3), Analyzer, Index Writer, In-memory Index, Committed Index]
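The diagram on this slide separates adding documents to an in-memory index from committing them. A minimal sketch of that split, assuming a writer that buffers documents and only makes them searchable on commit; `ToyWriter` is a hypothetical class, not the PyLucene or Whoosh writer.

```python
class ToyWriter:
    def __init__(self, committed):
        self.committed = committed  # dict: token -> set of doc ids, visible to searchers
        self.pending = []           # in-memory buffer of (doc_id, tokens) not yet committed

    def add_document(self, doc_id, text):
        # Buffer the analyzed document; nothing is searchable yet.
        self.pending.append((doc_id, text.lower().split()))

    def commit(self):
        # Merge the buffered documents into the committed index, then empty the buffer.
        for doc_id, tokens in self.pending:
            for token in tokens:
                self.committed.setdefault(token, set()).add(doc_id)
        self.pending.clear()


committed_index = {}
writer = ToyWriter(committed_index)
writer.add_document(1, "Welcome to Pycon")
print(committed_index)           # {} (nothing committed yet)
writer.commit()
print(committed_index["pycon"])  # {1}
```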
  • 8. Searching [diagram labels: Input query, Query Parser, Analyzer, Index Searcher, Index, Results]
  • 9. Development: Considerations • Sourcing input data set • Handling input queries • How to search • Search engines • How to display results • Customization
  • 10. Development: Options • Apache Solr: Sunburnt • Haystack • Xapian: Xappy • Elastic Search • Whoosh • Lucene: Pylucene
  • 11. Talking Pylucene & Whoosh • Pythonic APIs • Deployment – Large scale and medium sized web sites • Rapid – Minimal installation – Clear documentation – Quick setup – Ease of integration
  • 12. Pylucene • Pylucene: Python wrappers to Lucene • The de facto standard among search engine libraries • Lucene: an open source, pure Java search engine library • Embeds a Java VM with Lucene into a Python process
  • 13. Pylucene • Simple API • High performance indexing • Scalable to millions of documents • Efficient and feature rich search algorithms • Cross platform
  • 14. Whoosh • Whoosh is a search engine library • Fast indexing and search • One of the fastest Python search engines • 100% Python code • Extensible code • No external dependency • Active development and support
  • 15. Whoosh • Easy to set up • Neutral to web frameworks • Powerful query language • Feature rich • Intuitive APIs
  • 16. PyLucene: Document • Field • IndexWriter • QueryParser • Analyzer • IndexSearcher. Whoosh: fields.Schema • index.Index • qparser.QueryParser • analysis.Analyzer • searching.Searcher
  • 17. Designing search in websites • Search design should be: – An independent component – Pluggable – Platform independent – Assume minimal external dependency – Easily extendible – Seamless integration
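One way to read "independent component" and "pluggable" is a thin interface that the website codes against, with the engine hidden behind it. The sketch below is an assumption about how that could look, not the design shown in the talk: `SearchBackend`, `InMemoryBackend` and `SiteSearch` are hypothetical names, and the in-memory backend does naive substring matching purely as a stand-in for a real engine.

```python
from abc import ABC, abstractmethod


class SearchBackend(ABC):
    """Interface the website depends on; engines plug in behind it."""

    @abstractmethod
    def index(self, doc_id, text):
        """Make the text findable under doc_id."""

    @abstractmethod
    def search(self, query):
        """Return a list of matching doc ids."""


class InMemoryBackend(SearchBackend):
    """Stand-in backend: case-insensitive substring matching over stored text."""

    def __init__(self):
        self._docs = {}

    def index(self, doc_id, text):
        self._docs[doc_id] = text.lower()

    def search(self, query):
        needle = query.lower()
        return sorted(doc_id for doc_id, text in self._docs.items() if needle in text)


class SiteSearch:
    """Website-facing component; it only knows the SearchBackend interface."""

    def __init__(self, backend):
        self.backend = backend

    def add_page(self, url, body):
        self.backend.index(url, body)

    def find(self, query):
        return self.backend.search(query)


site = SiteSearch(InMemoryBackend())
site.add_page("/talks/search", "Rapid development of website search in Python")
site.add_page("/talks/java", "Learning Java")
print(site.find("python"))  # ['/talks/search']
```

Swapping engines then means writing another `SearchBackend` subclass, with no change to `SiteSearch` or the pages that call it.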
  • 18. Search.py fsMgr
  • 19. Demo
  • 20. Comparing Engines • Basis of comparison – Indexing, committing and searching • Dataset – 1 GB data – ~5000 files – File sizes ranging from 1 KB to 50 MB • Setup – Intel® Core™2 Duo CPU P8600 @ 2.40GHz × 2 – 3 GB RAM – Ubuntu Release 12.04 (precise) 32-bit
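A comparison like this needs the three phases timed separately. The harness below is a hedged sketch of one way to do that, not the script used for the talk: it assumes each engine is wrapped in an object exposing `add(doc)`, `commit()` and `search(query)`, and `ListEngine` is a hypothetical stand-in used only to make the example runnable.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def benchmark(engine, documents, query):
    """Time indexing, committing and searching as three separate phases."""

    def index_all():
        for doc in documents:
            engine.add(doc)

    _, index_s = timed(index_all)
    _, commit_s = timed(engine.commit)
    hits, search_s = timed(engine.search, query)
    return {"index_s": index_s, "commit_s": commit_s, "search_s": search_s, "hits": hits}


class ListEngine:
    """Hypothetical stand-in engine: buffers documents, commits them to a list."""

    def __init__(self):
        self.pending, self.committed = [], []

    def add(self, doc):
        self.pending.append(doc)

    def commit(self):
        self.committed.extend(self.pending)
        self.pending = []

    def search(self, query):
        return [doc for doc in self.committed if query in doc]


report = benchmark(ListEngine(), ["python search", "java search"], "python")
print(report["hits"])  # ['python search']
```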
  • 21. Indexing [chart: Time to Index, pylucene vs. whoosh; axis: time (s)]
  • 22. Committing [chart: Time to Commit, pylucene vs. whoosh; axis: time (s)]
  • 23. Searching [chart: Time to Search, pylucene vs. whoosh; axis: time (s)]
  • 24. Recommendations • Search engine library – No one solution fits all problems – Search engine abstraction is the key – Scalability is critical – Rapid to set up, develop and tweak – Understand and use
  • 25. Web development in Python • Getting rapid and easier by the day • Web frameworks – Django, Pylons • HTTP servers – Tornado, Gunicorn • Support for SQL/NoSQL databases – MySQL-python, pymongo • Template engines – Cheetah, jinja2 • Search – Pylucene, Whoosh
  • 26. References • Whoosh – https://bitbucket.org/mchaput/whoosh/wiki/Home • Pylucene – http://lucene.apache.org/pylucene/ – http://lucene.apache.org/core/3_6_1/api/all/index.html • Xappy – http://code.google.com/p/xappy/ • ElasticSearch – http://www.elasticsearch.org/guide/reference/api/
  • 27. References • Chetan's tech space – http://technobeans.com • Vishal's technical blog – http://freethreads.net
  • 28. Q and A
  • 29. Backup
  • 30. Whoosh v/s Haystack v/s Xapian • Whoosh is suitable for a small project. Limited scalability for search and indexing – A good beginning • Haystack is appropriate with Django • Xapian is ultra fast, but is not as feature rich as Solr • Lucene is not distributed; has external dependency
  • 31. Lucene v/s Database search • There are a number of query types that RDBMSs in general do not support without vendor extensions: • Fuzzy queries, in which "fuzzy" and "wuzzy" are considered matches • Word stemming queries, which consider "take," "took," and "taken" to be identical • Sound-like queries, which consider "cat" and "kat" to be identical • Synonym queries, which consider "jump," "hop," and "leap" to be identical • Queries on binary BLOB data types, such as PDF documents, Microsoft Word or Excel documents, or HTML and XML documents • More disappointingly, SQL search results are not ranked by match-relevance scores. The SQL standard is simply not intended for full-text querying.
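The fuzzy-query case on this slide ("fuzzy" matching "wuzzy") can be illustrated with edit distance. The sketch below uses the classic Levenshtein dynamic-programming recurrence as one possible technique; it is not claimed to be how Lucene implements fuzzy queries. The function names and the `max_distance=1` threshold are assumptions for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if the characters match
            ))
        prev = curr
    return prev[-1]


def fuzzy_match(term, vocabulary, max_distance=1):
    """Return vocabulary words within max_distance edits of term."""
    return [word for word in vocabulary if edit_distance(term, word) <= max_distance]


print(edit_distance("fuzzy", "wuzzy"))                            # 1
print(fuzzy_match("fuzzy", ["wuzzy", "fuzzy", "buzz", "python"]))  # ['wuzzy', 'fuzzy']
```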
  • 32. Typical search engine • Indexing – Convert files to a format for quick lookup – Fast random access to stored words • Searching – Specify keywords • Displaying – Look up documents that are relevant – Ranking – Different types of queries
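The ranking step mentioned on this slide can be sketched with a very simple score: count how often the query terms occur in each document and sort by that count. This is an assumed scoring rule chosen for brevity; real engines use richer relevance formulas, and `rank` is a hypothetical helper.

```python
from collections import Counter


def rank(documents, query):
    """Order matching doc ids by total query-term frequency, highest first."""
    terms = query.lower().split()
    scores = {}
    for doc_id, text in documents.items():
        counts = Counter(text.lower().split())
        score = sum(counts[term] for term in terms)
        if score:  # keep only documents that contain at least one query term
            scores[doc_id] = score
    # Sort by descending score, breaking ties by doc id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


docs = {
    "a": "python search python",
    "b": "search in java",
    "c": "cooking recipes",
}
print(rank(docs, "python search"))  # ['a', 'b']
```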
  • 33. Advanced Searching • Morelikethis • didyoumean
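A "did you mean" feature suggests an indexed term when the query term is not found. The sketch below approximates it with the standard library's `difflib.get_close_matches` over a list of indexed terms; this is a stand-in technique rather than the feature as implemented by either engine, and `did_you_mean`, the sample vocabulary and the 0.6 cutoff are assumptions.

```python
import difflib


def did_you_mean(term, vocabulary, n=1):
    """Suggest up to n indexed terms similar to a term that is not in the index."""
    if term in vocabulary:
        return []  # the term exists, so no suggestion is needed
    return difflib.get_close_matches(term, vocabulary, n=n, cutoff=0.6)


vocabulary = ["python", "lucene", "whoosh"]
print(did_you_mean("pyhton", vocabulary))  # ['python']
print(did_you_mean("python", vocabulary))  # []
```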