Your SlideShare is downloading. ×
  • Like
PyCon India 2012: Rapid development of website search in python
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×

Now you can save presentations on your phone or tablet

Available for both IPhone and Android

Text the download link to your phone

Standard text messaging rates apply

PyCon India 2012: Rapid development of website search in python

  • 1,200 views
Published

 

Published in Entertainment & Humor
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
No Downloads

Views

Total Views
1,200
On SlideShare
0
From Embeds
0
Number of Embeds
1

Actions

Shares
Downloads
9
Comments
0
Likes
1

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide
  • Whoosh? If you love Python more than learning Java.

Transcript

  • 1. Rapid development of website search in Python PyCon India, Bangalore, Sept’ 12 Chetan Giridhar
  • 2. For whom!  If you’re, an experienced developer who has implemented search solutions currently dirtying your hands prototyping website search for your startup dreading to learn Java  just curious..
  • 3. Think web development  Core functionality  Design patterns  Web Interface  Usability  Scalability  Performance  …?
  • 4. Search  Often considered – ‘good to have’  Enhances user experience  Focused information  Relevance  Interaction  Ranked searching
  • 5. Typical Search Engine  Designing a schema  Convert your data as Documents and store them to index  Document is a set of fields  Field is a name=value pair  {title = “python”, content = “computer”, tag = “language”}  Analyzers  "parse" each field of your data into index-able "tokens" or keywords.  “Welcome to Pycon" it will produce list [“welcome", “to", “Pycon”]
  • 6. Typical Search Engine  Indexing  Adding documents to the index  Query and query parsers  Prepare query  Parse  Analyze  Searching  Lookup index
  • 7. Schema based document Index Writer Indexing & Committing Input files Field1 Field3 Analyzer Field2 In-memory Index Committed
  • 8. Query Parser Analyzer Results Searching Input query Index Searcher Index
  • 9.  Sourcing input data set  Handling input queries  How to search  Search engines  How to display results  Customization Development : Considerations
  • 10.  Apache Solr: Sunburnt  Haystack  Xapian: Xappy  Elastic Search Development: Options  Whoosh  Lucene: Pylucene
  • 11.  Pythonic APIs  Deployment Large scale and medium sized web sites Talking Pylucene & Whoosh  Rapid Minimal installation Clear Documentation Quick Setup Ease of Integration
  • 12. Pylucene  Pylucene: Python wrappers to Lucene  The de-facto standard for search engine library  Lucene: an open source, pure Java, search engine library  Embeds a Java VM with Lucene into a Python process
  • 13. Pylucene  Simple API  High performance indexing  Scalable to millions of documents  Efficient and feature rich search algorithms  Cross platform
  • 14. Whoosh  Whoosh is a search engine library  Fast indexing and search  One of the fastest Python search engine  100% Python code  Extensible code  No external dependency  Active development and support
  • 15. Whoosh  Easy to setup  Neutral to web frameworks  Powerful query language  Feature rich  Intuitive APIs
  • 16.  Document  Field  IndexWriter  QueryParser  Analyzer  IndexSearcher  fields.Schema  index.Index  qparser.QueryParser  analysis. Analyzer  searching.Searcher PyLucene Whoosh
  • 17.  Search design should be:  An independent component Pluggable Platform independent Assume minimal external dependency Easily extendible Seamless integration Designing search in websites
  • 18. Search.py fsMgr
  • 19. Demo
  • 20. Comparing Engines  Basis of comparison  Indexing, Committing and Searching  Dataset  1 GB data  ~5000 files  file size ranging between 1KB to 50MB  Setup  Intel® Core™2 Duo CPU P8600 @ 2.40GHz × 2 3 GB RAM  Ubuntu Release 12.04 (precise) 32-bit
  • 21. Indexing 500 400 300 200 100 0 Time to Index pylucene whoosh time (s)
  • 22. Committing 300 250 200 150 100 50 0 Time to Commit pylucene whoosh time (s)
  • 23. Searching 0.01 0.008 0.006 0.004 0.002 0 Time to Search pylucene whoosh time (s)
  • 24. Recommendations  Search Engine Library No one solution fits all problems Search engine abstraction is the key Scalability is critical Rapid to setup, develop and tweak Understand and use 
  • 25.  Getting rapid and easier by the day  Web frameworks Web development in Python  Django, Pylons  Http Servers  Tornado, Gunicorn  Support for SQL/NoSQL databases MySQL-python, pymongo  Template Engines  Cheetah, jinja2  Search  Pylucene, Whoosh
  • 26. References  Whoosh  https://bitbucket.org/mchaput/whoosh/wiki/Home  Pylucene  http://lucene.apache.org/pylucene/  http://lucene.apache.org/core/3_6_1/api/all/index.html  Xappy  http://code.google.com/p/xappy/  ElasticSearch  http://www.elasticsearch.org/guide/reference/api/
  • 27. References  Chetan’s tech space  http://technobeans.com  Vishal’s technical blog  http://freethreads.net
  • 28. Q and A
  • 29. Backup
  • 30. Whoosh v/s Haystack v/s Xapian • Whoosh is suitable for a small project. Limited scalability for search and indexing – A good beginning • Haystack is appropriate with Django • Xapian is ultra fast, but is not as feature rich as Solr • Lucene is not distributed; has external dependency
  • 31. Lucene v/s Database search • There are a number of query types that RDBMSs in general do not support without vendor extensions: • Fuzzy queries, in which "fuzzy" and "wuzzy" are considered matches • Word stemming queries, which consider "take," "took," and "taken" to be identical • Sound-like queries, which consider "cat" and "kat" to be identical • Synonym queries, which consider "jump," "hop," and "leap" to be identical • Queries on binary BLOB data types, such as PDF documents, Microsoft Word or Excel documents, or HTML and XML documents • More disappointingly, SQL search results are not ranked by match-relevance scores. The SQL standard is simply not intended for full-text querying.
  • 32. • Indexing – Convert files to a format for quick look up – Fast random access to stored words • Searching – Specify keywords • Displaying – Lookup documents that are relevant – Ranking – Different types of queries Typical search engine
  • 33. Advanced Searching  Morelikethis  didyoumean