PyCon India 2012: Rapid development of website search in Python

Presentation Transcript

  • Rapid development of website search in Python. PyCon India, Bangalore, September 2012. Vishal Kanaujia, Chetan Giridhar
  • For whom? If you are: an experienced developer who has implemented search solutions; currently getting your hands dirty prototyping website search for your startup; dreading having to learn Java ☺; or just curious.
  • Think web development: core functionality, design patterns, web interface, usability, scalability, performance, …?
  • Search: often considered "good to have"; enhances user experience (focused information, relevance, interaction, ranked searching).
  • Typical Search Engine. Designing a schema: convert your data into documents and store them in the index. A document is a set of fields; a field is a name=value pair, e.g. {title = "python", content = "computer", tag = "language"}. Analyzers "parse" each field of your data into indexable "tokens" or keywords: given "Welcome to Pycon", an analyzer produces the list ["welcome", "to", "Pycon"].
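
The document and analyzer ideas on this slide can be sketched in plain Python. This is a minimal illustration, not the API of Whoosh or PyLucene: the `analyze` function and the `tokens_by_field` name are hypothetical, and this particular analyzer lowercases every token, so "Pycon" comes out as "pycon".

```python
import re

# A document is a set of fields; each field is a name=value pair.
document = {"title": "python", "content": "computer", "tag": "language"}


def analyze(text):
    """Hypothetical analyzer: split text into lowercase word tokens."""
    return [token.lower() for token in re.findall(r"\w+", text)]


# Analyze every field of the document into index-ready tokens.
tokens_by_field = {name: analyze(value) for name, value in document.items()}

print(analyze("Welcome to Pycon"))  # ['welcome', 'to', 'pycon']
print(tokens_by_field)
```

Real analyzers usually do more than this (stop-word removal, stemming), which is why the choice of analyzer is part of designing the schema.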
  • Typical Search Engine. Indexing: adding documents to the index. Query and query parsers: prepare the query (parse, analyze). Searching: look up the index.
  • Indexing & Committing (diagram components): input files; schema-based document (Field1, Field2, Field3); index writer; analyzer; in-memory index; committed index.
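
The split between an in-memory index and a committed index can be sketched as follows. This is an assumption-laden toy, not how Lucene or Whoosh store data: `TinyIndexWriter`, `pending`, and `visible_before_commit` are hypothetical names, and the "analyzer" is just lowercasing plus a whitespace split.

```python
from collections import defaultdict


class TinyIndexWriter:
    """Hypothetical writer: buffers postings in memory until commit()."""

    def __init__(self, committed_index):
        self.committed = committed_index    # term -> set of doc ids
        self.pending = defaultdict(set)     # uncommitted, in-memory postings

    def add_document(self, doc_id, **fields):
        # Analyze each field (lowercase + whitespace split) and buffer postings.
        for value in fields.values():
            for token in value.lower().split():
                self.pending[token].add(doc_id)

    def commit(self):
        # Merge the in-memory buffer into the committed index, then clear it.
        for token, doc_ids in self.pending.items():
            self.committed.setdefault(token, set()).update(doc_ids)
        self.pending.clear()


index = {}
writer = TinyIndexWriter(index)
writer.add_document(1, title="python", content="computer language")
writer.add_document(2, title="java", content="computer language")

visible_before_commit = dict(index)  # still empty: nothing committed yet
writer.commit()
print(sorted(index["computer"]))     # [1, 2]
```

Documents added through the writer only become searchable after `commit()`, which is why committing is measured as its own phase later in the talk.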
  • Searching (diagram): input query → query parser → analyzer → index searcher (over the index) → results.
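
A rough sketch of that search path, under stated assumptions: the `INDEX` contents, `parse_query`, and `search` are hypothetical, all query terms are required (AND semantics), and ranking is a plain sum of term frequencies rather than the scoring a real engine would use.

```python
from collections import Counter

# Hypothetical committed index: term -> {doc_id: term frequency}.
INDEX = {
    "python": {1: 2, 3: 1},
    "search": {1: 1, 2: 1, 3: 3},
    "java": {2: 2},
}


def parse_query(query):
    """Query parser + analyzer: lowercase the query and split it into terms."""
    return query.lower().split()


def search(query, index=INDEX):
    """Return doc ids containing every query term, best score first."""
    terms = parse_query(query)
    if not terms:
        return []
    scores = Counter()
    matching = None
    for term in terms:
        postings = index.get(term, {})
        docs = set(postings)
        matching = docs if matching is None else matching & docs
        for doc_id, freq in postings.items():
            scores[doc_id] += freq
    # Rank by summed term frequency; break ties by doc id for stable output.
    return sorted(matching, key=lambda d: (-scores[d], d))


print(search("Python search"))  # [3, 1]
```

The query goes through the same kind of analysis as the indexed text; if the two differ (for example, in case handling), lookups silently miss.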
  • Development: Considerations. Sourcing the input data set; handling input queries; how to search (algorithms); how to display results (customization).
  • Development: Options. Apache Solr: Sunburnt; Haystack; Xapian: Xappy; Elastic Search; Whoosh; Lucene: Pylucene.
  • Talking Pylucene & Whoosh. Pythonic APIs. Deployment: large-scale and medium-sized web sites. Rapid: minimal installation, clear documentation, quick setup, ease of integration.
  • Pylucene: Python wrappers for Lucene. Lucene is an open-source, pure-Java search engine library and the de facto standard among search engine libraries. Pylucene embeds a Java VM with Lucene into a Python process.
  • Pylucene: simple API; high-performance indexing; scalable to millions of documents; efficient and feature-rich search algorithms; cross-platform.
  • Whoosh: a search engine library; fast indexing and search (one of the fastest Python search engines); 100% Python code; extensible code; no external dependencies; active development and support.
  • Whoosh: easy to set up; neutral to web frameworks; powerful query language; feature-rich; intuitive APIs.
  • PyLucene vs. Whoosh equivalents: Document, Field ↔ fields.Schema; IndexWriter ↔ index.Index; QueryParser ↔ qparser.QueryParser; Analyzer ↔ analysis.Analyzer; IndexSearcher ↔ searching.Searcher.
  • Designing search in websites. Search design should be: an independent component; pluggable; platform independent; minimal in its external dependencies; easily extensible; seamless to integrate.
  • Abstraction in Search: fsMgrSearch.py
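
The contents of fsMgrSearch.py are not part of this transcript. As a hedged illustration of what a pluggable search abstraction might look like, here is a sketch in which every name (`SearchBackend`, `InMemoryBackend`, `handle_search_request`) is hypothetical and the stand-in engine does naive substring matching.

```python
from abc import ABC, abstractmethod


class SearchBackend(ABC):
    """Hypothetical interface that the website codes against."""

    @abstractmethod
    def index(self, doc_id, text): ...

    @abstractmethod
    def search(self, query): ...


class InMemoryBackend(SearchBackend):
    """Stand-in engine; an adapter for a real engine would subclass the same way."""

    def __init__(self):
        self._docs = {}

    def index(self, doc_id, text):
        self._docs[doc_id] = text.lower()

    def search(self, query):
        needle = query.lower()
        return sorted(d for d, text in self._docs.items() if needle in text)


def handle_search_request(backend, query):
    # The web layer depends only on the SearchBackend interface.
    return {"query": query, "hits": backend.search(query)}


backend = InMemoryBackend()
backend.index("about.html", "About PyCon India")
backend.index("talks.html", "Talks on Python search")
response = handle_search_request(backend, "python")
print(response)
```

Because the request handler only sees the interface, swapping one engine for another would mean writing a new subclass rather than touching the web code.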
  • Demo
  • Comparing Engines. Basis of comparison: indexing, committing, and searching. Dataset: 1 GB of data, ~5000 files, file sizes ranging from 1 KB to 50 MB. Setup: Intel® Core™2 Duo CPU P8600 @ 2.40GHz × 2, 3 GB RAM, Ubuntu 12.04 (precise) 32-bit.
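
One way to time the three phases separately is a small harness around `time.perf_counter`. This is a sketch only, not the benchmark code used in the talk: `timed`, `build_postings`, `commit`, and `lookup` are hypothetical stand-ins for an engine's indexing, commit, and search calls, and the tiny synthetic data set bears no relation to the 1 GB corpus described above.

```python
import time


def timed(func, *args):
    """Run func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


# Hypothetical stand-ins for an engine's three phases.
def build_postings(docs):
    pending = {}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            pending.setdefault(token, set()).add(doc_id)
    return pending


def commit(pending):
    return {token: frozenset(ids) for token, ids in pending.items()}


def lookup(index, term):
    return sorted(index.get(term.lower(), ()))


docs = {i: f"document {i} about python search" for i in range(1000)}
pending, t_index = timed(build_postings, docs)
index, t_commit = timed(commit, pending)
hits, t_search = timed(lookup, index, "python")

for phase, seconds in [("index", t_index), ("commit", t_commit), ("search", t_search)]:
    print(f"{phase:>6}: {seconds:.6f} s")
```

Timing each phase on its own matters because an engine that indexes quickly may still commit or search slowly.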
  • Indexing: [bar chart "Time to Index", time (s) on an axis from 0 to 500, pylucene vs. whoosh]
  • Committing: [bar chart "Time to Commit", time (s) on an axis from 0 to 300, pylucene vs. whoosh]
  • Searching: [bar chart "Time to Search", time (s) on an axis from 0 to 0.01, pylucene vs. whoosh]
  • Recommendations. Search engine library: no one solution fits all problems; scalability is critical; rapid to set up, develop, and tweak. Understand and use ☺
  • Web development in Python: getting more rapid and easier by the day. Web frameworks: Django, Pylons. HTTP servers: Tornado, Gunicorn. Support for SQL/NoSQL databases: MySQL-python, pymongo. Template engines: Cheetah, jinja2. Search: Pylucene, Whoosh.
  • References
    Whoosh: https://bitbucket.org/mchaput/whoosh/wiki/Home
    Pylucene: http://lucene.apache.org/pylucene/ and http://lucene.apache.org/core/3_6_1/api/all/index.html
    Xappy: http://code.google.com/p/xappy/
    ElasticSearch: http://www.elasticsearch.org/guide/reference/api/
  • Q and A