PyCon India 2012: Rapid development of website search in Python

Chetan Giridhar, Chief Technology Officer at Gaglers Inc
Rapid development of website search in Python
PyCon India, Bangalore, Sept ’12
Chetan Giridhar
For whom?
• If you are:
  – an experienced developer who has implemented search solutions
  – currently getting your hands dirty prototyping website search for your startup
  – dreading having to learn Java
  – just curious
Think web development
• Core functionality
• Design patterns
• Web interface
• Usability
• Scalability
• Performance
• …?
Search
• Often considered "good to have"
• Enhances user experience
  – Focused information
  – Relevance
  – Interaction
  – Ranked searching
Typical Search Engine
• Designing a schema
  – Convert your data into Documents and store them in the index
  – A Document is a set of fields
  – A Field is a name=value pair
  – {title = "python", content = "computer", tag = "language"}
• Analyzers
  – "Parse" each field of your data into indexable "tokens" or keywords
  – For "Welcome to Pycon", an analyzer that splits on whitespace and lowercases produces the list ["welcome", "to", "pycon"]
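The document and analyzer ideas on this slide can be sketched in a few lines of plain Python. This is an illustration only, not the analyzer that Whoosh or PyLucene ships: the `analyze` function and the `tokens_by_field` name are hypothetical, and the sketch assumes an analyzer that splits on word boundaries and lowercases every token.

```python
import re


def analyze(text):
    """Split text into lowercase word tokens (a deliberately simple analyzer)."""
    return [token.lower() for token in re.findall(r"\w+", text)]


# A document is a set of name=value fields, as on the slide.
document = {"title": "python", "content": "computer", "tag": "language"}

# Run the analyzer over every field to get the tokens that would be indexed.
tokens_by_field = {name: analyze(value) for name, value in document.items()}

print(analyze("Welcome to Pycon"))  # ['welcome', 'to', 'pycon']
print(tokens_by_field)
```

Real analyzers usually do more, such as dropping stop words or stemming, so the token list for the same sentence may differ from engine to engine.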
Typical Search Engine
• Indexing
  – Adding documents to the index
• Query and query parsers
  – Prepare query
  – Parse
  – Analyze
• Searching
  – Look up the index
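The three stages above (indexing, query parsing and analysis, lookup) can be sketched with a toy inverted index. This is a library-independent illustration under simple assumptions: `TinyIndex` and `analyze` are hypothetical names, every query is treated as an AND of its tokens, and nothing here reflects how Lucene or Whoosh store their indexes internally.

```python
import re
from collections import defaultdict


def analyze(text):
    """Lowercase the text and split it into word tokens."""
    return [token.lower() for token in re.findall(r"\w+", text)]


class TinyIndex:
    """A toy inverted index: token -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add_document(self, doc_id, fields):
        # Indexing: analyze every field and record which document holds each token.
        for value in fields.values():
            for token in analyze(value):
                self.postings[token].add(doc_id)

    def search(self, query):
        # Query handling: analyze the query the same way as the documents,
        # then look up each token and intersect the results (AND semantics).
        tokens = analyze(query)
        if not tokens:
            return set()
        result = set(self.postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self.postings.get(token, set())
        return result


index = TinyIndex()
index.add_document(1, {"title": "python", "content": "computer language"})
index.add_document(2, {"title": "java", "content": "computer language and coffee"})

print(sorted(index.search("Computer LANGUAGE")))  # [1, 2]
print(sorted(index.search("python language")))    # [1]
```

Analyzing the query with the same function as the documents is what makes the mixed-case query match the lowercase index entries.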
Indexing & Committing
[Diagram: input files are converted to schema-based documents (Field1, Field2, Field3), passed through the Analyzer to the Index Writer, held in an in-memory index, and then committed.]
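The split between an in-memory index and a committed one can be sketched as a writer that buffers documents and only persists them on commit. This is an assumption-laden toy: `TinyIndexWriter` and `load_committed` are hypothetical names, and the JSON file stands in for whatever on-disk format a real engine uses.

```python
import json
import os
import tempfile


def load_committed(path):
    """Return the documents already persisted at path (empty if none yet)."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


class TinyIndexWriter:
    """Buffers documents in memory; commit() makes them durable and visible."""

    def __init__(self, path):
        self.path = path
        self.pending = []

    def add_document(self, **fields):
        # Added documents stay in memory until commit() is called.
        self.pending.append(fields)

    def commit(self):
        committed = load_committed(self.path)
        committed.extend(self.pending)
        with open(self.path, "w", encoding="utf-8") as handle:
            json.dump(committed, handle)
        self.pending = []


with tempfile.TemporaryDirectory() as folder:
    index_path = os.path.join(folder, "index.json")
    writer = TinyIndexWriter(index_path)
    writer.add_document(title="python", content="computer")
    visible_before_commit = len(load_committed(index_path))
    writer.commit()
    visible_after_commit = len(load_committed(index_path))

print(visible_before_commit, visible_after_commit)  # 0 1
```

Separating the two steps is why indexing time and commit time can be measured independently.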
Searching
[Diagram: the input query goes through the Query Parser and Analyzer to the Index Searcher, which looks up the Index and returns results.]
Development: Considerations
• Sourcing input data set
• Handling input queries
• How to search
  – Search engines
• How to display results
  – Customization
Development: Options
• Apache Solr: Sunburnt
• Haystack
• Xapian: Xappy
• ElasticSearch
• Whoosh
• Lucene: PyLucene
Talking PyLucene & Whoosh
• Pythonic APIs
• Deployment
  – Large-scale and medium-sized web sites
• Rapid
  – Minimal installation
  – Clear documentation
  – Quick setup
  – Ease of integration
PyLucene
• PyLucene: Python wrappers for Lucene
  – The de facto standard search engine library
• Lucene: an open source, pure Java search engine library
• Embeds a Java VM with Lucene into a Python process
PyLucene
• Simple API
• High-performance indexing
• Scalable to millions of documents
• Efficient and feature-rich search algorithms
• Cross-platform
Whoosh
• Whoosh is a search engine library
• Fast indexing and search
  – One of the fastest Python search engines
• 100% Python code
• Extensible code
• No external dependencies
• Active development and support
Whoosh
• Easy to set up
• Neutral to web frameworks
• Powerful query language
• Feature rich
• Intuitive APIs
PyLucene:
• Document
• Field
• IndexWriter
• QueryParser
• Analyzer
• IndexSearcher
Whoosh:
• fields.Schema
• index.Index
• qparser.QueryParser
• analysis.Analyzer
• searching.Searcher
Designing search in websites
• Search design should be:
  – An independent component
  – Pluggable
  – Platform independent
  – Dependent on as few external components as possible
  – Easily extensible
  – Seamlessly integrated
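One way to get a pluggable, independent search component is to hide the engine behind a small interface. The sketch below is an assumption, not the design shown in the talk: `SearchBackend`, `InMemoryBackend` and `SiteSearch` are hypothetical names, and the in-memory backend only does substring matching so that the example stays self-contained.

```python
from abc import ABC, abstractmethod


class SearchBackend(ABC):
    """The only contract the website depends on."""

    @abstractmethod
    def index(self, doc_id, text):
        """Make text findable under doc_id."""

    @abstractmethod
    def search(self, query):
        """Return the ids of matching documents."""


class InMemoryBackend(SearchBackend):
    """Stand-in engine; a Whoosh or PyLucene adapter could replace it."""

    def __init__(self):
        self.docs = {}

    def index(self, doc_id, text):
        self.docs[doc_id] = text.lower()

    def search(self, query):
        needle = query.lower()
        return sorted(doc_id for doc_id, text in self.docs.items() if needle in text)


class SiteSearch:
    """Website-facing component; it never imports a concrete engine."""

    def __init__(self, backend):
        self.backend = backend

    def add_page(self, url, body):
        self.backend.index(url, body)

    def find(self, query):
        return self.backend.search(query)


site = SiteSearch(InMemoryBackend())
site.add_page("/talks/search", "Rapid development of website search in Python")
site.add_page("/talks/gil", "Python threads: dive into the GIL")

print(site.find("search"))  # ['/talks/search']
print(site.find("python"))  # ['/talks/gil', '/talks/search']
```

Swapping engines then means writing one more `SearchBackend` subclass, with no change to the code that calls `SiteSearch`.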
[Diagram: Search.py and fsMgr]
Demo
Comparing Engines
• Basis of comparison
  – Indexing, committing and searching
• Dataset
  – 1 GB data
  – ~5000 files
  – File sizes ranging from 1 KB to 50 MB
• Setup
  – Intel® Core™2 Duo CPU P8600 @ 2.40GHz × 2
  – 3 GB RAM
  – Ubuntu 12.04 (precise) 32-bit
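A comparison like this needs each phase timed separately. The harness below is a hedged sketch of how that could be done, not the script used for the talk: `time_call` and `build_index` are hypothetical, the toy index stands in for a real engine, and the tiny synthetic dataset is nothing like the 1 GB corpus described above.

```python
import time


def time_call(func, *args, repeat=3):
    """Return the best wall-clock time in seconds over `repeat` runs."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


def build_index(docs):
    """Toy indexing phase: map each token to the ids of documents containing it."""
    index = {}
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            index.setdefault(token, set()).add(doc_id)
    return index


# Synthetic stand-in for the input files.
docs = ["rapid website search in python"] * 1000
index = build_index(docs)

timings = {
    "index": time_call(build_index, docs),
    "search": time_call(index.get, "python"),
}

for phase, seconds in timings.items():
    print(f"{phase}: {seconds:.6f} s")
```

Taking the best of several runs reduces noise from other processes; with a real engine, the commit step would be timed the same way as a third phase.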
Indexing
[Chart: Time to Index, PyLucene vs. Whoosh; y-axis: time (s), 0 to 500]
Committing
[Chart: Time to Commit, PyLucene vs. Whoosh; y-axis: time (s), 0 to 300]
Searching
[Chart: Time to Search, PyLucene vs. Whoosh; y-axis: time (s), 0 to 0.01]
Recommendations
• Search engine library
  – No one solution fits all problems
  – Search engine abstraction is the key
  – Scalability is critical
  – Rapid to set up, develop and tweak
  – Understand and use
Web development in Python
• Getting faster and easier by the day
• Web frameworks
  – Django, Pylons
• HTTP servers
  – Tornado, Gunicorn
• Support for SQL/NoSQL databases
  – MySQL-python, pymongo
• Template engines
  – Cheetah, jinja2
• Search
  – PyLucene, Whoosh
References
• Whoosh
  – https://bitbucket.org/mchaput/whoosh/wiki/Home
• PyLucene
  – http://lucene.apache.org/pylucene/
  – http://lucene.apache.org/core/3_6_1/api/all/index.html
• Xappy
  – http://code.google.com/p/xappy/
• ElasticSearch
  – http://www.elasticsearch.org/guide/reference/api/
References
• Chetan’s tech space
  – http://technobeans.com
• Vishal’s technical blog
  – http://freethreads.net
Q and A
Backup
Whoosh vs. Haystack vs. Xapian
• Whoosh is suitable for a small project; it has limited scalability for search and indexing
  – A good beginning
• Haystack is appropriate with Django
• Xapian is ultra fast, but is not as feature rich as Solr
• Lucene is not distributed and has an external dependency
Lucene vs. database search
• There are a number of query types that RDBMSs in general do not support without vendor extensions:
  – Fuzzy queries, in which "fuzzy" and "wuzzy" are considered matches
  – Word stemming queries, which consider "take," "took," and "taken" to be identical
  – Sound-like queries, which consider "cat" and "kat" to be identical
  – Synonym queries, which consider "jump," "hop," and "leap" to be identical
  – Queries on binary BLOB data types, such as PDF documents, Microsoft Word or Excel documents, or HTML and XML documents
• More disappointingly, SQL search results are not ranked by match-relevance scores. The SQL standard is simply not intended for full-text querying.
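The fuzzy-query point can be illustrated with the standard library alone. The sketch below assumes SQLite as the example database and uses `difflib.SequenceMatcher` as a stand-in similarity measure; it is not how Lucene implements fuzzy queries, and the `fuzzy_hits` helper and its 0.75 threshold are hypothetical choices.

```python
import sqlite3
from difflib import SequenceMatcher

words = ["fuzzy", "wuzzy", "python"]

# Plain SQL: a substring match finds only the literal term, and rows are unranked.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE terms (word TEXT)")
conn.executemany("INSERT INTO terms VALUES (?)", [(word,) for word in words])
sql_hits = [
    row[0]
    for row in conn.execute("SELECT word FROM terms WHERE word LIKE ?", ("%fuzzy%",))
]
conn.close()


def fuzzy_hits(query, candidates, threshold=0.75):
    """Return candidates similar to query, best match first."""
    scored = [(SequenceMatcher(None, query, c).ratio(), c) for c in candidates]
    return [c for score, c in sorted(scored, reverse=True) if score >= threshold]


print(sql_hits)                    # ['fuzzy']
print(fuzzy_hits("fuzzy", words))  # ['fuzzy', 'wuzzy']
```

The second result is both broader and ordered by similarity, which is the behaviour the slide says plain SQL lacks.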
Typical search engine
• Indexing
  – Convert files to a format for quick lookup
  – Fast random access to stored words
• Searching
  – Specify keywords
• Displaying
  – Look up documents that are relevant
  – Ranking
  – Different types of queries
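Ranking is the step that separates a search engine from a plain lookup. A minimal sketch of one common scheme, TF-IDF, follows; the scoring formula, the three sample documents and the function names are assumptions chosen for illustration, and real engines use more refined scoring.

```python
import math
from collections import Counter

docs = {
    "a": "python search engine in python",
    "b": "java search engine",
    "c": "cooking recipes",
}
tokenized = {doc_id: text.split() for doc_id, text in docs.items()}


def idf(term):
    """Rarer terms get a higher weight; terms found nowhere get zero."""
    containing = sum(1 for tokens in tokenized.values() if term in tokens)
    return math.log(len(tokenized) / containing) if containing else 0.0


def score(query, tokens):
    """Sum of term frequency times inverse document frequency per query term."""
    counts = Counter(tokens)
    return sum((counts[term] / len(tokens)) * idf(term) for term in query.split())


def ranked_search(query):
    """Return ids of documents with a positive score, most relevant first."""
    hits = []
    for doc_id, tokens in tokenized.items():
        relevance = score(query, tokens)
        if relevance > 0:
            hits.append((relevance, doc_id))
    return [doc_id for relevance, doc_id in sorted(hits, reverse=True)]


print(ranked_search("python search"))  # ['a', 'b']
```

Document "a" ranks first because it contains the rarer term "python" twice, while "b" matches only the more common term "search".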
Advanced Searching
• "More like this"
• "Did you mean"
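A "did you mean" feature can be approximated by suggesting indexed terms that are close to a misspelled query word. The sketch below uses `difflib.get_close_matches` from the standard library as an assumed stand-in; the `did_you_mean` helper, the vocabulary and the 0.7 cutoff are hypothetical, and this is not the spelling-correction API of Whoosh or Lucene.

```python
from difflib import get_close_matches

# Terms assumed to be present in the index.
indexed_terms = ["python", "pylucene", "whoosh", "lucene", "search"]


def did_you_mean(word, vocabulary, limit=3):
    """Suggest close indexed terms for a word that is not in the vocabulary."""
    if word in vocabulary:
        return []  # the word is already known, so no suggestion is needed
    return get_close_matches(word, vocabulary, n=limit, cutoff=0.7)


print(did_you_mean("pyhton", indexed_terms))  # ['python']
print(did_you_mean("python", indexed_terms))  # []
```

Drawing suggestions from the index vocabulary, rather than from a general dictionary, keeps every suggestion a term that will actually return results.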

Editor's Notes

  1. Whoosh? If you love Python more than learning Java.