Scaling search to a million pages with Solr, Python, and Django

A talk given to DJUGL on the 26th July 2010, describing and introducing Solr, and discussing how we use it at Timetric to drive navigation across over a million dataseries.

Published in: Technology

  1. Scaling search to a million pages with Solr, Django and Python. Toby White, toby@timetric.com, @tow21
  2. 1,079,446!!!
  3. Data store, Django, Big Bad Web
  4. Data store, Django, Big Bad Web
  5. Key-Value Store: Filesystem, Berkeley DB (unstructured); MySQL (structured)
  6. Foreign Key (RDBMS): SQLite, MySQL, Postgres, Oracle ... Related content through JOINs over structured data
  7. Search Engines: Solr (Lucene), Xapian, (Whoosh). Denormalized, inverted index over unstructured/semi-structured data
  8. Other routes to full-text search:
      http://www.postgresql.org/docs/8.4/static/textsearch.html
      http://code.google.com/p/djangosearch/
      http://www.sphinxsearch.com/
  9. Solr: HTTP interface to Lucene. Lucene was written by Doug Cutting (Hadoop), first release 2001. Solr began as an in-house CNET project, open-sourced in 2006. Solr 1.4 and Lucene 3.0 were released in November 2009. Solr + Lucene merged in March 2010. Next version (1.5/3.1/4.0) is not for production use yet.
  10. Solr vs RDBMS: a Solr Index is composed of Documents; an RDBMS Table is composed of Rows. ALL DOCUMENTS HAVE THE SAME STRUCTURE
  11. Optional columns, denormalized data.
      Source entities: Book (Title, ISBN, Author (FK Person)); Magazine (Title, ISSN, Publication Frequency); Person (First name, Last name); also Editor (FK Person) and Contributor (M2M Person).
      Document fields and their options: Entity type (required); Title (required); Identifier (uniqueKey); Pub. Frequency; Associated name (multiValued); Default Search (multiValued, default).
      copyField: Title and Associated Name are copied into Default Search.
  12. There is no update, only overwrite!!!
      Before: Book, "Solar Enterprise Search Server", Identifier, Pub. Freq., David Smiley, Eric Pugh
      After: Book, "Solr 1.4 Enterprise Search Server", Identifier, Pub. Freq., David Smiley, Eric Pugh
      Solr can't overwrite without a uniqueKey
  13. Schema design:
      <field name="title" type="text" indexed="true" stored="true" required="true" multiValued="false" />
      Field types: text, int, long, float, double, date
      Query: What do you want to search on? What do you want to do with results?
  14. Ingest over HTTP into Solr: <xml>, csv. Output over HTTP from Solr: <xml>, {json}, exec. python. Query: URL-escaped Lucene query syntax (yuck)
  15. GET http://localhost:8983/solr/select/?q=searchterm
      GET http://localhost:8983/solr/current/select/?fq=private%3Afalse&rows=20&facet.field=tags&f.tags.facet.limit=20&f.tags.facet.mincount=1&facet=true&start=0&q=%28tags%3A%22ons%3Adataseries-fullid%3DYBUKQA%22+AND+tags%3A%22united+kingdom%22+AND+NOT+is_mapreduce%3Atrue%29+OR+%28%28tags%3A%22ons%3Adataseries-fullid%3DYBUKQA%22+AND+tags%3A%22united+kingdom%22+AND+is_index%3Atrue%5E100%29
  16. Need ORM equivalent (OIM?)
      Sunburnt: http://timetric.com/about/opensource/#sunburnt and http://github.com/tow/sunburnt
      http://haystacksearch.org/ (cleaves close to Django, not schema-driven)
  17. GET http://localhost:8983/solr/current/select/?fq=private%3Afalse&rows=20&facet.field=tags&f.tags.facet.limit=20&f.tags.facet.mincount=1&facet=true&start=0&q=%28tags%3A%22ons%3Adataseries-fullid%3DYBUKQA%22+AND+tags%3A%22united+kingdom%22%29+OR+%28%28tags%3A%22ons%3Adataseries-fullid%3DYBUKQA%22+AND+tags%3A%22united+kingdom%22+AND+is_index%3Atrue%5E100%29
      The same query via sunburnt:
      solr.query(tags="ons:dataseries-fullid=YBUKQA")
          .query(tags="united kingdom")
          .filter(private=False)
          .boost_relevancy(100, is_index=True)
          .facet_by("tags", mincount=1, limit=20)
          .paginate(rows=20)
  18. Faceting, MoreLikeThis, Highlighting, Pagination, Sorting
      http://wiki.apache.org/solr/FrontPage
      http://packtpub.com/solr-1-4-enterprise-search-server
  19. Scaling to a million pages ... - talk to the Guardian (Content API). Decouple read/write; re-indexing/optimizing strategies; FieldType/Analyzer/Tokenizer tweaks
  20. Decouple read/write: separate processes - many readers, single write pipeline. Beware multiple writers! Remember standard DB practice - write to master, read from slave.
  21. Index operations (diagram): Add documents; Commit; Warm up facet cache; Optimize. The diagram also carries the label "Fast".
  22. "UK crime: Betting, gaming and lotteries (year ending 5th April)": Tokenizer gives "Betting"; Analyzer (Porter stemmer) gives "bet".
      "Belgium, Unemployment rate by gender, Total (BE,T)": Tokenizer (whitespace) gives "BE,T"; Tokenizer (character filter) gives "bet".
  23. In the small: understand Solr schemas - build one for your data. How do you want to query? How do you want to show results?
      In the large: understand Solr architecture - build around your data-flow. How/when do you want to read/write? What shape/characteristics does your corpus have?
  24. Thanks for listening! Questions welcome ... toby@timetric.com @tow21
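
Slides 14 and 15 describe querying Solr with URL-escaped Lucene syntax. As a rough illustration of where those long URLs come from, the sketch below assembles a similar request using only the Python standard library. The helper names (lucene_phrase, build_select_url) are made up for this example and are not part of Solr or sunburnt; the base URL is copied from the slide's local example.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Base URL taken from the slide's localhost example; adjust for a real install.
SOLR_SELECT = "http://localhost:8983/solr/select/"


def lucene_phrase(field, value):
    """Return field:"value" with embedded backslashes and quotes escaped.

    Hypothetical helper for illustration only.
    """
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


def build_select_url(q, fq=None, rows=20, start=0, facet_field=None):
    """Assemble a /select URL; urlencode performs the percent-escaping."""
    params = [("q", q), ("rows", rows), ("start", start)]
    if fq:
        params.append(("fq", fq))
    if facet_field:
        params += [
            ("facet", "true"),
            ("facet.field", facet_field),
            (f"f.{facet_field}.facet.mincount", 1),
        ]
    return SOLR_SELECT + "?" + urlencode(params)


query = " AND ".join([
    lucene_phrase("tags", "ons:dataseries-fullid=YBUKQA"),
    lucene_phrase("tags", "united kingdom"),
])
url = build_select_url(query, fq="private:false", facet_field="tags")
print(url)

# Decoding the query string recovers the readable Lucene query.
decoded = parse_qs(urlsplit(url).query)
print(decoded["q"][0])
```

Hand-building strings like this gets unwieldy quickly, which is the motivation the talk gives for a query-builder layer such as sunburnt.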
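
Slide 12's point that Solr overwrites rather than updates can be made concrete with the XML update message, which wraps each document in add/doc/field elements. The sketch below only builds that message as a string and does not contact a server. The field names and the "book-1" identifier are invented for illustration and would have to match a real schema, with the identifier field declared as the uniqueKey.

```python
import xml.etree.ElementTree as ET


def add_message(doc):
    """Serialise one document as a Solr <add> message.

    List or tuple values become repeated <field> elements, which is how
    multiValued fields are sent.
    """
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        for v in values:
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(v)
    return ET.tostring(add, encoding="unicode")


# Hypothetical document: field names are assumptions, not a real schema.
book = {
    "identifier": "book-1",
    "title": "Solr 1.4 Enterprise Search Server",
    "author": ["David Smiley", "Eric Pugh"],
}
xml_body = add_message(book)
print(xml_body)
```

Correcting a title means sending the whole document again under the same identifier; there is no way to send only the changed field.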
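
Slide 22 appears to show two unrelated titles both producing the term "bet": one through stemming "Betting", the other through stripping punctuation from the series code "BE,T". The toy pipeline below reproduces that collision in plain Python. It is not Solr's analysis chain: strip_punctuation and toy_stem are crude stand-ins written for this example, and toy_stem handles only the "-ing" case rather than implementing the Porter algorithm.

```python
import re


def whitespace_tokens(text):
    """Split on whitespace only and lowercase each token."""
    return text.lower().split()


def strip_punctuation(token):
    """Character-filter stand-in: keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", token)


def toy_stem(token):
    """Crude stand-in for a stemmer: strip "ing", then undouble a final consonant."""
    if token.endswith("ing") and len(token) > 5:
        token = token[:-3]
        if len(token) >= 2 and token[-1] == token[-2] and token[-1] not in "aeioulsz":
            token = token[:-1]
    return token


def analyze(text):
    """Whitespace tokenizer, then punctuation filter, then toy stemmer."""
    cleaned = (strip_punctuation(t) for t in whitespace_tokens(text))
    return [toy_stem(t) for t in cleaned if t]


crime = "UK crime: Betting, gaming and lotteries (year ending 5th April)"
belgium = "Belgium, Unemployment rate by gender, Total (BE,T)"

print(analyze(crime))
print(analyze(belgium))

# Both titles now share an indexed term, so a search for "betting" can match both.
shared = set(analyze(crime)) & set(analyze(belgium))
print(shared)
```

This is the kind of problem the FieldType/Analyzer/Tokenizer tweaks on slide 19 are meant to address, for example by indexing identifier-like codes in a separate field that is not stemmed.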
