Scaling search to a million pages with Solr, Python, and Django
 

A talk given to DJUGL on the 26th July 2010, describing and introducing Solr, and discussing how we use it at Timetric to drive navigation across over a million dataseries.


Usage Rights

CC Attribution License


    Presentation Transcript

    • Scaling search to a million pages with Solr, Django and Python Toby White toby@timetric.com @tow21
    • 1,079,446!!!
    • Data store Django Big Bad Web
    • Key-Value Store: Filesystem, Berkeley DB (unstructured); MySQL (structured)
    • Foreign Key (RDBMS): SQLite, MySQL, Postgres, Oracle ... related content through JOINs over structured data
    • Search Engines: Solr (Lucene), Xapian, (Whoosh). Denormalized, inverted index over unstructured/semi-structured data
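
To make the "inverted index" idea on the slide above concrete, here is a minimal sketch in plain Python. It is not how Lucene is implemented; the function names and the sample titles are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, *terms):
    """Return the ids of documents containing every term (AND semantics)."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()


# Illustrative documents only; not taken from the Timetric corpus.
docs = {
    1: "UK unemployment rate",
    2: "Belgium unemployment rate by gender",
    3: "UK crime statistics",
}
index = build_inverted_index(docs)
print(sorted(search(index, "unemployment", "rate")))  # [1, 2]
print(sorted(search(index, "uk")))                    # [1, 3]
```

The lookup cost depends on the number of matching terms rather than on scanning every document, which is why this structure suits free-text search better than a table scan.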
    • Other routes to full-text search http://www.postgresql.org/docs/8.4/static/textsearch.html http://code.google.com/p/djangosearch/ http://www.sphinxsearch.com/
    • Solr: HTTP interface to Lucene. Lucene was written by Doug Cutting (Hadoop), first release 2001. Solr was an in-house CNET project, open-sourced in 2006. Solr 1.4 and Lucene 3.0 were released in November 2009. Solr + Lucene merged in March 2010. Next version (1.5/3.1/4.0) is not for production use yet.
    • Solr vs. RDBMS: an Index is composed of Documents; a Table is composed of Rows. ALL DOCUMENTS HAVE THE SAME STRUCTURE
    • Document design: optional columns, denormalized data. Relational entities: Book (Title, ISBN, Author (FK Person), Editor (FK Person), Contributor (M2M Person)); Magazine (Title, ISSN, Publication Frequency); Person (First name, Last name). Document fields and field options: Entity type (required), Title (required), Identifier (uniqueKey), Pub. Frequency, Associated name (multiValued), Default Search (multiValued, default; copyField from Title and Associated Name)
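
As a rough illustration of the denormalization step on the slide above, the sketch below flattens a book row and its related person rows into one flat document. The field names, the dict-based "rows", and the placeholder identifier are assumptions made for this example; the copyField behaviour is only emulated in Python, whereas Solr would do it from the schema.

```python
def denormalize_book(book, people):
    """Flatten a book row and its related person rows into one flat document."""
    # Resolve every foreign key to a display name, as a JOIN would.
    person_ids = [book["author_id"], book["editor_id"]] + book["contributor_ids"]
    names = [
        "{first_name} {last_name}".format(**people[pid])
        for pid in person_ids
        if pid is not None
    ]
    doc = {
        "entity_type": "book",       # required
        "title": book["title"],      # required
        "identifier": book["isbn"],  # plays the uniqueKey role
        "associated_name": names,    # multiValued
    }
    # Emulate copyField: gather title and names into one default search field.
    doc["default_search"] = [doc["title"]] + names
    return doc


people = {
    1: {"first_name": "David", "last_name": "Smiley"},
    2: {"first_name": "Eric", "last_name": "Pugh"},
}
book = {
    "title": "Solr 1.4 Enterprise Search Server",
    "isbn": "isbn-placeholder",  # placeholder, not the real ISBN
    "author_id": 1,
    "editor_id": None,
    "contributor_ids": [2],
}
doc = denormalize_book(book, people)
print(doc["associated_name"])  # ['David Smiley', 'Eric Pugh']
```

A Magazine or Person row would be flattened into the same set of fields, leaving the ones that do not apply empty.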
    • There is no update, only overwrite!!! Example: a Book document titled "Solar Enterprise Search Server" (Identifier, Pub. Freq., David Smiley, Eric Pugh) is replaced wholesale by a document titled "Solr 1.4 Enterprise Search Server" (Identifier, Pub. Freq., David Smiley, Eric Pugh). Solr can't overwrite without a uniqueKey
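
The overwrite rule can be modelled with a toy in-memory index. This is a sketch of the semantics only, not of Solr's behaviour in every configuration; the class name, field names, and identifier value are hypothetical.

```python
class ToyIndex:
    """In-memory stand-in for an index whose documents are replaced whole."""

    def __init__(self, unique_key):
        self.unique_key = unique_key
        self.docs = {}

    def add(self, doc):
        if self.unique_key not in doc:
            # Without a key value there is nothing to match the old document against.
            raise ValueError("cannot overwrite without a uniqueKey value")
        # The stored document is replaced wholesale; fields are never merged.
        self.docs[doc[self.unique_key]] = dict(doc)


index = ToyIndex(unique_key="identifier")
index.add({"identifier": "book-1", "title": "Solar Enterprise Search Server",
           "pub_freq": "n/a"})
# Fixing the title means resending the whole document. Leaving out
# "pub_freq" here drops it from the stored document.
index.add({"identifier": "book-1", "title": "Solr 1.4 Enterprise Search Server"})
print(index.docs["book-1"])
# {'identifier': 'book-1', 'title': 'Solr 1.4 Enterprise Search Server'}
```

The practical consequence is that the indexing code has to be able to rebuild a complete document from the source data whenever any single field changes.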
    • Schema design: <field name="title" type="text" indexed="true" stored="true" required="true" multiValued="false" /> Field types: text, int, long, float, double, date. Query: What do you want to search on? What do you want to do with results?
    • Ingest (<xml>, csv) → HTTP → Solr → HTTP → Output (<xml>, {json}, exec. python). Query: URL-escaped Lucene query syntax (yuck)
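
For the XML ingest route, a document is sent to Solr inside an `<add>` message. The sketch below only builds that message with the standard library and sends nothing over the network; the helper name and the sample field names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def build_add_message(docs):
    """Serialize dicts as a Solr XML <add> message (one <field> per value)."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            # A multiValued field is sent as repeated <field> elements.
            values = value if isinstance(value, (list, tuple)) else [value]
            for item in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(item)
    return ET.tostring(add, encoding="unicode")


message = build_add_message([
    {"identifier": "book-1", "tags": ["solr", "search"]},
])
print(message)
# <add><doc><field name="identifier">book-1</field><field name="tags">solr</field><field name="tags">search</field></doc></add>
```

In a typical setup the message would be POSTed to the update handler and followed by a `<commit/>` message before the documents become searchable; the exact handler URL depends on the deployment.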
    • GET http://localhost:8983/solr/select/?q=searchterm
      GET http://localhost:8983/solr/current/select/?fq=private%3Afalse&rows=20&facet.field=tags&f.tags.facet.limit=20&f.tags.facet.mincount=1&facet=true&start=0&q=%28tags%3A%22ons%3Adataseries-fullid%3DYBUKQA%22+AND+tags%3A%22united+kingdom%22+AND+NOT+is_mapreduce%3Atrue%29+OR+%28%28tags%3A%22ons%3Adataseries-fullid%3DYBUKQA%22+AND+tags%3A%22united+kingdom%22+AND+is_index%3Atrue%5E100%29
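
URLs like the ones above are rarely written by hand. A minimal sketch of producing the escaping with the Python standard library follows; the helper name and its parameters are assumptions, and only the `q`, `fq`, `rows` and `start` parameters seen on the slide are used.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8983/solr/select/"  # endpoint as shown on the slide


def build_select_url(query, filters=(), rows=20, start=0):
    """Build a select URL, URL-escaping the Lucene query syntax."""
    params = [("q", query), ("rows", rows), ("start", start)]
    # Each filter query is passed as its own fq parameter.
    params.extend(("fq", f) for f in filters)
    return BASE_URL + "?" + urlencode(params)


url = build_select_url(
    'tags:"united kingdom" AND NOT is_mapreduce:true',
    filters=["private:false"],
)
print(url)
# http://localhost:8983/solr/select/?q=tags%3A%22united+kingdom%22+AND+NOT+is_mapreduce%3Atrue&rows=20&start=0&fq=private%3Afalse
```

This handles the escaping but still leaves the caller assembling Lucene syntax as strings, which is the gap a query-building library is meant to close.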
    • Need ORM equivalent (OIM?). Sunburnt: http://timetric.com/about/opensource/#sunburnt http://github.com/tow/sunburnt Alternative: http://haystacksearch.org/ (cleaves close to Django, not schema-driven)
    • GET http://localhost:8983/solr/current/select/?fq=private%3Afalse&rows=20&facet.field=tags&f.tags.facet.limit=20&f.tags.facet.mincount=1&facet=true&start=0&q=%28tags%3A%22ons%3Adataseries-fullid%3DYBUKQA%22+AND+tags%3A%22united+kingdom%22%29+OR+%28%28tags%3A%22ons%3Adataseries-fullid%3DYBUKQA%22+AND+tags%3A%22united+kingdom%22+AND+is_index%3Atrue%5E100%29
      becomes, with sunburnt:
      solr.query(tags="ons:dataseries-fullid=YBUKQA").query(tags="united kingdom").filter(private=False).boost_relevancy(100, is_index=True).facet_by("tags", mincount=1, limit=20).paginate(rows=20)
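
The `mincount` and `limit` arguments to `facet_by` in the query above control which facet values come back. The toy function below mimics that on a list of dicts to show what a facet count is; it is a plain-Python illustration with made-up documents, not Solr's or sunburnt's implementation.

```python
from collections import Counter


def facet_counts(docs, field, mincount=1, limit=20):
    """Count how many documents carry each value of a multi-valued field."""
    counts = Counter()
    for doc in docs:
        for value in doc.get(field, []):
            counts[value] += 1
    # Most frequent first; ties broken alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(value, n) for value, n in ranked if n >= mincount][:limit]


# Illustrative documents only.
docs = [
    {"tags": ["united kingdom", "crime"]},
    {"tags": ["united kingdom", "unemployment"]},
    {"tags": ["belgium", "unemployment"]},
]
print(facet_counts(docs, "tags", mincount=2))
# [('unemployment', 2), ('united kingdom', 2)]
```

Counts like these are what drive navigation: each value becomes a link that narrows the result set with an additional filter.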
    • Faceting, MoreLikeThis, Highlighting, Pagination, Sorting. http://wiki.apache.org/solr/FrontPage http://packtpub.com/solr-1-4-enterprise-search-server
    • Scaling to a million pages ... - talk to the Guardian (Content API). Decouple read/write. Re-indexing/optimizing strategies. FieldType/Analyzer/Tokenizer tweaks.
    • Decouple read/write: separate processes - many readers, single write pipeline. Beware multiple writers! Remember standard DB practice - write to master, read from slave.
    • Index write pipeline: Add documents (fast) → Commit → Warm up facet cache → Optimize
    • "UK crime: Betting, gaming and lotteries (year ending 5th April)" → Tokenizer → "Betting" → Analyzer (Porter stemmer) → "bet". "Belgium, Unemployment rate by gender, Total (BE,T)" → Tokenizer (whitespace) → "BE,T" → Tokenizer (character filter) → "bet"
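
One reading of the slide above is that two unrelated titles end up sharing the indexed term "bet" once tokenizing, character filtering and stemming have run. The sketch below reproduces that collision under stated assumptions: the stemmer is a crude stand-in that only handles "-ing" (it is not the Porter algorithm), and the filter choices are guesses at the configuration being described.

```python
import re


def whitespace_tokenize(text):
    """Split on whitespace only, leaving punctuation attached to tokens."""
    return text.split()


def strip_punctuation(token):
    """Character filter: drop everything except letters and digits."""
    return re.sub(r"[^A-Za-z0-9]", "", token)


def crude_stem(token):
    """Very rough stand-in for a Porter stemmer; handles only '-ing'."""
    if token.endswith("ing") and len(token) > 5:
        token = token[:-3]
        # Undouble a trailing consonant: "bett" -> "bet".
        if len(token) > 2 and token[-1] == token[-2] and token[-1] not in "aeioulsz":
            token = token[:-1]
    return token


def analyze(text):
    """Run the full chain: tokenize, filter, lowercase, stem."""
    terms = []
    for raw in whitespace_tokenize(text):
        token = crude_stem(strip_punctuation(raw).lower())
        if token:
            terms.append(token)
    return terms


crime = "UK crime: Betting, gaming and lotteries (year ending 5th April)"
belgium = "Belgium, Unemployment rate by gender, Total (BE,T)"
print("bet" in analyze(crime), "bet" in analyze(belgium))  # True True
```

A search for "betting" would therefore also match the Belgian unemployment series, which is the kind of surprise that FieldType/Analyzer/Tokenizer tweaks are meant to remove.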
    • In the small: understand Solr schemas - build one for your data. How do you want to query? How do you want to show results? In the large: understand Solr architecture - build around your data-flow. How/when do you want to read/write? What shape/characteristics does your corpus have?
    • Thanks for listening! Questions welcome ... toby@timetric.com @tow21