"Searching with Solr" - Tyler Harms, South Dakota Code Camp 2012

"Searching with Solr" by Tyler Harms, given November 10, 2012, at South Dakota Code Camp 2012 in Sioux Falls.

  • I have indexed my data from a MySQL database. I want to search by requesting
    http://localhost:8983/solr/select?q=(any word from my documents)

    so that I get back every document containing that word, but this does not work. The data has been indexed: http://localhost:8983/solr/select?q=*:* shows all of the documents. How can I make q=anyword return all the documents that include that word?

Transcript

  • 1. Searching with Solr: An Introduction • Tyler Harms, Developer • @harmstyler • tyler@blendinteractive.com • Saturday, November 10, 2012
  • 2. Why Implement Solr? • Does your site need search? • Is Google enough? • Do you need/want to control rankings? • Just text, or structured data?
  • 3. What is Solr? Solr is a standalone enterprise search server with a REST-like API. You put documents in it [...] over HTTP. You query it via HTTP GET and receive [...] results.
  • 5. Solr Versions • Current Version(s) • Solr 3.6.1 • Solr 4 • Released versions are always stable
  • 6. $ wget http://(...)/3.6.1/apache-solr-3.6.1.tgz $ tar -xzf apache-solr-3.6.1.tgz $ cd apache-solr-3.6.1/example/ $ java -jar start.jar (a lot of java log...)
  • 7. Search Alternatives • Google • Lucene • elasticsearch • Whoosh • Xapian • Many Others
  • 8. NOT a Database Replacement • Solr is designed to live alongside your website as a separate web app
  • 9. [Architecture diagram] Frontend Servers[1..n] • Database Master • Database Slaves[0..n] • Solr Master • Solr Slaves[0..n]
  • 10. Scaling Solr • Master/Slave Architecture • Write to master -> Read from slaves • Multicore Setup • Multiple Solr ‘cores’ running alongside each other within the same install
  • 11. Solr’s Data Model • Solr maintains a collection of documents • A document is a collection of fields and values • A field can occur multiple times in a doc • Documents are immutable • They can be deleted and replaced by new versions, however.
  • 12. Querying • HTTP request • http://localhost:8983/solr/select?q=blend&start=0&rows=10
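
The request on this slide can be assembled programmatically so that user input is URL-encoded. The following is a minimal sketch using only Python's standard library; the base URL is the local example server from the slides, and `build_select_url` is a hypothetical helper name, not part of Solr.

```python
from urllib.parse import urlencode

# Assumed base URL: the example Solr server from the slides, running locally.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_select_url(query, start=0, rows=10):
    """Return a select URL with a URL-encoded query and paging parameters."""
    params = [("q", query), ("start", start), ("rows", rows)]
    return SOLR_SELECT_URL + "?" + urlencode(params)


# The query from the slide: first page of ten results for "blend".
print(build_select_url("blend"))
# Second page of a query that contains a space (encoded as "+").
print(build_select_url("san jose", start=10))
```

Paging works by raising `start` in steps of `rows`.
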
  • 13. Solr Query Syntax • blend (value) • company:blend (field:value) • title:"Searching with Solr" AND text:apache • id:[* TO *] • *:* (all fields : all values)
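
As an illustration of the field:value and phrase forms above, here is a small sketch that composes such query strings. The helper names `field_query` and `all_of` are hypothetical, and the escaping is deliberately limited to quotes and backslashes inside phrases; it does not cover every special character of the query syntax.

```python
def field_query(field, value):
    """Format field:value, quoting the value as a phrase if it has whitespace."""
    if any(ch.isspace() for ch in value):
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'{field}:"{escaped}"'
    return f"{field}:{value}"


def all_of(*clauses):
    """Join clauses so that every one of them must match."""
    return " AND ".join(clauses)


query = all_of(
    field_query("title", "Searching with Solr"),
    field_query("text", "apache"),
)
print(query)
```
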
  • 14. Using Solr • Getting Data into Solr • Getting Data out of Solr
  • 15. Getting Data into Solr • POST it <add> <doc> <field name="abstract">Lorem ipsum</field> <field name="company">Blend Interactive</field> <field name="text">Lorem Ipsum</field> <field name="title">Some Title</field> </doc> [<doc> ... </doc>[<doc> ... </doc>]] </add>
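
The `<add>` message can be generated rather than written by hand, which also takes care of XML escaping. This is a sketch with Python's standard `xml.etree.ElementTree`; `build_add_xml` is a hypothetical helper, and the field names are the ones used on the slide.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Serialize a list of dicts into a Solr <add> message.

    A list or tuple value produces one <field> element per item, since a
    field can occur multiple times in a document.
    """
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            values = value if isinstance(value, (list, tuple)) else [value]
            for item in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(item)
    return ET.tostring(add, encoding="unicode")


print(build_add_xml([{"title": "Some Title", "company": "Blend Interactive"}]))
```
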
  • 18. Committing • Nothing shows up in the index until you commit • You can just POST <commit/> to: • http://<host>:<port>/solr/update
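
A commit is just another POST to the update URL. The sketch below only builds the request object so it can be inspected without a running server; the host and port are assumed to be the local example server, and `build_update_request` is a hypothetical helper name.

```python
import urllib.request

# Assumed host and port: the local example server from the slides.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_update_request(xml_body):
    """Build (but do not send) a POST request carrying an XML update message."""
    return urllib.request.Request(
        SOLR_UPDATE_URL,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


commit_request = build_update_request("<commit/>")
print(commit_request.get_method(), commit_request.full_url)
# To actually send it (requires a running Solr instance):
# urllib.request.urlopen(commit_request)
```

The same helper can carry an `<add>` or `<delete>` message; only the body changes.
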
  • 19. Getting Data out of Solr • http://localhost:8983/solr/select/?q=solr
  • 20. <response> <lst name="responseHeader"> <int name="status">0</int> <int name="QTime">19</int> <lst name="params"> <str name="q">solr</str> </lst> </lst> <result name="response" numFound="1" start="0"> <doc> <str name="abstract"> A brief introduction to using Apache Solr for implementing search for your website. </str> <str name="django_ct">codecamp.session</str> <str name="django_id">19</str> <str name="id">codecamp.session.19</str> <str name="text"> Searching with Solr: An Introduction A brief introduction to using Apache Solr for implementing search for your website. </str> <str name="title">Searching with Solr: An Introduction</str> </doc> </result> </response>
  • 23. Getting Data out of Solr: JSON • http://localhost:8983/solr/select/?q=solr&wt=json
  • 24. { "responseHeader": { "status":0, "QTime":0, "params": { "wt":"json", "q":"solr" } }, "response": { "numFound":1, "start":0, "docs":[{ "django_id":"19", "title":"Searching with Solr: An Introduction", "text":"Searching with Solr: An Introduction\nA brief introduction to using Apache Solr for implementing search for your website.", "abstract":"A brief introduction to using Apache Solr for implementing search for your website.", "django_ct":"codecamp.session","id":"codecamp.session.19" }] } }
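
Once the response is JSON, reading it from client code takes a few lines. The sketch below parses an abridged copy of the response shown above (the "text" and "abstract" fields are omitted for brevity); `summarize` is a hypothetical helper name.

```python
import json

# Abridged copy of the response on the slide ("text" and "abstract" omitted).
SAMPLE_RESPONSE = """
{
  "responseHeader": {"status": 0, "QTime": 0,
                     "params": {"wt": "json", "q": "solr"}},
  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [{
      "django_id": "19",
      "title": "Searching with Solr: An Introduction",
      "django_ct": "codecamp.session",
      "id": "codecamp.session.19"
    }]
  }
}
"""


def summarize(raw_json):
    """Return (numFound, [(id, title), ...]) from a select response."""
    body = json.loads(raw_json)
    if body["responseHeader"]["status"] != 0:
        raise ValueError("Solr reported a non-zero status")
    result = body["response"]
    return result["numFound"], [(d["id"], d.get("title")) for d in result["docs"]]


print(summarize(SAMPLE_RESPONSE))
```
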
  • 25. Deleting Data from Solr • POST it <delete><id>codecamp.session.19</id></delete> <delete><query>company:blend</query></delete>
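
Both delete messages can be produced the same way as the add message. This is a sketch with hypothetical helper names `delete_by_id` and `delete_by_query`; the resulting strings would be POSTed to the update URL and followed by a commit.

```python
import xml.etree.ElementTree as ET


def delete_by_id(doc_id):
    """Build a <delete> message that removes one document by its id."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = doc_id
    return ET.tostring(delete, encoding="unicode")


def delete_by_query(query):
    """Build a <delete> message that removes every document matching a query."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")


print(delete_by_id("codecamp.session.19"))
print(delete_by_query("company:blend"))
```
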
  • 26. The Solr Schema • schema.xml • Defines ‘types’ used in the webapp • Defines the fields • Defines ‘copyfields’ • Read the schema inside the example project for more
  • 27. The Solr Schema • Types • Define how a field and query should be processed • Word Stemming • Case Folding • How would you handle a search for ‘C.I.A.’? • Dates, ints, floats, etc. are defined here as well • 2 Modes • Index Time • Query Time
  • 28. <fieldType name="text" class="solr.TextField" positionIncrementGap="100"> <analyzer type="index"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/> </analyzer> <analyzer type="query"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/> </analyzer> </fieldType>
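
To get a feel for what the index-time analyzer above does, here is a toy imitation in Python. It is not Solr code: it only mimics whitespace tokenization followed by a case-insensitive stop-word filter, the stop-word set is an invented stand-in for stopwords.txt, and the query-time synonym step is left out.

```python
# Hypothetical stand-in for the contents of stopwords.txt.
STOPWORDS = {"a", "an", "the", "to", "for"}


def whitespace_tokenize(text):
    """Split on runs of whitespace, like a whitespace tokenizer."""
    return text.split()


def stop_filter(tokens, stopwords=STOPWORDS, ignore_case=True):
    """Drop stop words; with ignore_case the comparison is case-insensitive."""
    kept = []
    for token in tokens:
        key = token.lower() if ignore_case else token
        if key not in stopwords:
            kept.append(token)
    return kept


def analyze(text):
    """Run the toy index-time chain: tokenize, then remove stop words."""
    return stop_filter(whitespace_tokenize(text))


print(analyze("A brief introduction to Solr"))
```

Note that the surviving tokens keep their original case: this chain has no lower-casing step, so in this toy model "Solr" and "solr" remain different tokens, which is the kind of case-folding decision a field type has to make.
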
  • 31. Fields • The elements of a document • Both Predefined and Dynamic • Fields may occur multiple times • May be indexed and/or stored
  • 32. <fields> <!-- general --> <field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true"/> <field name="django_ct" type="string" indexed="true" stored="true" multiValued="false" /> <field name="django_id" type="string" indexed="true" stored="true" multiValued="false" /> <!-- dynamic --> <dynamicField name="*_i" type="sint" indexed="true" stored="true"/> <dynamicField name="*_s" type="string" indexed="true" stored="true"/> <dynamicField name="*_l" type="slong" indexed="true" stored="true"/> <dynamicField name="*_t" type="text" indexed="true" stored="true"/> <dynamicField name="*_b" type="boolean" indexed="true" stored="true"/> <dynamicField name="*_f" type="sfloat" indexed="true" stored="true"/> <dynamicField name="*_d" type="sdouble" indexed="true" stored="true"/> <dynamicField name="*_dt" type="date" indexed="true" stored="true"/> <!-- app --> <field name="bio" type="text" indexed="true" stored="true" multiValued="false" /> <field name="title" type="text" indexed="true" stored="true" multiValued="false" /> <field name="text" type="text" indexed="true" stored="true" multiValued="false" /> <field name="abstract" type="text" indexed="true" stored="true" multiValued="false" /> <field name="full_name" type="text" indexed="true" stored="true" multiValued="false" /> <field name="company" type="text" indexed="true" stored="true" multiValued="false" /></fields>
  • 35. Copy Fields • Two Main Uses • Analyze fields in different ways • Concatenate Fields
  • 38. <copyField source="bio" dest="df_text" /> <copyField source="year" dest="century" maxChars="2"/> • 2000 would be stored as 20 • Useful for custom faceting
  • 39. The Solr Config File • solrconfig.xml • Defines request handlers, defaults, & caches • Read the solrconfig.xml inside the example project for more
  • 40. Other Solr Tools • Debug Query • Boost Functions • Search Faceting • Search Filters • Search Highlighting • Solr Admin
  • 41. Debug Query Option • Add &debugQuery=on to request parameters • Returns a parsed form of the query
  • 42. <lst name="debug"> <str name="rawquerystring">solr</str> <str name="querystring">solr</str> <str name="parsedquery">text:solr</str> <str name="parsedquery_toString">text:solr</str> <lst name="explain"> <str name="codecamp.session.19"> 1.2147729 = (MATCH) fieldWeight(text:solr in 17), product of: 1.4142135 = tf(termFreq(text:solr)=2) 3.9267395 = idf(docFreq=2, maxDocs=56) 0.21875 = fieldNorm(field=text, doc=17) </str> </lst>
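
The numbers in the explain output can be reproduced by hand. The sketch below assumes Lucene's classic TF-IDF similarity, where tf is the square root of the term frequency and idf is 1 + ln(maxDocs / (docFreq + 1)); the field norm is taken directly from the slide rather than recomputed.

```python
import math

# Inputs copied from the explain output on the slide.
term_freq = 2
doc_freq = 2
max_docs = 56
field_norm = 0.21875

# Assumed formulas: Lucene's classic (TF-IDF) similarity.
tf = math.sqrt(term_freq)
idf = 1.0 + math.log(max_docs / (doc_freq + 1))
score = tf * idf * field_norm

print(f"tf={tf:.4f} idf={idf:.4f} score={score:.4f}")
```

The three values agree with the slide's 1.4142135, 3.9267395 and 1.2147729 up to floating-point rounding, which shows that the final score is simply the product of the three factors.
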
  • 45. Boost Function • Allows you to influence results at query time • Really useful for tuning scoring • You can also boost at index time • q=blend&qf=text^2 company
  • 47. Boost Function • More information available - http://wiki.apache.org/solr/SolrRelevancyFAQ • Can use both dismax and standard query handlers, I use dismax • &bq=text:blend^2
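
The boost parameters need URL-encoding because of the `^` and the space in `qf`. Here is a sketch that builds a dismax request's query string; `dismax_params` is a hypothetical helper, and selecting the parser with `defType=dismax` is an assumption about how the handler is configured (the slides do not show it).

```python
from urllib.parse import urlencode, parse_qs


def dismax_params(user_query, field_boosts, boost_query=None):
    """Build dismax parameters; field_boosts is a list of (field, boost or None)."""
    qf = " ".join(
        field if boost is None else f"{field}^{boost}"
        for field, boost in field_boosts
    )
    params = {"defType": "dismax", "q": user_query, "qf": qf}
    if boost_query is not None:
        params["bq"] = boost_query
    return params


# Matches in "text" count double; documents matching text:blend get extra weight.
query_string = urlencode(
    dismax_params("blend", [("text", 2), ("company", None)], boost_query="text:blend^2")
)
print(query_string)
```
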
  • 48. Solr Faceting • What is a facet? • “Interaction style where users filter a set of items by progressively selecting from only valid values of a faceted classification system” - Keith Instone, SOASIS&T, July 8, 2004 • What does it look like? • Make sure to use an untokenized field (e.g. string) • “San Jose” != “san”+“jose”
  • 49. q=*:* • facet=on • facet.field=company
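
In a JSON response, facet counts for a field arrive by default as a flat list that alternates value and count. The sketch below builds the request parameters from the slide (with `wt=json` added) and converts such a list into a dictionary; the sample reply is hypothetical and its company names and counts are invented.

```python
from urllib.parse import urlencode

# Parameters from the slide, plus wt=json so the reply is easy to parse.
facet_query = urlencode(
    [("q", "*:*"), ("facet", "on"), ("facet.field", "company"), ("wt", "json")]
)


def facet_counts(response, field):
    """Turn the flat [value, count, value, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


# Hypothetical reply fragment; the names and counts are made up.
sample_response = {
    "facet_counts": {
        "facet_fields": {"company": ["Blend Interactive", 3, "Example Co", 1]}
    }
}

print(facet_query)
print(facet_counts(sample_response, "company"))
```
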
  • 51. Solr Filter Query • Used to narrow your search query • Restrict the super set of documents that can be returned • ‘fq’ parameter (short for Filter Query) • q=*:* • fq=company:blend
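
Several filters can be applied at once by repeating the `fq` parameter, which fits the "progressively selecting" style of faceted navigation. A small sketch, with `filtered_query` as a hypothetical helper name and `django_ct:codecamp.session` borrowed from the sample document earlier in the deck:

```python
from urllib.parse import urlencode, parse_qsl


def filtered_query(main_query, filters):
    """Encode a main query plus one fq parameter per filter."""
    params = [("q", main_query)] + [("fq", f) for f in filters]
    return urlencode(params)


# The slide's filter, plus a second one to show that fq can repeat.
query_string = filtered_query("*:*", ["company:blend", "django_ct:codecamp.session"])
print(query_string)
```
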
  • 52. Search Highlighting • Allow Solr to generate your highlight
  • 54. hl=true • hl.simple.pre=<b> • hl.simple.post=</b> • hl.fragsize=200 • hl.requireFieldMatch=false • hl.fl=text bio title • hl.snippets=1
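
The highlighting parameters contain angle brackets and spaces, so they should be URL-encoded before being appended to a request. A sketch that reuses the values from the slide; `highlighted_select_query` is a hypothetical helper name.

```python
from urllib.parse import urlencode, parse_qs

# Highlighting settings copied from the slide.
HIGHLIGHT_PARAMS = {
    "hl": "true",
    "hl.simple.pre": "<b>",
    "hl.simple.post": "</b>",
    "hl.fragsize": 200,
    "hl.requireFieldMatch": "false",
    "hl.fl": "text bio title",
    "hl.snippets": 1,
}


def highlighted_select_query(user_query):
    """Return an encoded query string with highlighting switched on."""
    return urlencode({"q": user_query, **HIGHLIGHT_PARAMS})


query_string = highlighted_select_query("solr")
print(query_string)
```
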
  • 55. Solr Admin • http://localhost:8983/solr/admin/ • Built in app for testing all search options • Field Analysis • Schema Browser • Full Query Interface • Solr Statistics • Solr Information • Many More Options
  • 56. Solr/Browse • Test your search configuration using the /browse requestHandler
  • 57. Resources • Apache Solr Website • http://lucene.apache.org/solr/ • Wiki, mailing list, bugs/features • Books