
Solr: Beyond the Basics



The Apache Solr search engine has become almost the default choice for adding superior search capabilities to a web application. In this talk we will go beyond the basics of Solr, looking at what it offers and how to set it up robustly for production use. We will plan and implement a document model in Solr, and look at how to index different document types with Solr Cell or index data from the web with the Nutch crawler. We will cover options for tuning queries and performance, and examine how best to use more advanced features like faceting, spelling correction and 'more like this'. Solr offers a language-agnostic web service, so client examples will be in PHP and Python, but the bulk of the content will be applicable to anyone looking to work well with Solr.

Published in: Technology

Solr: Beyond the Basics

  1. 1. Solr: BEYOND THE BASICS! Script: Ian Barber ( the internet! Editor: ian.barber@gmail.com
  2. 2. PREVIOUSLY... My site search was slow and the results were bad, but Solr saved me! $\mathrm{tf}_{i,j} \times \mathrm{idf}_i = \frac{n_{i,j}}{\sum_k n_{k,j}} \times \log \frac{|D|}{|\{d : t_i \in d\}|}$
  3. 3. Security comes first!
  4. 4. /etc/solr/solr.xml; Core (conf), Core (conf); /var/solr/data; /var/solr/lib
  5. 5. solr.xml:

          <solr sharedLib="/var/solr/lib" persistent="true">
            <cores adminPath="/admin/cores">
              <core default="true" instanceDir="main" name="main">
              </core>
            </cores>
          </solr>
  6. 6. Solr's secret plan! (Slide background: rotated excerpts from the comments in the example schema.xml and solrconfig.xml.)
  7. 7. Cache warming!

          <listener event="firstSearcher" class="solr.QuerySenderListener">
            <arr name="queries">
              <lst>
                <str name="q">solr rocks</str>
                <str name="start">0</str>
                <str name="rows">10</str>
              </lst>
              <lst>
                <str name="q">from solrconfig.xml</str>
              </lst>
            </arr>
          </listener>
  8. 8. Query; Index Configuration; Request Handlers; Search Components
  9. 9. Content fields; Type section; field types; search types
  10. 10. The CMS!
  12. 12. Permalink, Category, Author, Tags
  13. 13. Scientific analysis! How do we turn our text into tokens? Field type, storage, tokenisation, filters, and copy fields.
  14. 14. schema.xml:

          <fieldType name="text" class="solr.TextField">
            <analyzer>
              <tokenizer class="solr.WhitespaceTokenizerFactory" />
              <filter class="solr.StopFilterFactory"/>
              <filter class="solr.WordDelimiterFilterFactory"/>
              <filter class="solr.LowerCaseFilterFactory"/>
              <filter class="solr.SnowballPorterFilterFactory"/>
            </analyzer>
          </fieldType>
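
The order of the analyzer chain matters, and a toy model can make that visible. The sketch below imitates only the shape of the pipeline (whitespace tokenisation, a stop filter, then lowercasing) in plain Python. It is an illustration and not how Solr implements these factories: the stop list is invented, the stop filter is assumed to be case-sensitive, and the word delimiter and stemming steps are left out.

```python
STOP_WORDS = {"a", "and", "the"}  # hypothetical stop list, for illustration only


def whitespace_tokenize(text):
    """Split on runs of whitespace, leaving punctuation attached to tokens."""
    return text.split()


def stop_filter(tokens, stop_words=STOP_WORDS):
    """Drop tokens that exactly match a stop word (case-sensitive here)."""
    return [t for t in tokens if t not in stop_words]


def lowercase_filter(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]


def analyze(text):
    """Run the toy chain in the same order as the slide: tokenise, stop, lowercase."""
    tokens = whitespace_tokenize(text)
    tokens = stop_filter(tokens)
    return lowercase_filter(tokens)


if __name__ == "__main__":
    print(analyze("The quick fox and the dog"))
```

In this toy version the capitalised "The" survives, because the stop filter runs before lowercasing. That is the kind of ordering effect worth checking on the analysis page for a real field type.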
  15. 15. Original: O’Reilly’s wi-fi guide! | Standard: O, Reilly, S, wi, FI, GUIDE | Keyword: O’Reilly’s wi-fi guide! | Whitespace: O’Reilly’s, wi-fi, guide!
  16. 16. Stored: doc 1 → “My Phrase?”. Indexed: my → doc 1, phrase → doc 1.
  17. 17. Ian Barber → AN PRPR; Iain Barbour → AN PRPR

          <fieldtype name="phonetic" class="solr.TextField">
            <analyzer>
              <tokenizer class="solr.StandardTokenizerFactory"/>
              <filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/>
            </analyzer>
          </fieldtype>
  18. 18. Delimiters: O Reilly S OReillys wi FI wifi GUIDE

          <filter class="solr.WordDelimiterFilterFactory"
                  generateWordParts="1"
                  catenateWords="1"
                  generateNumberParts="1"
                  catenateNumbers="1"
                  catenateAll="0"
                  splitOnCaseChange="1"/>
  19. 19. Precision versus recall
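
For reference, the two measures can be computed from a set of returned documents and a set of documents judged relevant. This is a minimal sketch using the standard definitions; the document IDs in the usage line are made up.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant results returned / all results returned
    recall    = relevant results returned / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical IDs: four results returned, three documents actually relevant.
    print(precision_recall(["a", "b", "c", "d"], ["a", "b", "e"]))
```

Looser analysis (stemming, phonetic matching) tends to raise recall at the cost of precision, which is the trade-off this slide sets up.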
  20. 20. Stemming: O Reilli S OReilli wi FI wifi GUID

          <filter class="solr.SnowballPorterFilterFactory"
                  language="English"
                  protected="protwords.txt" />
  21. 21. Je ne parle pas anglais!
  23. 23. schema.xml:

          <fieldType name="tdate" class="solr.TrieDateField" omitNorms="true"
                     precisionStep="6" positionIncrementGap="0" />
          <fieldType name="lowercase" class="solr.TextField">
            <analyzer>
              <tokenizer class="solr.KeywordTokenizerFactory" />
              <filter class="solr.LowerCaseFilterFactory" />
            </analyzer>
          </fieldType>
  24. 24. permalink, date, category, tags, author
  25. 25. schema.xml:

          <fields>
            <field name="permalink" type="lowercase" required="true" />
            <field name="category" type="lowercase" />
            <field name="tag" type="lowercase" multiValued="true" />
            <field name="title" type="text" required="true"/>
            <field name="body" type="text" required="true" />
            <field name="author" type="lowercase" stored="false" multiValued="true" />
            <field name="date" type="tdate" multiValued="true" />
            <field name="lead_para" type="text" />
            <field name="phonetic" type="phonetic" />
            <field name="text" type="text" stored="false" multiValued="true" />
          </fields>
  26. 26.

          <!-- Copy Fields -->
          <copyField source="permalink" dest="text" />
          <copyField source="category" dest="text" />
          <copyField source="title" dest="text" />
          <copyField source="lead_para" dest="text" />
          <copyField source="body" dest="text" />
          <copyField source="author" dest="text" />
          <copyField source="category" dest="phonetic" />
          <copyField source="title" dest="phonetic" />
          <copyField source="lead_para" dest="phonetic" />
          <copyField source="body" dest="phonetic" />
          <copyField source="author" dest="phonetic" />
          <!-- ID -->
          <uniqueKey>permalink</uniqueKey>
  27. 27.

          from solr import SolrConnection  # solrpy client

          s = SolrConnection("http://localhost:8080/solr/main")
          doc = dict(
              permalink = "",
              category = "strategy",
              title = "DPCO: A Framework For Synergy",
              body = "DPCO, or Dynamic Performance Class Organisation is a "
                     "ISO90210 quality oriented management process [...]",
              author = "Sean Alison",
              date = "2011-03-01T00:00:00Z",
              source_site = "",
          )
          s.add(**doc)  # solrpy's add() takes the fields as keyword arguments
          s.commit()
  28. 28.

          <add>
            <doc>
              <field name="body">DPCO, or Dynamic Performance Class [...]</field>
              <field name="category">strategy</field>
              <field name="permalink"></field>
              <field name="source_site"></field>
              <field name="title">DPCO: A Framework For Synergy</field>
              <field name="date">2011-03-01T00:00:00Z</field>
              <field name="author">Sean Alison</field>
            </doc>
          </add>
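
An update message of this shape can also be built without a client library. The sketch below uses only Python's standard `xml.etree.ElementTree`; `build_add_xml` is a hypothetical helper name, and the field values in the usage line are illustrative. List values become repeated `<field>` elements, which is how multi-valued fields such as `tag` are sent.

```python
import xml.etree.ElementTree as ET


def build_add_xml(doc):
    """Serialise a dict of field values into a Solr <add><doc> update message.

    List or tuple values are written as repeated <field> elements.
    ElementTree escapes characters such as & and < in the values.
    """
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        for v in values:
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(v)
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    print(build_add_xml({"category": "strategy", "tag": ["iso", "quality"]}))
```

The resulting string is what would be POSTed to the core's update handler, followed by a commit.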
  29. 29. time for the gadgets!
  30. 30. solrconfig.xml:

          <requestHandler name="/dataimport"
                          class="org.apache.solr.handler.dataimport.DataImportHandler">
            <lst name="defaults">
              <str name="config">db-data-config.xml</str>
            </lst>
          </requestHandler>
  31. 31. data-config.xml:

          <dataConfig>
            <dataSource driver="com.mysql.jdbc.Driver"
                        url="jdbc:mysql://localhost:3306/cms"
                        user="root" password="password" />
            <document>
              <entity name="story"
                      query="SELECT, s.content, CONCAT (u.first_name, , u.last_name) as author [...] s.status_id = 1"
                      deltaImportQuery="SELECT, s.content [...] AND = ${}"
                      deltaQuery="SELECT id FROM stories WHERE modified > ${dataimporter.last_index_time}"
                      transformer="TemplateTransformer,HTMLStripTransformer">
  32. 32.

                <field column="permalink" name="permalink" template="${story.slug}" />
                <field column="publish_date" name="date" />
                <field column="content" name="body" stripHTML="true" />
                <field column="source_site" template="cms" />
                [...]
                <entity name="topic" query="SELECT [...] st.item_id=${}">
                  <field column="category" />
                </entity>
              </entity>
            </document>
          </dataConfig>
  33. 33. http://SOLR:8080/solr/main/dataimport

          <response>
            <str name="command">full-import</str>
            <str name="status">busy</str>
            <str name="importResponse">A command is still running...</str>
            <lst name="statusMessages">
              <str name="Time Elapsed">0:0:14.979</str>
              <str name="Total Requests made">5523</str>
              <str name="Total Rows Fetched">5522</str>
              <str name="Total Documents Processed">2760</str>
              <str name="Total Documents Skipped">0</str>
              <str name="Full Dump Started">2011-03-02 15:48:00</str>
            </lst>
          </response>
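
A status response like this is easy to poll from a script. The following is a small sketch that parses the XML with the standard library; `parse_import_status` is a hypothetical helper, and `SAMPLE` is a trimmed copy of the response shown on the slide rather than live output.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the status response from the slide, used as test input.
SAMPLE = """<response>
  <str name="command">full-import</str>
  <str name="status">busy</str>
  <lst name="statusMessages">
    <str name="Total Rows Fetched">5522</str>
    <str name="Total Documents Processed"> 2760</str>
  </lst>
</response>"""


def parse_import_status(xml_text):
    """Return (status, messages) from a DataImportHandler status response."""
    root = ET.fromstring(xml_text)
    status = root.findtext("str[@name='status']")
    messages = {
        el.get("name"): (el.text or "").strip()
        for el in root.findall("lst[@name='statusMessages']/str")
    }
    return status, messages


if __name__ == "__main__":
    status, messages = parse_import_status(SAMPLE)
    print(status, messages["Total Documents Processed"])
```

A caller could keep requesting the handler URL until the status is no longer `busy`, then report the document counts.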
  34. 34. The SOLR CELL!
  35. 35. solrconfig.xml:

          <requestHandler name="/update/extract"
                          class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
            <lst name="defaults">
              <str name="uprefix">ignored_</str>
            </lst>
          </requestHandler>
  36. 36. schema.xml:

          <fieldtype name="ignored" stored="false" indexed="false"
                     multiValued="true" class="solr.StrField" />
  37. 37. Dynamic fields: can it be... schema free?!

          <dynamicField name="ignored_*" type="ignored" indexed="false" stored="false"/>
  38. 38.

          $ curl -v "http://localhost:8080/solr/main/update/extract?literal.source_site=files&literal.permalink=" --data-binary @arch.pdf -H 'Content-Type: application/pdf'
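
The `literal.*` parameters in the extract URL set fields that cannot be read from the file itself. A minimal sketch of assembling such a URL with correct escaping follows; `extract_url` is a hypothetical helper, and the permalink value is a placeholder, not the one used in the talk.

```python
from urllib.parse import urlencode


def extract_url(base, literals):
    """Build an /update/extract URL, passing each field as a literal.<name> parameter."""
    params = {"literal." + name: value for name, value in literals.items()}
    return base.rstrip("/") + "/update/extract?" + urlencode(params)


if __name__ == "__main__":
    # The permalink here is a made-up example value.
    print(extract_url(
        "http://localhost:8080/solr/main",
        {"source_site": "files", "permalink": "http://example.com/arch.pdf"},
    ))
```

Encoding the values matters because a permalink is itself a URL and would otherwise break the query string.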
  39. 39. A crawler!
  40. 40. regex-urlfilter.txt:

          # skip some protocols
          -^(https|telnet|file|ftp|mailto):
          -[?*!@=]
          # allow urls in defined domain
          +^http://([a-z0-9-A-Z]*\.)*
          # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
          -.*(/[^/]+)/[^/]+\1/[^/]+\1/
          # deny anything else
          -.
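
The filter file is read top to bottom: the first pattern found in a URL decides whether it is accepted (`+`) or rejected (`-`), and a URL that matches nothing is rejected. The sketch below mimics that evaluation in Python as an illustration only. `example.com` stands in for the crawl domain, the character class is simplified, and the loop-breaking rule is omitted.

```python
import re

# Hypothetical rules modelled on the slide; example.com is a placeholder domain.
RULES_TEXT = r"""
# skip some protocols
-^(https|telnet|file|ftp|mailto):
-[?*!@=]
# allow urls in defined domain
+^http://([a-zA-Z0-9-]*\.)*example\.com/
# deny anything else
-.
"""


def load_rules(text):
    """Parse '+regex' / '-regex' lines into (allow, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(url, rules):
    """First matching rule wins; no match means the URL is rejected."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


if __name__ == "__main__":
    rules = load_rules(RULES_TEXT)
    for url in ("http://www.example.com/page", "http://www.example.com/page?id=1"):
        print(url, accepts(url, rules))
```

Because evaluation stops at the first match, the order of the lines is part of the configuration: a deny rule placed after a broader allow rule never fires for URLs the allow rule already matched.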
  41. 41. solrindex-mapping.xml:

          <mapping>
            <fields>
              <field dest="body" source="content" />
              <field dest="source_site" source="site" />
              <field dest="title" source="title" />
              <field dest="ignored_host" source="host" />
              <field dest="ignored_segment" source="segment" />
              <field dest="ignored_boost" source="boost" />
              <field dest="ignored_digest" source="digest" />
              <field dest="date" source="tstamp" />
              <field dest="permalink" source="url" />
            </fields>
            <uniqueKey>permalink</uniqueKey>
          </mapping>
  42. 42.

          $ echo "" > urls/seed.txt
          $ bin/nutch inject /var/nutch/crawldb urls
          $ bin/nutch generate /var/nutch/crawldb /var/nutch/segments
          $ export SEGMENT=/var/nutch/segments/`ls -tr /var/nutch/segments | tail -1`
          $ bin/nutch fetch $SEGMENT -noParsing
          $ bin/nutch parse $SEGMENT
          $ bin/nutch updatedb /var/nutch/crawldb $SEGMENT -filter -normalize
          $ bin/nutch invertlinks /var/nutch/linkdb -dir /var/nutch/segments
          $ bin/nutch solrindex http://localhost:8080/solr/main /var/nutch/crawldb /var/nutch/linkdb/ /var/nutch/segments/*
  43. 43. solr goes to work!
  44. 44. he has dismax!
  45. 45. solrconfig.xml:

          <requestHandler name="dismax" class="solr.SearchHandler" default="true">
            <lst name="defaults">
              <str name="defType">dismax</str>
              <str name="echoParams">explicit</str>
              <float name="tie">0.01</float>
              <str name="qf">
                text^0.5 category^1.5 title^2 body^1 permalink^10.0 author^1.8 tag^1.3
              </str>
              <str name="pf">
                text^0.2 title^4 author^1.8 body^1
              </str>
              <str name="mm">3&lt;60%</str>
            </lst>
          </requestHandler>
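
The `mm` value `3<60%` is the least obvious setting in this handler. As a sketch of how such a conditional spec is usually read: with up to three optional clauses every clause must match, and above three only 60% of them (rounded down) must match. The function below is a hypothetical illustration of that arithmetic for this single-condition form, not Solr's parser.

```python
def min_should_match(num_clauses, threshold=3, percent=60):
    """Clauses that must match under a conditional spec like '3<60%'.

    Up to `threshold` clauses: all are required.
    Above it: `percent` of the clauses, rounded down.
    """
    if num_clauses <= threshold:
        return num_clauses
    return (num_clauses * percent) // 100


if __name__ == "__main__":
    for n in (1, 3, 4, 5, 10):
        print(n, min_should_match(n))
```

Integer arithmetic is used so the rounding-down step is exact for every clause count.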
  46. 46.

          from solr import SolrConnection  # solrpy client

          url = "http://localhost:8080/solr/main"
          s = SolrConnection(url)
          response = s.query("idie manager")
          for hit in response.results:
              print(hit["title"])
              print(hit["body"])

          $ python
          Overview of the IDIE manager
          To help with those implementing IDIE [...]
          IDIE: The 801g Of Talent Management
          Inspiration-Direction-Influence [...]
  47. 47.

          <str name="bf">recip(ms(NOW,date),3.16e-11,1,1)</str>

          FunctionQuery(1.0/(3.16E-11*float(ms(const(1299450070912),date(date)))+1.0)), product of:
            0.9974636 = 1.0/(3.16E-11*float(ms(const(1299450070912),date(date)=1299369600000))+1.0)
            1.0 = boost
            0.03730806 = queryNorm
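
The boost in the debug output can be reproduced by hand. `recip(x, m, a, b)` evaluates to `a / (m * x + b)`, and here `x` is the document's age in milliseconds. The sketch below plugs in the two timestamps from the slide; the one-year figure is an added illustration.

```python
def recip(x, m, a, b):
    """Reciprocal function used by the boost: a / (m * x + b)."""
    return a / (m * x + b)


NOW_MS = 1299450070912   # const(...) from the debug output
DOC_MS = 1299369600000   # the document's date value from the debug output

boost = recip(NOW_MS - DOC_MS, 3.16e-11, 1, 1)

# 3.16e-11 is roughly 1 / (milliseconds in a year), so a year-old
# document gets about half the boost of a brand new one.
YEAR_MS = 365 * 24 * 3600 * 1000
year_old = recip(YEAR_MS, 3.16e-11, 1, 1)

if __name__ == "__main__":
    print(round(boost, 7), round(year_old, 3))
```

The document in the example is under a day old, which is why its boost is still close to 1.0.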
  48. 48. going beyond just search results!
  49. 49.

          $solr = new Apache_Solr_Service('localhost', 8080, '/solr/main');
          $query = "badly drawn";
          $p = array(
              'facet' => "true",
              'facet.field' => 'category',
              'facet.mincount' => 1,
          );
          $r = $solr->search($query, 0, 5, $p);
          foreach ($r->facet_counts->facet_fields->category as $cat => $count) {
              echo $cat, " ", $count, PHP_EOL;
          }
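
The same field facet request can be made from Python without a client library. This sketch only builds the query string and reshapes the counts. It assumes the JSON response writer (`wt=json`) with its default flat list of alternating values and counts. Both helper names are hypothetical and the sample counts are invented.

```python
from urllib.parse import urlencode


def facet_params(query, field, mincount=1):
    """Query string for a search that also facets on one field."""
    return urlencode([
        ("q", query),
        ("facet", "true"),
        ("facet.field", field),
        ("facet.mincount", mincount),
        ("wt", "json"),
    ])


def pair_counts(flat):
    """Turn ["Reviews", 12, "News", 3] into [("Reviews", 12), ("News", 3)]."""
    return list(zip(flat[0::2], flat[1::2]))


if __name__ == "__main__":
    print(facet_params("badly drawn", "category"))
    print(pair_counts(["Reviews", 12, "News", 3]))  # made-up counts
```

In a real response the flat list would come from `facet_counts.facet_fields.category` in the decoded JSON.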
  50. 50.

          $query = "";
          $p = array(
              'q.alt' => "*:*",
              "facet" => "true",
              "" => 'date',
              "" => "NOW/YEAR-6MONTHS",
              "" => "NOW/YEAR",
              "" => "+1MONTH",
              "fq" => "category: Reviews",
          );
          $r = $solr->search($query, 0, 0, $p);
          foreach ($r->facet_counts->facet_dates->date as $date => $count) {
              echo $date, " ", $count, PHP_EOL;
          }
  51. 51.

          $query = "";
          $p = array(
              'q.alt' => "*:*",
              'facet' => "true",
              'facet.mincount' => 1,
              "facet.query" => array("title:gig", "title:album"),
              "fq" => "category:Reviews",
          );
          $r = $solr->search($query, 0, 0, $p);
          foreach ($r->facet_counts->facet_queries as $query => $count) {
              echo $query, " ", $count, PHP_EOL;
          }
  52. 52. What fields to facet? How to facet? What facets to show?
  53. 53. solrconfig.xml:

          <requestHandler name="mlt" class="solr.MoreLikeThisHandler">
            <lst name="defaults">
              <str name="defType">mlt</str>
              <str name="mlt">true</str>
              <str name="mlt.fl">body title</str>
              <str name="mlt.match.include">false</str>
            </lst>
          </requestHandler>
  54. 54.

          $solr = new Apache_Solr_Service('localhost', 8080, '/solr/main');
          $query = "Losing my backpacking virginity";
          $p = array('qt' => "mlt");
          $results = $solr->search($query, 0, 3, $p);
          foreach ($results->response->docs as $doc) {
              echo $doc->title, PHP_EOL;
          }

          $ php mltquery.php
          Backpacking across USA social media way
          Safe solo travel on New York holidays
          Cracking The Big Apples Big 10
  55. 55. Thanks! Script: Ian Barber ( the internet! Editor: ian.barber@gmail.com
  56. 56. Some useful links!
  57. 57. Bonus content!
  58. 58. solrconfig.xml:

          <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
            <str name="queryAnalyzerFieldType">textSpell</str>
            <lst name="spellchecker">
              <str name="name">default</str>
              <str name="field">spell</str>
              <str name="buildOnCommit">true</str>
              <str name="spellcheckIndexDir">/var/lib/solr/spellchecker</str>
            </lst>
          </searchComponent>
  59. 59. schema.xml:

          <fieldType name="textSpell" class="solr.TextField"
                     positionIncrementGap="100" omitNorms="true">
            <analyzer>
              <tokenizer class="solr.StandardTokenizerFactory" />
              <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
              <filter class="solr.LowerCaseFilterFactory" />
              <filter class="solr.StandardFilterFactory" />
            </analyzer>
          </fieldType>
  60. 60. Dismax handler:

          [...]
              <int name="ps">10</int>
              <int name="qs">5</int>
              <str name="spellcheck.onlyMorePopular">true</str>
              <str name="spellcheck.extendedResults">false</str>
              <str name="spellcheck.count">1</str>
            </lst>
            <arr name="last-components">
              <str>spellcheck</str>
            </arr>
          </requestHandler>
  61. 61.

          $solr = new Apache_Solr_Service('localhost', 8080, '/solr/main');
          $p = array(
              'spellcheck' => true,
              'spellcheck.collate' => true,
          );
          $results = $solr->search("roose", 0, 5, $p);
          echo "Did you mean " . $results->spellcheck->suggestions->collation, PHP_EOL;

          $ php spellquery.php
          Did you mean rose
  62. 62.

          include_once "Apache/Solr/Service.php";
          $solr = new Apache_Solr_Service('localhost', 8080, '/solr/main');
          $query = "album review";
          $p = array('sort' => 'title_sort desc');
          $res = $solr->search($query, 0, 10, $p);
          foreach ($res->response->docs as $doc) {
              echo $doc->title, PHP_EOL;
          }

          <field name="title_sort" type="lowercase" indexed="true" stored="false" />
          <copyField source="title" dest="title_sort" />
  63. 63.

          $ php sortquery.php
          Zola Jesus album review - Stridulum II
          Zero 7 album review - Record
          Zebra and Giraffe
          Young Knives video interview part 2
          Young Knives - Road to V winners on tour
          You Me At Six @ Wembley Arena, London
          You Me At Six - Hold Me Down
          Yet again... Good Shoes @ ULU, London
          Yelle: North American tour review
          Yelle: interview with a French pop artiste
  64. 64.

          <highlighting>
            <fragmenter name="regex" class="[..]highlight.RegexFragmenter">
              <lst name="defaults">
                <int name="hl.fragsize">70</int>
                <float name="hl.regex.slop">0.5</float>
                <str name="hl.regex.pattern">[-\w ,/\n\"']{20,200}</str>
              </lst>
            </fragmenter>
            <formatter name="html" class="[...]highlight.HtmlFormatter" default="true">
              <lst name="defaults">
                <str name="hl.simple.pre"><![CDATA[<em>]]></str>
                <str name=""><![CDATA[</em>]]></str>
              </lst>
            </formatter>
          </highlighting>
  65. 65.

          $so = new Apache_Solr_Service('localhost', 8080, '/solr/main');
          $q = "album review";
          $r = $so->search($q, 0, 5, array('hl' => "true"));
          foreach ($r->response->docs as $doc) {
              echo $r->highlighting->{$doc->permalink}->title[0], PHP_EOL;
          }

          $ php highlightquery.php
          Fenech Soler <em>album</em> <em>review</em>
          Weezer - Hurley <em>album</em> <em>review</em>
          Feeder <em>album</em> <em>review</em> - Renegades
  66. 66. Replication, sharding, caching: the masters of scaling are here!
  67. 67. IS SOLR DEFEATED?

          from solr import SolrConnection  # solrpy client

          url = "http://localhost:8080/solr/main"
          s = SolrConnection(url)
          response = s.query("ISO90210")
          if response.results.numFound == 0:
              print("No results found!")

          $ python
          No results found!
  68. 68. http://solrurl:8080/solr/main/admin/analysis.jsp
  69. 69. /solr/select/?q="iso 90210"&debugQuery=true

          <lst name="debug">
            <str name="rawquerystring">"iso 90210"</str>
            <str name="querystring">"iso 90210"</str>
            <str name="parsedquery">+DisjunctionMaxQuery((body:"iso 90210")~0.01) DisjunctionMaxQuery((body:"iso90210")~0.01)</str>
  70. 70. /solr/select/?q=iso 90210&debugQuery=true

          <lst name="debug">
            <str name="rawquerystring">iso 90210</str>
            <str name="querystring">iso 90210</str>
            <str name="parsedquery">+((DisjunctionMaxQuery((body:iso)~0.01) DisjunctionMaxQuery((body:90210)~0.01))~2) DisjunctionMaxQuery((body:"iso 90210")~0.01)</str>
            <str name="parsedquery_toString">+(((body:iso)~0.01 (body:90210)~0.01)~2) (body:"iso 90210")~0.01</str>
  71. 71. &explainOther=90210

          0.0 = (NON-MATCH) Failure to meet condition(s) of required/prohibited clause(s)
            0.0 = no match on required clause (body:"iso 90210")
              0.0 = weight(body:"iso 90210" in 0), product of:
                0.6953707 = queryWeight(body:"iso 90210"), product of:
                  3.8325815 = idf(body: iso=1 90210=1)
                  0.18143663 = queryNorm
                0.0 = fieldWeight(body:"iso 90210" in 0), product of:
                  0.0 = tf(phraseFreq=0.0)
                  3.8325815 = idf(body: iso=1 90210=1)
                  0.15625 = fieldNorm(field=body, doc=0)
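
The explain tree is just nested multiplication, so the zero score can be traced by hand. The sketch below recomputes the two products from the numbers in the output; the variable names are chosen for this illustration.

```python
# Values copied from the explain output.
idf = 3.8325815
query_norm = 0.18143663
field_norm = 0.15625
tf = 0.0  # phraseFreq=0.0: the phrase "iso 90210" never occurs in the body

query_weight = idf * query_norm       # the 0.6953707 line
field_weight = tf * idf * field_norm  # the fieldWeight line
score = query_weight * field_weight   # weight(...) = product of the two

if __name__ == "__main__":
    print(round(query_weight, 7), field_weight, score)
```

The query side of the product is healthy; the score collapses only because the phrase frequency in the document is zero, which is what makes the required clause a non-match.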
  72. 72. solrconfig.xml:

              <str name="echoParams">explicit</str>
              <float name="tie">0.01</float>
              <str name="qf">
                text^0.5 category^1.5 title^2 body^1 permalink^10.0 author^1.8 tag^1.3
              </str>
              <str name="pf">
                text^0.2 title^4 author^1.8 body^1
              </str>
              <str name="mm">3&lt;60%</str>
              <int name="ps">10</int>
              <int name="qs">5</int>
            </lst>
  73. 73.

          from solr import SolrConnection  # solrpy client

          url = "http://localhost:8080/solr/main"
          s = SolrConnection(url)
          response = s.query("ISO90210")
          if response.results.numFound == 0:
              print("No results found!")

          $ python
          DPCO: A Framework For Synergy
          DPCO, or Dynamic Performance Class Organisation is a ISO90210 quality [...]