code4lib 2011 preconference: What's New in Solr (since 1.4.1)


code4lib 2011 preconference, presented by Erik Hatcher of Lucid Imagination.

Abstract: The library world is fired up about Solr. Practically every next-gen catalog is using it (via Blacklight, VuFind, or other technologies). Solr has continued improving in some dramatic ways, including geospatial support, field collapsing/grouping, extended dismax query parsing, pivot/grid/matrix/tree faceting, autosuggest, and more. This session will cover all of these new features, showcasing live examples of them all, including anything new that is implemented prior to the conference.

Published in: Technology

  1. What's New in Solr? code4lib 2011 preconference, Bloomington, IN. Presented by Erik Hatcher of Lucid Imagination.
  2. about me: spoken at several code4lib conferences (keynoted Athens '07, along with the pioneering Solr preconference; Providence '09, "Rising Sun"; pre-conferenced Asheville '10, "Solr Black Belt"). Co-authored "Lucene in Action", first edition; ghost/toast on second edition. Lucene and Solr committer. Library-world claims to fame are founding and naming Blacklight, and being the original developer on Collex and the Rossetti Archive search. Now at Lucid Imagination, dedicated to Lucene/Solr support/services/training/etc.
  3. abstract: The library world is fired up about Solr. Practically every next-gen catalog is using it (via Blacklight, VuFind, or other technologies). Solr has continued improving in some dramatic ways, including geospatial support, field collapsing/grouping, extended dismax query parsing, pivot/grid/matrix/tree faceting, autosuggest, and more. This session will cover all of these new features, showcasing live examples of them all, including anything new that is implemented prior to the conference.
  4. LIA2 - Lucene in Action. Published: July 2010. In this second edition: performing hot backups; using numeric fields; tuning for indexing or searching speed; boosting matches with payloads; creating reusable analyzers; adding concurrency with threads; four new case studies, and more.
  5. Version Number: Which one ya talking 'bout, Willis? 3.1? 4.0?? TRUNK?? Playing with fire: index format changes are to be expected; reindexing recommended/required. Solr/Lucene merged development codebases; releases should occur in lock-step moving forward.
  6. dependencies: November 2009: Solr 1.4 (Lucene 2.9.1); June 2010: Solr 1.4.1 (Lucene 2.9.3); Spring 2011(?): Solr 3.1 (Lucene 3.1); TRUNK: Solr 4.x (Lucene TRUNK)
  7. lucene: per-segment field cache, etc.; Unicode and analysis improvements throughout; analysis "attributes"; AutomatonQuery: RegexpQuery, WildcardQuery; flexible indexing; and so much more!
  8. README: Reindex! Upgrade SolrJ libraries too (javabin format changed). Read Lucene's and Solr's CHANGES.txt files for all the details.
  9. Analysis: UAX, using ICU; CollationKey; PatternReplaceCharFilter; KeywordMarkerFilterFactory, StemmerOverrideFilterFactory
  10. Standard tokenization: ClassicTokenizer: the old StandardTokenizer; StandardTokenizer: now uses Unicode text segmentation specified by UAX#29; UAX29URLEmailTokenizer; maxTokenLength: default=255
  11. PathHierarchyTokenizer: delimiter: default=/; replace: default=<delimiter>; "/foo/bar" => [/foo] [/foo/bar]
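
The token output on this slide can be imitated in a few lines of Python. This is only a sketch of the behaviour the slide describes, not Solr's implementation: the function name `path_hierarchy_tokens` is made up for illustration, and its `replace` argument is assumed to mirror the slide's `replace` option (defaulting to the delimiter).

```python
def path_hierarchy_tokens(path, delimiter="/", replace=None):
    """Return each ancestor prefix of `path`, shortest first (illustrative only)."""
    if replace is None:
        replace = delimiter  # the slide says replace defaults to the delimiter
    tokens = []
    current = ""
    for i, part in enumerate(path.split(delimiter)):
        if i == 0:
            # Text before the first delimiter; empty when the path starts with one.
            current = part
            if part:
                tokens.append(current)
        else:
            current = current + replace + part
            tokens.append(current)
    return tokens


tokens = path_hierarchy_tokens("/foo/bar")
```

Indexing every ancestor prefix is what lets a query for `/foo` match documents filed anywhere beneath it.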
  12. CollationKeyFilter: A filter that lets one specify: a system collator associated with a locale, or a collator based on custom rules. This can be used for changing sort order for non-English languages as well as to modify the collation sequence for certain languages. You must use the same CollationKeyFilter at both index-time and query-time for correct results. Also, the JVM vendor and version (including patch version) of the slave should be exactly the same as the master (or indexer) for consistent results. also: ICUCollationKeyFilter
  13. ICU: International Components for Unicode; ICUFoldingFilter; ICUNormalizer2Filter: name=nfc|nfkc|nfkc_cf, mode=compose|decompose, filter
  14. ICUFoldingFilter: Accent removal, case folding, canonical duplicates folding, dashes folding, diacritic removal (including stroke, hook, descender), Greek letterforms folding, Han Radical folding, Hebrew Alternates folding, Jamo folding, Letterforms folding, Math symbol folding, Multigraph Expansions: All, Native digit folding, No-break folding, Overline folding, Positional forms folding, Small forms folding, Space folding, Spacing Accents folding, Subscript folding, Superscript folding, Suzhou Numeral folding, Symbol folding, Underline folding, Vertical forms folding, Width folding. Additionally, Default Ignorables are removed, and text is normalized to NFKC. All foldings, case folding, and normalization mappings are applied recursively to ensure a fully folded and normalized result.
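
To get a feel for a few of these foldings, here is a rough approximation using only Python's standard `unicodedata` module. It covers just accent removal, case folding, width folding, and NFKC normalization; it is not ICU and makes no attempt at the rest of the list. The function name `approximate_fold` is hypothetical.

```python
import unicodedata


def approximate_fold(text):
    """Roughly imitate a handful of the foldings listed above (not ICU itself)."""
    # Compatibility decomposition splits accents off and maps fullwidth forms
    # to their ordinary counterparts.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop combining marks (the accents that were split off).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case folding, then recompose to NFKC.
    return unicodedata.normalize("NFKC", stripped.casefold())


folded = approximate_fold("Caf\u00e9")
```

Applying the same folding at index time and query time is what makes `cafe` and `Café` match each other.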
  15. ICUTransformFilter: id: specific transliterator identifier from ICU's Transliterator#getAvailableIDs() (required); direction=forward|reverse. Examples: Traditional-Simplified: => ; Cyrillic-Latin: Российская Федерация => Rossijskaâ Federaciâ
  16. Tom Burton-West's latest: ICU; shingles; query parser; ABC -> [A] [B] [C] or [AB] [BC]...
  17. highlighter: deprecated old config, now configured as a standard search component; FastVectorHighlighter
  18. FastVectorHighlighter: if termVectors="true", termPositions="true", and termOffsets="true", and hl.useFastVectorHighlighter=true; hl.fragListBuilder; hl.fragmentsBuilder
  19. spatial: JTeam's plugin: packaged for easy deployment; Solr trunk capabilities; many distance functions. What's missing? geo faceting? scoring by distance? distance pseudo-field? All units in kilometers, unless otherwise specified.
  20. Spatial field types: Point: n-dimensional, must specify dimension (default=2), represented by N subfields internally; LatLon: latitude,longitude, represented by two subfields internally, single valued only; GeoHash: single string representation of lat/lon
  21. Spatial query parsers: geofilt: exact filtering; bbox: uses (trie) range queries. Parameters: sfield: spatial field; pt: reference point; d: distance
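
As a sketch of how these parameters fit together in a request, the snippet below assembles a query string with Python's standard library. The helper name `spatial_filter_params` is made up; the field name `store` and the reference point are borrowed from the sort-by-function slide later in this deck, and the 5 km radius is an arbitrary illustrative value.

```python
from urllib.parse import urlencode


def spatial_filter_params(parser, sfield, lat, lon, d_km):
    """Build request parameters for a geofilt or bbox filter (illustrative helper)."""
    if parser not in ("geofilt", "bbox"):
        raise ValueError("parser must be 'geofilt' or 'bbox'")
    return {
        "q": "*:*",
        "fq": "{!%s}" % parser,      # the spatial query parser, used as a filter
        "sfield": sfield,            # spatial field
        "pt": "%s,%s" % (lat, lon),  # reference point as lat,lon
        "d": str(d_km),              # distance, in kilometers
    }


params = spatial_filter_params("geofilt", "store", 39.194564, -86.432947, 5)
query_string = urlencode(params)  # append to /select? on a running Solr
```

Swapping `"geofilt"` for `"bbox"` keeps the same parameters but trades exactness for the cheaper range-query check.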
  22. field collapsing/grouping: group=true; group.field / group.func / group.query; rows / start: for groups, not documents; group.limit: number of results per group; group.offset: offset into doclist of each group; sort: how to sort groups, by top document in each group; group.sort: how to sort docs within each group; group.format: grouped | simple; group.main=true|false: backwards compatibility mode?; faceting works as normal; not distributed savvy yet
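
The interplay of sort, rows, and group.limit is easier to see on a toy example. The following is a plain-Python imitation of the semantics described on this slide, not Solr code; the function name, the sample documents, and the result keys (loosely modeled on a grouped response) are all assumptions for illustration.

```python
from collections import OrderedDict


def group_docs(docs, field, group_limit=1, rows=10):
    """Group already-sorted docs by `field`, keeping `group_limit` docs per group.

    `docs` is assumed to arrive best-first (the `sort` order), so groups come
    out ordered by their top document. `rows` limits groups, not documents.
    """
    groups = OrderedDict()
    for doc in docs:
        groups.setdefault(doc[field], []).append(doc)
    result = []
    for value, members in list(groups.items())[:rows]:
        result.append({
            "groupValue": value,
            "numFound": len(members),
            "docs": members[:group_limit],
        })
    return result


sample = [
    {"id": 1, "author": "a", "score": 3.0},
    {"id": 2, "author": "b", "score": 2.5},
    {"id": 3, "author": "a", "score": 2.0},
    {"id": 4, "author": "c", "score": 1.0},
]
grouped = group_docs(sample, "author", group_limit=1, rows=2)
```

Here `rows=2` returns two groups (authors `a` and `b`), each trimmed to its single best document, while `numFound` still reports the full size of each group.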
  23. query parsing: TextField: autoGeneratePhraseQueries="true": if a single string analyzes to multiple tokens
  24. {!raw|term|field f=$f}...: Recall why we needed {!raw} from last year. <fieldType = .../> - use one string, one numeric, (and one text?); <field name="..."/>; table for numeric and for string (and text?): {!raw f=$f} | TermQuery(...); {!term f=$f} | ...; {!field f=$f} | ... Which to use when? {!raw} works for strings just fine, but best to migrate to the generally safer/wiser {!term} for future-proofing.
  25. {!term f=field}: fq={!term f=weight}1.5
  26. dismax: q.op or schema.xml's <solrQueryParser defaultOperator="[AND|OR]"/> defaults mm to 0% (OR) or 100% (AND). #code4lib: issues with non-analyzed fields in qf
  27. edismax: Supports full Lucene query syntax in the absence of syntax errors. Supports "and"/"or" to mean "AND"/"OR" in Lucene syntax mode. When there are syntax errors, improved smart partial escaping of special characters is done to prevent them; in this mode, fielded queries, +/-, and phrase queries are still supported. Improved proximity boosting via word bigrams; this prevents the problem of needing 100% of the words in the document to get any boost, as well as having all of the words in a single field. Advanced stopword handling: stopwords are not required in the mandatory part of the query but are still used (if indexed) in the proximity boosting part. If a query consists of all stopwords (e.g. to be or not to be) then all will be required. Supports the "boost" parameter: like the dismax bf param, but multiplies the function query instead of adding it in. Supports pure negative nested queries, so a query like +foo (-foo) will match all documents.
  28. function queries: termfreq, tf, docfreq, idf, norm, maxdoc, numdocs; {!func}termfreq(text,ipod); standard java.lang.Math functions
  29. faceting: per-segment, single-valued fields: facet.method=fcs (field cache per segment); facet.field={!threads=-1}field_name; threads=0: direct execution; threads=-1: thread per segment. Speeds up single and multivalued method=fc, especially for deep paging with facet.offset. Date faceting improvements, generalized for numeric ranges too. Can now exclude main query: q={!tag=main}the+query&facet.field={!ex=main}category
  30. pivot/grid/matrix/tree faceting: is this also "hierarchical faceting"? It depends!
  31. pivot faceting output: /select?q=*:*&rows=0&facet=on&facet.pivot=cat,popularity,inStock&facet.pivot=popularity,cat
  32. spell checking: DirectSolrSpellChecker: no external index needed, uses automaton on main index
  33. spellcheck config: solrconfig.xml

          <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
            <str name="queryAnalyzerFieldType">textgen</str>
            <!-- a spellchecker that uses no auxiliary index -->
            <lst name="spellchecker">
              <str name="name">default</str>
              <str name="field">spell</str>
              <str name="classname">solr.DirectSolrSpellChecker</str>
              <str name="minPrefix">1</str>
            </lst>
          </searchComponent>
  34. spellcheck handler: solrconfig.xml

          <requestHandler name="standard" class="solr.SearchHandler" default="true">
            <!-- default values for query parameters -->
            <lst name="defaults">
              <str name="echoParams">explicit</str>
              <str name="spellcheck">true</str>
              <str name="spellcheck.collate">true</str>
            </lst>
            <arr name="last-components">
              <str>spellcheck</str>
            </arr>
          </requestHandler>
  35. spellcheck response: http://localhost:8983/solr/select?q=ipud%20bluck&wt=ruby&indent=on

          {
            responseHeader=>{ status=>0, QTime=>10, params=>{ indent=>on, wt=>ruby, q=>ipud bluck}},
            response=>{numFound=>0,start=>0,docs=>[] },
            spellcheck=>{
              suggestions=>[
                ipud,{ numFound=>1, startOffset=>0, endOffset=>4, suggestion=>[ipod]},
                bluck,{ numFound=>1, startOffset=>5, endOffset=>10, suggestion=>[black]},
                collation,ipod black]}}
  36. autosuggest: new "spellcheck" component, builds TST; collates query; can check if collated suggestions yield results, optionally providing hit count
  37. suggest config: solrconfig.xml

          <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
            <str name="queryAnalyzerFieldType">textgen</str>
            <lst name="spellchecker">
              <str name="name">suggest</str>
              <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
              <str name="lookupImpl">org.apache.solr.spelling.suggest.jaspell.JaspellLookup</str>
              <str name="field">suggest</str>
              <str name="buildOnCommit">true</str>
            </lst>
          </searchComponent>

      schema.xml

          <field name="suggest" type="textgen" indexed="true" stored="false"/>
          <copyField source="name" dest="suggest"/>
  38. suggest handler: solrconfig.xml

          <requestHandler class="solr.SearchHandler" name="/suggest">
            <lst name="defaults">
              <str name="spellcheck">true</str>
              <str name="spellcheck.dictionary">suggest</str>
              <str name="spellcheck.collate">true</str>
              <str name="spellcheck.count">10</str>
              <str name="rows">0</str>
              <str name="spellcheck.maxCollationTries">20</str>
              <str name="spellcheck.maxCollations">10</str>
              <str name="spellcheck.collateExtendedResults">true</str>
            </lst>
            <arr name="components">
              <str>query</str> <!-- to allow suggestion hit counts to be returned -->
              <str>spellcheck</str>
            </arr>
          </requestHandler>
  39. suggest response: http://localhost:8983/solr/suggest?q=ip&wt=ruby&indent=on

          {
            responseHeader=>{ status=>0, QTime=>2},
            response=>{numFound=>0,start=>0,docs=>[] },
            spellcheck=>{
              suggestions=>[
                ip,{ numFound=>1, startOffset=>0, endOffset=>2, suggestion=>[ipod]},
                collation,[
                  collationQuery,ipod,
                  hits,3,
                  misspellingsAndCorrections,[ ip,ipod]]]}}
  40. sort by function: &q=*:*&sfield=store&pt=39.194564,-86.432947&sort=geodist() asc; but you still can't get the value of the function back unless you force it to be the score somehow
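
To illustrate what sorting by distance from `pt` amounts to, here is a client-side sketch in Python. It assumes a great-circle (haversine) distance on a sphere of radius 6371 km; the slide does not say how `geodist()` is computed, so treat the formula, the radius, the function names, and the `lat`/`lon` document keys as assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def sort_by_distance(docs, pt):
    """Order docs nearest-first relative to the reference point `pt` (lat, lon)."""
    lat, lon = pt
    return sorted(docs, key=lambda d: haversine_km(lat, lon, d["lat"], d["lon"]))


stores = [
    {"id": "far", "lat": 45.0, "lon": -93.0},
    {"id": "near", "lat": 39.2, "lon": -86.4},
]
nearest_first = sort_by_distance(stores, (39.194564, -86.432947))
```

Computing the distance on the client like this is also one workaround for the limitation noted on the slide, since the sort value itself is not returned.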
  41. clustering component: now works out-of-the-box; all Apache license compatible; supports distributed search
  42. debug=true: debug=true|all|timing|query|results; debug=results&debug.explain.structured=true
  43. structured explain: http://localhost:8983/solr/select?q=title:solr&debug.explain.structured=true&debug=results&wt=ruby&indent=on

          debug=>{
            explain=>{
              doc1=>{
                match=>true, value=>0.076713204,
                description=>fieldWeight(title:solr in 0), product of:,
                details=>[
                  { match=>true, value=>1.0, description=>tf(termFreq(title:solr)=1)},
                  { match=>true, value=>0.30685282, description=>idf(docFreq=1, maxDocs=1)},
                  { match=>true, value=>0.25, description=>fieldNorm(field=title, doc=0)}]}}}}
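
The numbers in this explain output can be checked by hand. The sketch below assumes the classic Lucene TF-IDF similarity formulas, tf = sqrt(termFreq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which the slide itself does not spell out; the fieldNorm value is simply copied from the output.

```python
import math


def classic_tf(term_freq):
    """tf under the assumed classic similarity: square root of the term frequency."""
    return math.sqrt(term_freq)


def classic_idf(doc_freq, max_docs):
    """idf under the assumed classic similarity."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


tf = classic_tf(1)                        # termFreq(title:solr)=1
idf = classic_idf(doc_freq=1, max_docs=1)
field_norm = 0.25                         # taken directly from the explain output
field_weight = tf * idf * field_norm      # "product of:" the three details
```

Under those assumptions the product agrees with the reported fieldWeight to about seven significant digits, the small difference being consistent with single-precision floats in the output.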
  44. SolrCloud: shared/central config and core/shard management via ZooKeeper, built-in load balancing, and infrastructure for future SolrCloud work.
  45. /update/json: solrconfig.xml

          <requestHandler name="/update/json" class="solr.JsonUpdateRequestHandler"/>

      request:

          curl 'http://localhost:8983/solr/update/json?commit=true' \
            -H 'Content-type:application/json' \
            -d '{ "add": { "doc": { "id" : "MyTestDocument", "title" : "This is just a test" } } }'
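
The same request can be prepared from Python with only the standard library. This is a minimal sketch: the helper name `build_add_request` is hypothetical, the base URL is the local example instance used throughout these slides, and the request is only constructed here, not sent.

```python
import json
from urllib.request import Request


def build_add_request(doc, base_url="http://localhost:8983/solr"):
    """Prepare (but do not send) a JSON add-and-commit request for one document."""
    body = json.dumps({"add": {"doc": doc}}).encode("utf-8")
    return Request(
        base_url + "/update/json?commit=true",
        data=body,
        headers={"Content-type": "application/json"},
        method="POST",
    )


request = build_add_request({"id": "MyTestDocument", "title": "This is just a test"})
# urllib.request.urlopen(request) would submit it to a running Solr instance.
```

Building the body with `json.dumps` avoids the shell-quoting pitfalls of passing raw JSON to `curl -d`.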
  46. wt=csv: Writes only docs (no response header or response extras) in CSV format. Roundtrippable with /update/csv provided all fields are stored.
  47. UIMA: Unstructured Information Management Architecture; update processor chain, augmenting incoming documents from a UIMA annotator pipeline
  48. (solr|lucene)-dev: ant [idea|eclipse]; go!
  49. works in progress: some interesting open issues (with patches): PayloadTermQuery; XMLQueryParser plugin; join
  50. {!join from=$f to=$t}: insert <what Yonik said>; SOLR-2272
  51. Lucid (imagination): What's Lucid done for you lately? Yonik, Mark, Grant, Hoss: Lucene and Solr performance, faceting, grouping, join query, spatial, Mahout, ORP, PMC, etc, etc, etc. Other technical staff involved in mailing list assistance, bug reporting, contributing patches (hi Lance, Erick, Jay, Tom, Grijesh, Tomas....). Extended dismax, join, faceting performance improvements. LucidWorks Enterprise.
  52. Hoss Simplicity
  53. LucidWorks Enterprise: "lucid" query parser; REST API; click boosting; data sources, crawlers, and scheduling; tunable norms, per-field; alerts; role filtering; administrative UI
  54. Community Questions: fire away!
  55. resources: duh!:<your query>
  56. Q&A: faceting: why is paging through facets the way it is? short-circuits on enum
  57. Community:
      - The state of Extended DisMax, and what Lucene features remain incompatible with it.
      - Any developments on faceting (I've implemented the standard workaround to the "unknown facet list size" problem... but I'd still love to be able to know exactly how long the lists are)
      - Hierarchical documents in Solr -- I haven't followed the conversations closely, but I gather that this topic is gaining some momentum in the Solr community.
  58. contact info: erik.hatcher @ lucidimagination . com; webinars, documentation; LucidFind: search mailing list posts, wiki pages, web sites, our blog, etc for latest Lucene/Solr assistance
  59. re: code4lib