code4lib 2011 preconference: What's New in Solr (since 1.4.1)
 


code4lib 2011 preconference, presented by Erik Hatcher of Lucid Imagination.

Abstract: The library world is fired up about Solr. Practically every next-gen catalog is using it (via Blacklight, VuFind, or other technologies). Solr has continued improving in some dramatic ways, including geospatial support, field collapsing/grouping, extended dismax query parsing, pivot/grid/matrix/tree faceting, autosuggest, and more. This session will cover all of these new features, showcasing live examples of them all, including anything new that is implemented prior to the conference.


Presentation Transcript

  • What's New in Solr? code4lib 2011 preconference, Bloomington, IN. Presented by Erik Hatcher of Lucid Imagination.
  • about me: spoken at several code4lib conferences (keynoted Athens '07 along with the pioneering Solr preconference; Providence '09, "Rising Sun"; pre-conferenced Asheville '10, "Solr Black Belt"); co-authored "Lucene in Action", first edition; ghost/toast on second edition; Lucene and Solr committer; library world claims to fame are founding and naming Blacklight, original developer on Collex and the Rossetti Archive search; now at Lucid Imagination, dedicated to Lucene/Solr support/services/training/etc.
  • abstract: The library world is fired up about Solr. Practically every next-gen catalog is using it (via Blacklight, VuFind, or other technologies). Solr has continued improving in some dramatic ways, including geospatial support, field collapsing/grouping, extended dismax query parsing, pivot/grid/matrix/tree faceting, autosuggest, and more. This session will cover all of these new features, showcasing live examples of them all, including anything new that is implemented prior to the conference.
  • LIA2 - Lucene in Action: Published: July 2010 - http://www.manning.com/lucene/ New in this second edition: performing hot backups; using numeric fields; tuning for indexing or searching speed; boosting matches with payloads; creating reusable analyzers; adding concurrency with threads; four new case studies; and more.
  • Version Number: Which one ya talking 'bout, Willis? 3.1? 4.0?? TRUNK?? Playing with fire: index format changes to be expected; reindexing recommended/required. Solr/Lucene merged development codebases; releases should occur lock-step moving forward.
  • dependencies: November 2009: Solr 1.4 (Lucene 2.9.1); June 2010: Solr 1.4.1 (Lucene 2.9.3); Spring 2011(?): Solr 3.1 (Lucene 3.1); TRUNK: Solr 4.x (Lucene TRUNK)
  • lucene: per-segment field cache, etc.; Unicode and analysis improvements throughout; analysis "attributes"; AutomatonQuery: RegexpQuery, WildcardQuery; flexible indexing; and so much more!
  • README: Reindex! Upgrade SolrJ libraries too (javabin format changed). Read Lucene and Solr's CHANGES.txt files for all the details.
  • Analysis: UAX, using ICU; CollationKey; PatternReplaceCharFilter; KeywordMarkerFilterFactory, StemmerOverrideFilterFactory
  • Standard tokenization: ClassicTokenizer: old StandardTokenizer; StandardTokenizer: now uses Unicode text segmentation specified by UAX#29; UAX29URLEmailTokenizer; maxTokenLength: default=255
  • PathHierarchyTokenizer: delimiter: default=/; replace: default=<delimiter>; "/foo/bar" => [/foo] [/foo/bar]
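The "/foo/bar" => [/foo] [/foo/bar] behavior on this slide can be mimicked in a few lines of Python. This is only an illustrative emulation, not Lucene's implementation: the function name `path_hierarchy_tokens` and its arguments are made up for the sketch, and it assumes a single-character delimiter.

```python
def path_hierarchy_tokens(path, delimiter="/", replacement=None):
    """Emit one token per path prefix, mimicking the slide's example.

    Hypothetical helper for illustration only; assumes a one-character delimiter.
    """
    if replacement is None:
        replacement = delimiter  # the slide's "replace: default=<delimiter>"
    tokens = []
    for i, ch in enumerate(path):
        # Each delimiter after the first character closes a prefix.
        if ch == delimiter and i > 0:
            tokens.append(path[:i].replace(delimiter, replacement))
    if path:
        # The full path is always the last token.
        tokens.append(path.replace(delimiter, replacement))
    return tokens


print(path_hierarchy_tokens("/foo/bar"))  # ['/foo', '/foo/bar']
```

Indexing every prefix this way is what lets a filter on "/foo" match documents filed under "/foo/bar".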
  • CollationKeyFilter: A filter that lets one specify: a system collator associated with a locale, or a collator based on custom rules. This can be used for changing sort order for non-English languages as well as to modify the collation sequence for certain languages. You must use the same CollationKeyFilter at both index-time and query-time for correct results. Also, the JVM vendor and version (including patch version) of the slave should be exactly the same as the master (or indexer) for consistent results. http://wiki.apache.org/solr/UnicodeCollation see also: ICUCollationKeyFilter
  • ICU: International Components for Unicode; ICUFoldingFilter; ICUNormalizer2Filter: name=nfc|nfkc|nfkc_cf, mode=compose|decompose, filter
  • ICUFoldingFilter: Accent removal, case folding, canonical duplicates folding, dashes folding, diacritic removal (including stroke, hook, descender), Greek letterforms folding, Han Radical folding, Hebrew Alternates folding, Jamo folding, Letterforms folding, Math symbol folding, Multigraph Expansions: All, Native digit folding, No-break folding, Overline folding, Positional forms folding, Small forms folding, Space folding, Spacing Accents folding, Subscript folding, Superscript folding, Suzhou Numeral folding, Symbol folding, Underline folding, Vertical forms folding, Width folding. Additionally, Default Ignorables are removed, and text is normalized to NFKC. All foldings, case folding, and normalization mappings are applied recursively to ensure a fully folded and normalized result.
  • ICUTransformFilter: id: specific transliterator identifier from ICU's Transliterator#getAvailableIDs() (required); direction=forward|reverse. Examples: Traditional-Simplified: => Cyrillic-Latin: Российская Федерация => Rossijskaâ Federaciâ
  • Tom Burton-West's latest: ICU; shingles; query parser; ABC -> [A] [B] [C] or [AB] [BC]...
  • highlighter: deprecated old config, now config as standard search component; FastVectorHighlighter
  • FastVectorHighlighter: if termVectors="true", termPositions="true", and termOffsets="true" and hl.useFastVectorHighlighter=true; hl.fragListBuilder; hl.fragmentsBuilder
  • spatial: JTeam's plugin: packaged for easy deployment. Solr trunk capabilities: many distance functions. What's missing? geo faceting? scoring by distance? distance pseudo-field? All units in kilometers, unless otherwise specified.
  • Spatial field types: Point: n-dimensional, must specify dimension (default=2), represented by N subfields internally; LatLon: latitude,longitude, represented by two subfields internally, single valued only; GeoHash: single string representation of lat/lon
  • Spatial query parsers: geofilt: exact filtering; bbox: uses (trie) range queries. Parameters: sfield: spatial field; pt: reference point; d: distance
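As a rough sketch of how the sfield, pt and d parameters fit together, the Python below assembles a request query string for either parser. The helper name `spatial_filter_params` is hypothetical, the field name `store` and the reference point are borrowed from the sort-by-function slide later in this deck, and the 5 km distance is an arbitrary assumption. Nothing is sent to a server.

```python
from urllib.parse import urlencode


def spatial_filter_params(parser, sfield, pt, d):
    """Build request parameters for a geofilt or bbox filter query (illustrative helper)."""
    if parser not in ("geofilt", "bbox"):
        raise ValueError("parser must be 'geofilt' or 'bbox'")
    return {
        "q": "*:*",
        "fq": "{!%s}" % parser,  # the parser picks up sfield, pt and d from the request
        "sfield": sfield,        # spatial field
        "pt": pt,                # reference point as "lat,lon"
        "d": str(d),             # distance, in kilometers per the slides
    }


# Assumed values: field "store", a point in Bloomington, 5 km radius.
params = spatial_filter_params("geofilt", "store", "39.194564,-86.432947", 5)
query_string = urlencode(params)
print(query_string)
```

Swapping "geofilt" for "bbox" keeps the same three parameters and trades exactness for the cheaper range-query approach the slide mentions.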
  • field collapsing/grouping: backwards compatibility mode? http://wiki.apache.org/solr/FieldCollapsing Parameters: group=true; group.field / group.func / group.query; rows / start: for groups, not documents; group.limit: number of results per group; group.offset: offset into doclist of each group; sort: how to sort groups, by top document in each group; group.sort: how to sort docs within each group; group.format: grouped | simple; group.main=true|false. Faceting works as normal; not distributed savvy yet.
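To make the split between group-level and document-level parameters concrete, here is a small Python sketch that assembles a grouping request. The helper `grouping_params` is hypothetical, and the field names `manu_exact` and `price` are assumptions for illustration, not something the slides specify.

```python
from urllib.parse import urlencode


def grouping_params(query, field, limit=1, rows=10, start=0, group_sort=None):
    """Assemble field-collapsing parameters as (name, value) pairs (illustrative helper)."""
    params = [
        ("q", query),
        ("group", "true"),
        ("group.field", field),
        ("group.limit", str(limit)),  # documents returned per group
        ("rows", str(rows)),          # number of groups, not documents
        ("start", str(start)),        # offset into the list of groups
    ]
    if group_sort:
        params.append(("group.sort", group_sort))  # ordering inside each group
    return params


# Assumed field names: manu_exact and price.
params = grouping_params("*:*", "manu_exact", limit=3, group_sort="price asc")
print(urlencode(params))
```

Keeping the parameters as a list of pairs rather than a dict leaves room for repeated parameters such as several group.query values.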
  • query parsing: TextField: autoGeneratePhraseQueries="true"; if single string analyzes to multiple tokens
  • {!raw|term|field f=$f}...: Recall why we needed {!raw} from last year. <fieldType = .../> - use one string, one numeric, (and one text?); <field name="..."/>; table for numeric and for string (and text?): {!raw f=$f} | TermQuery(...); {!term f=$f} | ...; {!field f=$f} | ... Which to use when? {!raw} works for strings just fine, but best to migrate to the generally safer/wiser {!term} for future-proofing.
  • {!term f=field}: fq={!term f=weight}1.5
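A tiny Python sketch of composing such a filter follows. The helper name `term_filter` is made up, and it assumes the field name contains no spaces or special characters; the value is passed through untouched, which is the attraction of {!term} over hand-escaping query syntax.

```python
from urllib.parse import quote


def term_filter(field, value):
    """Return an fq value that uses the {!term} query parser (illustrative helper)."""
    # Assumes a plain field name; the value is not escaped for query syntax.
    return "{!term f=%s}%s" % (field, value)


fq = term_filter("weight", "1.5")
print(fq)                 # {!term f=weight}1.5
print("fq=" + quote(fq))  # percent-encoded form for use in a URL
```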
  • dismax: q.op or schema.xml's <solrQueryParser defaultOperator="[AND|OR]"/> defaults mm to 0% (OR) or 100% (AND). #code4lib: issues with non-analyzed fields in qf
  • edismax: Supports full lucene query syntax in the absence of syntax errors. Supports "and"/"or" to mean "AND"/"OR" in lucene syntax mode. When there are syntax errors, improved smart partial escaping of special characters is done to prevent them... in this mode, fielded queries, +/-, and phrase queries are still supported. Improved proximity boosting via word bigrams... this prevents the problem of needing 100% of the words in the document to get any boost, as well as having all of the words in a single field. Advanced stopword handling... stopwords are not required in the mandatory part of the query but are still used (if indexed) in the proximity boosting part. If a query consists of all stopwords (e.g. to be or not to be) then all will be required. Supports the "boost" parameter... like the dismax bf param, but multiplies the function query instead of adding it in. Supports pure negative nested queries... so a query like +foo (-foo) will match all documents.
  • function queries: termfreq, tf, docfreq, idf, norm, maxdoc, numdocs; {!func}termfreq(text,ipod); standard java.lang.Math functions
  • faceting: per-segment, single-valued fields: facet.method=fcs (field cache per segment); facet.field={!threads=-1}field_name; threads=0: direct execution; threads=-1: thread per segment. Speeds up single and multivalued method=fc, especially for deep paging with facet.offset. Date faceting improvements, generalized for numeric ranges too. Can now exclude main query: q={!tag=main}the+query&facet.field={!ex=main}category
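The tag/exclude pairing at the end of this slide can be sketched in Python as below. The helper `tagged_query_params` is hypothetical; the tag name `main`, the query text and the `category` field come straight from the slide's example.

```python
from urllib.parse import urlencode


def tagged_query_params(user_query, facet_field, tag="main"):
    """Tag the main query and exclude that tag when faceting (illustrative helper)."""
    return [
        ("q", "{!tag=%s}%s" % (tag, user_query)),             # label the main query
        ("facet", "on"),
        ("facet.field", "{!ex=%s}%s" % (tag, facet_field)),   # facet as if it were absent
    ]


query_string = urlencode(tagged_query_params("the query", "category"))
print(query_string)
```

The same tag must appear on both sides; a mismatch simply means nothing is excluded and the facet counts reflect the full query.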
  • pivot/grid/matrix/tree faceting: is this also "hierarchical faceting"? it depends!
  • pivot faceting output: /select?q=*:*&rows=0&facet=on&facet.pivot=cat,popularity,inStock&facet.pivot=popularity,cat
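Because facet.pivot can be repeated, a list of pairs is a convenient way to build the request. The Python sketch below reproduces the slide's URL; the helper name `pivot_params` is an invention for this example.

```python
from urllib.parse import urlencode


def pivot_params(*pivots):
    """Build a match-all, zero-row request with one facet.pivot per field list."""
    params = [("q", "*:*"), ("rows", "0"), ("facet", "on")]
    for fields in pivots:
        # Each pivot is a comma-separated list of fields, outermost first.
        params.append(("facet.pivot", ",".join(fields)))
    return params


# safe="*:," keeps the URL readable instead of percent-encoding those characters.
url = "/select?" + urlencode(
    pivot_params(["cat", "popularity", "inStock"], ["popularity", "cat"]),
    safe="*:,",
)
print(url)
```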
  • spell checking: DirectSolrSpellChecker: no external index needed, uses automaton on main index
  • spellcheck config (solrconfig.xml):
      <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
        <str name="queryAnalyzerFieldType">textgen</str>
        <!-- a spellchecker that uses no auxiliary index -->
        <lst name="spellchecker">
          <str name="name">default</str>
          <str name="field">spell</str>
          <str name="classname">solr.DirectSolrSpellChecker</str>
          <str name="minPrefix">1</str>
        </lst>
      </searchComponent>
  • spellcheck handler (solrconfig.xml):
      <requestHandler name="standard" class="solr.SearchHandler" default="true">
        <!-- default values for query parameters -->
        <lst name="defaults">
          <str name="echoParams">explicit</str>
          <str name="spellcheck">true</str>
          <str name="spellcheck.collate">true</str>
        </lst>
        <arr name="last-components">
          <str>spellcheck</str>
        </arr>
      </requestHandler>
  • spellcheck response: http://localhost:8983/solr/select?q=ipud%20bluck&wt=ruby&indent=on
      { 'responseHeader'=>{ 'status'=>0, 'QTime'=>10,
          'params'=>{ 'indent'=>'on', 'wt'=>'ruby', 'q'=>'ipud bluck'}},
        'response'=>{'numFound'=>0,'start'=>0,'docs'=>[] },
        'spellcheck'=>{
          'suggestions'=>[
            'ipud',{ 'numFound'=>1, 'startOffset'=>0, 'endOffset'=>4, 'suggestion'=>['ipod']},
            'bluck',{ 'numFound'=>1, 'startOffset'=>5, 'endOffset'=>10, 'suggestion'=>['black']},
            'collation','ipod black']}}
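The suggestions value in this response is a flat list that alternates names and values, which is awkward to consume directly. A minimal Python sketch of pairing it up follows; the data is transcribed from the slide, the function name `parse_spellcheck` is hypothetical, and it assumes the simple case shown here where the collation is a plain string.

```python
def parse_spellcheck(suggestions):
    """Split the alternating name/value list into corrections and a collation."""
    corrections = {}
    collation = None
    # Walk the list two items at a time: name, then value.
    for name, value in zip(suggestions[0::2], suggestions[1::2]):
        if name == "collation":
            collation = value  # assumed to be a plain string, as on the slide
        else:
            corrections[name] = value["suggestion"]
    return corrections, collation


suggestions = [
    "ipud", {"numFound": 1, "startOffset": 0, "endOffset": 4, "suggestion": ["ipod"]},
    "bluck", {"numFound": 1, "startOffset": 5, "endOffset": 10, "suggestion": ["black"]},
    "collation", "ipod black",
]
corrections, collation = parse_spellcheck(suggestions)
print(corrections, collation)
```

One caveat of this naive pairing: a misspelled query term that is literally "collation" would be mistaken for the collation entry.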
  • autosuggest: new "spellcheck" component, builds TST; collates query; can check if collated suggestions yield results, optionally, providing hit count
  • suggest config (solrconfig.xml):
      <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
        <str name="queryAnalyzerFieldType">textgen</str>
        <lst name="spellchecker">
          <str name="name">suggest</str>
          <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
          <str name="lookupImpl">org.apache.solr.spelling.suggest.jaspell.JaspellLookup</str>
          <str name="field">suggest</str>
          <str name="buildOnCommit">true</str>
        </lst>
      </searchComponent>
    schema.xml:
      <field name="suggest" type="textgen" indexed="true" stored="false"/>
      <copyField source="name" dest="suggest"/>
  • suggest handler (solrconfig.xml):
      <requestHandler class="solr.SearchHandler" name="/suggest">
        <lst name="defaults">
          <str name="spellcheck">true</str>
          <str name="spellcheck.dictionary">suggest</str>
          <str name="spellcheck.collate">true</str>
          <str name="spellcheck.count">10</str>
          <str name="rows">0</str>
          <str name="spellcheck.maxCollationTries">20</str>
          <str name="spellcheck.maxCollations">10</str>
          <str name="spellcheck.collateExtendedResults">true</str>
        </lst>
        <arr name="components">
          <str>query</str> <!-- to allow suggestion hit counts to be returned -->
          <str>spellcheck</str>
        </arr>
      </requestHandler>
  • suggest response: http://localhost:8983/solr/suggest?q=ip&wt=ruby&indent=on
      { 'responseHeader'=>{ 'status'=>0, 'QTime'=>2},
        'response'=>{'numFound'=>0,'start'=>0,'docs'=>[] },
        'spellcheck'=>{
          'suggestions'=>[
            'ip',{ 'numFound'=>1, 'startOffset'=>0, 'endOffset'=>2, 'suggestion'=>['ipod']},
            'collation',[
              'collationQuery','ipod',
              'hits',3,
              'misspellingsAndCorrections',[ 'ip','ipod']]]}}
  • sort by function: &q=*:*&sfield=store&pt=39.194564,-86.432947&sort=geodist() asc; but still can't get value of function back unless you force it to be the score somehow
  • clustering component: now works out-of-the-box; all Apache license compatible; supports distributed search
  • debug=true: debug=true|all|timing|query|results; debug=results&debug.explain.structured=true
  • structured explain: http://localhost:8983/solr/select?q=title:solr&debug.explain.structured=true&debug=results&wt=ruby&indent=on
      'debug'=>{
        'explain'=>{
          'doc1'=>{
            'match'=>true, 'value'=>0.076713204,
            'description'=>'fieldWeight(title:solr in 0), product of:',
            'details'=>[
              { 'match'=>true, 'value'=>1.0, 'description'=>'tf(termFreq(title:solr)=1)'},
              { 'match'=>true, 'value'=>0.30685282, 'description'=>'idf(docFreq=1, maxDocs=1)'},
              { 'match'=>true, 'value'=>0.25, 'description'=>'fieldNorm(field=title, doc=0)'}]}}}}
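One benefit of the structured form is that the arithmetic can be checked mechanically. The Python sketch below multiplies the three detail values from this slide and compares the result with the reported score. It also recomputes the idf figure using 1 + ln(maxDocs / (docFreq + 1)), which is my understanding of Lucene's default similarity at the time; treat that formula as an assumption rather than something the slides state. The function names are made up for the example.

```python
import math

# Values transcribed from the slide's explain output.
explain = {
    "value": 0.076713204,
    "description": "fieldWeight(title:solr in 0), product of:",
    "details": [
        {"value": 1.0, "description": "tf(termFreq(title:solr)=1)"},
        {"value": 0.30685282, "description": "idf(docFreq=1, maxDocs=1)"},
        {"value": 0.25, "description": "fieldNorm(field=title, doc=0)"},
    ],
}


def product_of_details(node):
    """Multiply the values of a node's direct children."""
    result = 1.0
    for detail in node["details"]:
        result *= detail["value"]
    return result


def classic_idf(doc_freq, max_docs):
    """Assumed default-similarity idf: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


print(product_of_details(explain), explain["value"])
print(classic_idf(1, 1))
```

The tiny difference between the computed product and the reported value is single-precision rounding on the server side.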
  • SolrCloud: shared/central config and core/shard management via ZooKeeper, built-in load balancing, and infrastructure for future SolrCloud work.
  • /update/json (solrconfig.xml):
      <requestHandler name="/update/json" class="solr.JsonUpdateRequestHandler"/>
    curl 'http://localhost:8983/solr/update/json?commit=true' -H 'Content-type:application/json' -d '{ "add": { "doc": { "id" : "MyTestDocument", "title" : "This is just a test" } } }'
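Hand-writing the JSON body on a command line is error-prone, so here is a small Python sketch that produces the same payload with the standard json module. The helper `add_command` is hypothetical and the sketch only builds the request body; it does not contact a server.

```python
import json


def add_command(doc):
    """Wrap a document in the "add" envelope shown on the slide."""
    return json.dumps({"add": {"doc": doc}})


payload = add_command({"id": "MyTestDocument", "title": "This is just a test"})
print(payload)
```

The resulting string is what would be sent as the body of the POST to /update/json with a Content-type of application/json.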
  • wt=csv: Writes only docs (no response header or response extras) in CSV format. Roundtrippable with /update/csv provided all fields are stored.
  • UIMA: Unstructured Information Management Architecture: http://uima.apache.org/ New update processor chain, augmenting incoming documents from a UIMA annotator pipeline: http://wiki.apache.org/solr/SolrUIMA
  • (solr|lucene)-dev: ant [idea|eclipse]; go! http://wiki.apache.org/solr/HowToContribute
  • works in progress: some interesting open issues (with patches): PayloadTermQuery; XMLQueryParser plugin; join
  • {!join from=$f to=$t}: insert <what Yonik said>; https://issues.apache.org/jira/browse/SOLR-2272
  • Lucid (imagination): What's Lucid done for you lately? Yonik, Mark, Grant, Hoss: Lucene and Solr performance, faceting, grouping, join query, spatial, Mahout, ORP, PMC, etc, etc, etc. Other technical staff involved in mailing list assistance, bug reporting, contributing patches (hi Lance, Erick, Jay, Tom, Grijesh, Tomas....). Extended dismax, join, faceting performance improvements. LucidWorks Enterprise.
  • Hoss Simplicity: http://www.lucidimagination.com/blog/2011/01/21/solr-powered-isfdb-part1/ http://www.lucidimagination.com/blog/2011/01/28/solr-powered-isfdb-part-2/
  • LucidWorks Enterprise: "lucid" query parser; REST API; click boosting; data sources, crawlers, and scheduling; tunable norms, per-field; alerts; role filtering; administrative UI. http://www.lucidimagination.com/enterprise-search-solutions/lucidworks
  • Community Questions: fire away!
  • resources: duh!: #code4lib; lucene.apache.org/solr; search.lucidimagination.com/?q=<your query>
  • Q&A: faceting: why is paging through facets the way it is? short-circuits on enum
  • Community: - The state of Extended DisMax, and what Lucene features remain incompatible with it. - Any developments on faceting (I've implemented the standard workaround to the "unknown facet list size" problem... but I'd still love to be able to know exactly how long the lists are) - Hierarchical documents in Solr -- I haven't followed the conversations closely, but I gather that this topic is gaining some momentum in the Solr community.
  • contact info: erik.hatcher @ lucidimagination . com; http://www.lucidimagination.com: webinars, documentation; LucidFind: search.lucidimagination.com: search mailing list posts, wiki pages, web sites, our blog, etc. for latest Lucene/Solr assistance
  • re: code4lib