What's New in Solr?


        code4lib 2011 preconference
               Bloomington, IN
presented by Erik Hatcher of Lucid Imagination
about me
spoken at several code4lib conferences

   Keynoted Athens '07 along with the pioneering Solr preconference,

   Providence '09, "Rising Sun"

   pre-conferenced Asheville '10, "Solr Black Belt"

co-authored "Lucene in Action", first edition; ghost/toast on second edition

Lucene and Solr committer.

library-world claims to fame: founded and named Blacklight; original developer on
Collex and the Rossetti Archive search

now at Lucid Imagination, dedicated to Lucene/Solr support/services/training/etc
abstract
The library world is fired up about Solr. Practically every next-gen catalog is
using it (via Blacklight, VuFind, or other technologies). Solr has continued
improving in some dramatic ways, including geospatial support, field
collapsing/grouping, extended dismax query parsing, pivot/grid/matrix/tree
faceting, autosuggest, and more. This session will cover all of these new
features, showcasing live examples of them all, including anything new that is
implemented prior to the conference.
LIA2 - Lucene in Action
Published: July 2010 - http://www.manning.com/lucene/
New in this second edition:
   Performing hot backups
   Using numeric fields
   Tuning for indexing or searching speed
   Boosting matches with payloads
   Creating reusable analyzers
   Adding concurrency with threads
   Four new case studies, and more
Version Number
Which one ya talking 'bout, Willis?

  3.1? 4.0?? TRUNK??

playing with fire

  index format changes to be expected

     reindexing recommended/required

Solr/Lucene merged development codebases

  releases should occur lock-step moving forward
dependencies

November 2009: Solr 1.4 (Lucene 2.9.1)

June 2010: Solr 1.4.1 (Lucene 2.9.3)

Spring 2011(?): Solr 3.1 (Lucene 3.1)

TRUNK: Solr 4.x (Lucene TRUNK)
lucene
per-segment field cache, etc

Unicode and analysis improvements throughout

Analysis "attributes"

AutomatonQuery: RegexpQuery, WildcardQuery

flexible indexing

and so much more!
README

Reindex!

Upgrade SolrJ libraries too (javabin format
changed)

Read Lucene and Solr's CHANGES.txt files for all
the details
Analysis

UAX, using ICU

CollationKey

PatternReplaceCharFilter

KeywordMarkerFilterFactory,
StemmerOverrideFilterFactory
Standard tokenization

ClassicTokenizer: old StandardTokenizer

StandardTokenizer: now uses Unicode text
segmentation specified by UAX#29

UAX29URLEmailTokenizer

maxTokenLength: default=255
PathHierarchyTokenizer


delimiter: default=/

replace: default=<delimiter>

"/foo/bar" => [/foo] [/foo/bar]
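A toy sketch (not Solr's implementation) of what PathHierarchyTokenizer emits: each prefix of the path becomes its own token, with the delimiter optionally swapped for a replacement character.

```python
def path_hierarchy_tokens(path, delimiter="/", replace=None):
    # Emit one token per path prefix, e.g. "/foo/bar" -> [/foo] [/foo/bar].
    # `replace` (default: the delimiter itself) substitutes the delimiter
    # in the emitted tokens.
    replace = delimiter if replace is None else replace
    parts = [p for p in path.split(delimiter) if p]
    tokens, prefix = [], ""
    for part in parts:
        prefix += replace + part
        tokens.append(prefix)
    return tokens
```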
CollationKeyFilter
A filter that lets one specify:

   A system collator associated with a locale, or

   A collator based on custom rules

This can be used to change sort order for non-English languages, as well as to
modify the collation sequence for certain languages. You must use the same
CollationKeyFilter at both index time and query time for correct results. Also,
the JVM vendor and version (including patch version) on the slave should exactly
match the master (or indexer) for consistent results.

http://wiki.apache.org/solr/UnicodeCollation

see also: ICUCollationKeyFilter
ICU
International Components for Unicode

ICUFoldingFilter

ICUNormalizer2Filter

  name=nfc|nfkc|nfkc_cf

  mode=compose|decompose

  filter
ICUFoldingFilter
Accent removal, case folding, canonical duplicates folding, dashes folding,
diacritic removal (including stroke, hook, descender), Greek letterforms
folding, Han Radical folding, Hebrew Alternates folding, Jamo folding,
Letterforms folding, Math symbol folding, Multigraph Expansions: All, Native
digit folding, No-break folding, Overline folding, Positional forms folding,
Small forms folding, Space folding, Spacing Accents folding, Subscript folding,
Superscript folding, Suzhou Numeral folding, Symbol folding, Underline folding,
Vertical forms folding, Width folding

Additionally, Default Ignorables are removed, and text is normalized to NFKC.

 All foldings, case folding, and normalization mappings are applied recursively
to ensure a fully folded and normalized result.
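A rough stdlib-only approximation (not ICU, and far from the full folding list above) of a few of these steps: compatibility decomposition, diacritic removal, case folding, then recomposition.

```python
import unicodedata

def approx_fold(text):
    # Compatibility-decompose (handles width/ligature folding), drop
    # combining marks (diacritic removal), case-fold, recompose via NFKC.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    return unicodedata.normalize("NFKC", stripped.casefold())
```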
ICUTransformFilter
id: specific transliterator identifier from ICU's
Transliterator#getAvailableIDs() (required)

direction=forward|reverse

Examples:

  Traditional-Simplified:         =>

  Cyrillic-Latin: Российская Федерация =>
  Rossijskaâ Federaciâ
Tom Burton-West's
latest

ICU

shingles

query parser

ABC -> [A] [B] [C] or [AB] [BC]...
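The shingle idea above, sketched in Python: adjacent tokens are joined into overlapping n-grams (bigrams by default), so ABC yields [AB] [BC].

```python
def shingles(tokens, size=2):
    # Word n-gram "shingles": overlapping windows of `size` tokens.
    return [" ".join(tokens[i:i + size])
            for i in range(len(tokens) - size + 1)]
```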
highlighter


old highlighter configuration deprecated; now configured as a standard
search component

FastVectorHighlighter
FastVectorHighlighter

if termVectors="true", termPositions="true", and
termOffsets="true"

and hl.useFastVectorHighlighter=true

  hl.fragListBuilder

  hl.fragmentsBuilder
spatial
JTeam's plugin: packaged for easy deployment

Solr trunk capabilities

many distance functions

What's missing?

  geo faceting? scoring by distance? distance
  pseudo-field?

All units in kilometers, unless otherwise specified
Spatial field types

Point: n-dimensional, must specify dimension
(default=2), represented by N subfields internally

LatLon: latitude,longitude, represented by two
subfields internally, single valued only

GeoHash: single string representation of lat/lon
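A minimal sketch of the GeoHash encoding idea: lat/lon are range-bisected, the bits interleaved (longitude first), and packed five at a time into base-32 characters, yielding a single sortable string.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash base-32 alphabet

def geohash(lat, lon, precision=12):
    # Interleave longitude/latitude bisection bits, 5 bits per character.
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    out, bits, ch, even = [], 0, 0, True
    while len(out) < precision:
        rng, val = (lon_rng, lon) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        ch <<= 1
        if val >= mid:
            ch |= 1
            rng[0] = mid
        else:
            rng[1] = mid
        even = not even
        bits += 1
        if bits == 5:
            out.append(BASE32[ch])
            bits, ch = 0, 0
    return "".join(out)
```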
Spatial query parsers
geofilt: exact filtering

bbox: uses (trie) range queries

Parameters:

  sfield: spatial field

  pt: reference point

  d: distance
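A hedged sketch of the kind of computation behind geofilt's exact filtering: great-circle (haversine) distance in kilometers from the reference point `pt`, keeping docs within `d`. The `docs`/`sfield` data shapes here are made up for illustration.

```python
import math

def geodist_km(lat1, lon1, lat2, lon2):
    # Haversine great-circle distance, kilometers.
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def geofilt(docs, sfield, pt, d):
    # pt = (lat, lon) reference point; d = distance in km.
    return [doc for doc in docs if geodist_km(*pt, *doc[sfield]) <= d]
```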
field collapsing/grouping
backwards compatibility mode?

http://wiki.apache.org/solr/FieldCollapsing

group=true

group.field / group.func / group.query

rows / start: for groups, not documents

group.limit: number of results per group

group.offset: offset into doclist of each group

sort: how to sort groups, by top document in each group

group.sort: how to sort docs within each group

group.format: grouped | simple

group.main=true|false

faceting works as normal

not distributed savvy yet
query parsing


TextField: autoGeneratePhraseQueries="true"

  if single string analyzes to multiple tokens
{!raw|term|field f=$f}...
Recall why we needed {!raw} from last year

<fieldType .../> - use one string, one numeric, (and one text?)

<field name="..."/>

table for numeric and for string (and text?):

    {!raw f=$f} | TermQuery(...)

    {!term f=$f} | ...

    {!field f=$f} | ...

Which to use when? {!raw} works for strings just fine, but best to migrate to the generally
safer/wiser {!term} for future-proofing.
{!term f=field}


fq={!term f=weight}1.5
dismax

q.op or schema.xml's <solrQueryParser
defaultOperator="[AND|OR]"/> defaults mm to 0%
(OR) or 100% (AND)

#code4lib: issues with non-analyzed fields in qf
edismax
Supports full Lucene query syntax in the absence of syntax errors


supports "and"/"or" to mean "AND"/"OR" in lucene syntax mode


When there are syntax errors, improved smart partial escaping of special characters is done to prevent
them... in this mode, fielded queries, +/-, and phrase queries are still supported.


Improved proximity boosting via word bigrams... this prevents the problem of needing 100% of the words in
the document to get any boost, as well as having all of the words in a single field.


advanced stopword handling... stopwords are not required in the mandatory part of the query but are still
used (if indexed) in the proximity boosting part. If a query consists of all stopwords (e.g. to be or not to be)
then all will be required.


Supports the "boost" parameter.. like the dismax bf param, but multiplies the function query instead of
adding it in


Supports pure negative nested queries... so a query like +foo (-foo) will match all documents
function queries


termfreq, tf, docfreq, idf, norm, maxdoc, numdocs

{!func}termfreq(text,ipod)

standard java.util.Math functions
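A toy model of what the relevance functions above expose, over docs represented as token lists (a sketch of the concepts, not Lucene's internals): termfreq counts a term's occurrences in one doc, docfreq counts docs containing it, and idf follows the classic 1 + ln(numdocs / (docfreq + 1)) shape.

```python
import math

def termfreq(doc_tokens, term):
    # Occurrences of `term` in one document's token list.
    return doc_tokens.count(term)

def docfreq(docs, term):
    # Number of documents containing `term` at least once.
    return sum(1 for d in docs if term in d)

def idf(docs, term):
    # Classic Lucene-style inverse document frequency.
    return 1.0 + math.log(len(docs) / (docfreq(docs, term) + 1))
```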
faceting
per-segment, single-valued fields:

   facet.method=fcs (field cache per segment)

   facet.field={!threads=-1}field_name

       threads=0: direct execution

       threads=-1: thread per segment

   speeds up single and multivalued method=fc, especially for deep paging with
   facet.offset

date faceting improvements, generalized for numeric ranges too

can now exclude main query q={!tag=main}the+query&facet.field={!ex=main}category
pivot/grid/matrix/tree
faceting


is this also "hierarchical faceting"? it depends!
pivot faceting output
/select?q=*:*&rows=0&facet=on
&facet.pivot=cat,popularity,inStock
&facet.pivot=popularity,cat
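The shape of that pivot output can be sketched as nested counting: bucket docs by the first field, count each bucket, then recurse into the remaining fields. The dict-based `docs` shape is made up for illustration.

```python
def pivot_facets(docs, fields):
    # For each value of fields[0], count matching docs and recurse into
    # the remaining fields -- the tree facet.pivot returns.
    if not fields:
        return []
    field, rest = fields[0], fields[1:]
    buckets = {}
    for doc in docs:
        buckets.setdefault(doc.get(field), []).append(doc)
    return [{"field": field, "value": value, "count": len(group),
             "pivot": pivot_facets(group, rest)}
            for value, group in sorted(buckets.items(),
                                       key=lambda kv: -len(kv[1]))]
```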
spell checking


DirectSolrSpellChecker

  no external index needed, uses automaton on
  main index
spellcheck config
solrconfig.xml
  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">

    <str name="queryAnalyzerFieldType">textgen</str>

    <!-- a spellchecker that uses no auxiliary index -->
     <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">spell</str>
      <str name="classname">solr.DirectSolrSpellChecker</str>
      <str name="minPrefix">1</str>
    </lst>
  </searchComponent>
spellcheck handler

solrconfig.xml
  <requestHandler name="standard" class="solr.SearchHandler" default="true">
    <!-- default values for query parameters -->
     <lst name="defaults">
       <str name="echoParams">explicit</str>
       <str name="spellcheck">true</str>
       <str name="spellcheck.collate">true</str>
     </lst>

    <arr name="last-components">
      <str>spellcheck</str>
    </arr>
  </requestHandler>
spellcheck response
http://localhost:8983/solr/select?q=ipud%20bluck&wt=ruby&indent=on
                        {
                            'responseHeader'=>{
                               'status'=>0,
                               'QTime'=>10,
                               'params'=>{
                                 'indent'=>'on',
                                 'wt'=>'ruby',
                                 'q'=>'ipud bluck'}},
                            'response'=>{'numFound'=>0,'start'=>0,'docs'=>[]
                            },
                            'spellcheck'=>{
                               'suggestions'=>[
                                 'ipud',{
                                   'numFound'=>1,
                                   'startOffset'=>0,
                                   'endOffset'=>4,
                                   'suggestion'=>['ipod']},
                                 'bluck',{
                                   'numFound'=>1,
                                   'startOffset'=>5,
                                   'endOffset'=>10,
                                   'suggestion'=>['black']},
                                 'collation','ipod black']}}
autosuggest

new "spellcheck" component, builds TST

collates query

can check whether collated suggestions yield results,
optionally providing hit counts
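A minimal ternary search tree with prefix lookup, sketching the in-memory structure the suggest component builds (a conceptual sketch, not Solr's Suggester code):

```python
class TSTNode:
    __slots__ = ("ch", "lo", "eq", "hi", "word")
    def __init__(self, ch):
        self.ch, self.lo, self.eq, self.hi, self.word = ch, None, None, None, None

class TST:
    def __init__(self):
        self.root = None

    def insert(self, word):
        self.root = self._insert(self.root, word, 0)

    def _insert(self, node, word, i):
        ch = word[i]
        if node is None:
            node = TSTNode(ch)
        if ch < node.ch:
            node.lo = self._insert(node.lo, word, i)
        elif ch > node.ch:
            node.hi = self._insert(node.hi, word, i)
        elif i + 1 < len(word):
            node.eq = self._insert(node.eq, word, i + 1)
        else:
            node.word = word
        return node

    def suggest(self, prefix):
        # All inserted words starting with `prefix`, sorted.
        node = self._find(self.root, prefix, 0)
        out = []
        if node is None:
            return out
        if node.word:
            out.append(node.word)
        self._collect(node.eq, out)
        return sorted(out)

    def _find(self, node, prefix, i):
        if node is None:
            return None
        ch = prefix[i]
        if ch < node.ch:
            return self._find(node.lo, prefix, i)
        if ch > node.ch:
            return self._find(node.hi, prefix, i)
        if i + 1 == len(prefix):
            return node
        return self._find(node.eq, prefix, i + 1)

    def _collect(self, node, out):
        if node is None:
            return
        self._collect(node.lo, out)
        if node.word:
            out.append(node.word)
        self._collect(node.eq, out)
        self._collect(node.hi, out)
```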
suggest config
solrconfig.xml
  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">

    <str name="queryAnalyzerFieldType">textgen</str>


    <lst name="spellchecker">
      <str name="name">suggest</str>
      <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
      <str name="lookupImpl">
        org.apache.solr.spelling.suggest.jaspell.JaspellLookup
      </str>
      <str name="field">suggest</str>
      <str name="buildOnCommit">true</str>
    </lst>
  </searchComponent>

schema.xml
   <field name="suggest" type="textgen" indexed="true" stored="false"/>

   <copyField source="name" dest="suggest"/>
suggest handler
solrconfig.xml
  <requestHandler class="solr.SearchHandler" name="/suggest">
    <lst name="defaults">
      <str name="spellcheck">true</str>
      <str name="spellcheck.dictionary">suggest</str>
      <str name="spellcheck.collate">true</str>
      <str name="spellcheck.count">10</str>
      <str name="rows">0</str>
      <str name="spellcheck.maxCollationTries">20</str>
      <str name="spellcheck.maxCollations">10</str>
      <str name="spellcheck.collateExtendedResults">true</str>
    </lst>
    <arr name="components">
      <str>query</str> <!-- to allow suggestion hit counts to be returned -->
      <str>spellcheck</str>
    </arr>
  </requestHandler>
suggest response
http://localhost:8983/solr/suggest?q=ip&wt=ruby&indent=on
         {
             'responseHeader'=>{
                'status'=>0,
                'QTime'=>2},
             'response'=>{'numFound'=>0,'start'=>0,'docs'=>[]
             },
             'spellcheck'=>{
                'suggestions'=>[
                  'ip',{
                    'numFound'=>1,
                    'startOffset'=>0,
                    'endOffset'=>2,
                    'suggestion'=>['ipod']},
                  'collation',[
                    'collationQuery','ipod',
                    'hits',3,
                    'misspellingsAndCorrections',[
                      'ip','ipod']]]}}
sort

by function

  &q=*:*&sfield=store&pt=39.194564,-86.432947&sort=geodist() asc

but still can't get value of function back

  unless you force it to be the score somehow
clustering component


now works out-of-the-box; all Apache license
compatible

supports distributed search
debug=true


debug=true|all|timing|query|results

debug=results&debug.explain.structured=true
structured explain
http://localhost:8983/solr/select?q=title:solr&debug.explain.structured=true&debug=results&wt=ruby&indent=on
     'debug'=>{
       'explain'=>{
         'doc1'=>{
           'match'=>true,
           'value'=>0.076713204,
           'description'=>'fieldWeight(title:solr in 0), product of:',
           'details'=>[{
                'match'=>true,
                'value'=>1.0,
                'description'=>'tf(termFreq(title:solr)=1)'},
             {
                'match'=>true,
                'value'=>0.30685282,
                'description'=>'idf(docFreq=1, maxDocs=1)'},
             {
                'match'=>true,
                'value'=>0.25,
                'description'=>'fieldNorm(field=title, doc=0)'}]}}}}
SolrCloud

shared/central config and core/shard management via ZooKeeper

built-in load balancing, and infrastructure for future SolrCloud work
/update/json
solrconfig.xml
  <requestHandler name="/update/json" class="solr.JsonUpdateRequestHandler"/>




     curl
         'http://localhost:8983/solr/update/json?commit=true'
         -H 'Content-type:application/json' -d '
     {
       "add": {
         "doc": {
           "id" : "MyTestDocument",
           "title" : "This is just a test"
         }
       }
     }'
wt=csv

Writes only docs (no response header or response
extras) in CSV format

Roundtrippable with /update/csv

  provided all fields are stored
UIMA
Unstructured Information Management
Architecture

 http://uima.apache.org/

New update processor chain, augmenting
incoming documents from a UIMA annotator
pipeline

 http://wiki.apache.org/solr/SolrUIMA
(solr|lucene)-dev


ant [idea|eclipse]

go!

http://wiki.apache.org/solr/HowToContribute
works in progress

some interesting open issues (with patches):

  PayloadTermQuery

  XMLQueryParser plugin

  join
{!join from=$f to=$t}


insert <what Yonik said>

  https://issues.apache.org/jira/browse/SOLR-2272
Lucid (imagination)
What's Lucid done for you lately?

  Yonik, Mark, Grant, Hoss: Lucene and Solr performance,
  faceting, grouping, join query, spatial, Mahout, ORP, PMC,
  etc, etc, etc

  Other technical staff involved in mailing list assistance, bug
  reporting, contributing patches (hi Lance, Erick, Jay, Tom,
  Grijesh, Tomas....)

  extended dismax, join, faceting performance improvements

  LucidWorks Enterprise
Hoss Simplicity

http://www.lucidimagination.com/blog/2011/01/21/solr-powered-isfdb-part1/

http://www.lucidimagination.com/blog/2011/01/28/solr-powered-isfdb-part-2/
LucidWorks Enterprise
      "lucid" query parser               REST API

      click boosting                     Data sources,
                                         crawlers, and
      tunable norms, per-                scheduling
      field
                                         Alerts
      role filtering

      administrative UI

http://www.lucidimagination.com/enterprise-search-solutions/lucidworks
Community Questions


fire away!
resources


duh!: #code4lib

lucene.apache.org/solr

search.lucidimagination.com/?q=<your query>
Q&A: faceting


why is paging through facets the way it is?

  short-circuits on enum
Community:
- The state of Extended DisMax, and what Lucene features
remain incompatible with it.

- Any developments on faceting (I've implemented the
standard workaround to the "unknown facet list size"
problem...  but I'd still love to be able to know exactly how
long the lists are)

- Hierarchical documents in Solr -- I haven't followed the
conversations closely, but I gather that this topic is gaining
some momentum in the Solr community.
contact info
erik.hatcher @ lucidimagination . com

http://www.lucidimagination.com

  webinars, documentation

  LucidFind: search.lucidimagination.com

    search mailing list posts, wiki pages, web
    sites, our blog, etc for latest Lucene/Solr
    assistance
re: code4lib

code4lib 2011 preconference: What's New in Solr (since 1.4.1)
