What's New in Solr, June 2014

What’s New in Solr
Solr 4.7 & 4.8
June 12, 2014
Search | Discover | Analyze

Speaker: Steve Rowe
• Software Engineer at LucidWorks
• Lucene/Solr committer and PMC member
• Previously worked on search and NLP at the Center for Natural Language Processing at Syracuse University’s iSchool
• Twitter: @steven_a_rowe

Agenda
• A short history of Solr 4
• Solr 4.7 and 4.8: new features
• Solr 4.9 and beyond

A short history of Solr 4
• Solr 4.0 released October 2012

• SolrCloud
– Distributed indexing and searching, NRT and NoSQL
features, e.g. realtime-get, optimistic concurrency and
durable updates
– Sharding, replication, ZooKeeper ensemble
– High availability with no single points of failure
• Real-time Get: Access latest document version, no
commit or new searcher open required
• Atomic updates: incremental field
add/update/increment via stored fields
• NRT: “soft” commits
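As a rough illustration of the atomic updates bullet above, the sketch below builds a JSON update body that uses the set, add, and inc modifiers. The document id and field names are hypothetical, and nothing is sent to a server.

```python
import json


def atomic_update(doc_id, **changes):
    """Build one atomic-update document.

    Each keyword is a field name mapped to a (modifier, value) pair,
    where modifier is one of "set", "add", or "inc".
    """
    allowed = {"set", "add", "inc"}
    doc = {"id": doc_id}
    for field, (modifier, value) in changes.items():
        if modifier not in allowed:
            raise ValueError("unsupported modifier: " + modifier)
        doc[field] = {modifier: value}
    return doc


# Hypothetical document and fields, for illustration only.
update = atomic_update("book-1", price=("inc", 5), tags=("add", "sale"))
body = json.dumps([update], sort_keys=True)
print(body)
```

Only the listed fields change; the remaining fields are rebuilt from their stored values, which is why the slide says "via stored fields".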

• Solr Reference Guide now released with each
feature release:
– Live (targeting next Solr release):
http://s.apache.org/SolrReferenceGuide
– Most recent released PDF:
http://s.apache.org/Solr-Ref-Guide-PDF
– Previous release PDFs:
http://s.apache.org/Older-Solr-Ref-Guide-PDFs

• Flexible indexing
– Solr core = Lucene index
• Lucene index = 1 or more segments
– Codec: per-segment suite of formats
• Flexible scoring
– You can specify similarity implementation per fieldType in
your schema.xml if you use SchemaSimilarityFactory
– Built-in Similarities (other than the default TF-IDF):
• Okapi BM25
• Divergence from Randomness
• Information-Based
• Language Models (with two smoothing implementations)
• SweetSpot

• DocValues: typed column stride fields
– Document-to-value mapping built at index time
– Reduced memory usage compared to field cache
– Good for faceting and sorting
– Missing values now supported as of Solr 4.5
• Pseudo-fields
– Field aliasing, e.g. &fl=result:indexed
– Function queries, aliasable too, e.g. &fl=price:sum(a,b)
– Document transformers
• Standard: [explain], [value], [shard], [docid]
• Pseudo-joins, e.g. ?q={!join+from=manu+to=id}ipod
• Pivot faceting: automatic drill-down (no distributed support)
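The pseudo-field and pseudo-join examples above can be combined into one request. This is a minimal sketch that only builds the query string; the field names come from the slide examples, and the combination of parameters is an assumption made for illustration.

```python
from urllib.parse import urlencode, parse_qs

params = {
    # Pseudo-join from the slide, written with spaces instead of "+".
    "q": "{!join from=manu to=id}ipod",
    # Field alias, aliased function query, and two document transformers.
    "fl": "id,result:indexed,price:sum(a,b),[explain],[docid]",
}

# urlencode escapes the braces and turns spaces into "+".
query_string = urlencode(params)
print(query_string)
```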

• Schema API
– GET /collection/schema/fields/fieldname
– PUT /collection/schema/fields/name
• JSON body: { "type":"text_general", "stored":true, "indexed":true }
• Schemaless mode
– a.k.a. data-driven schema or field guessing
– Class guessed based on field values, then class(es) mapped to a fieldType; first gets added to the schema
– Supported value classes: Boolean, Integer, Long, Float, Double, and Date
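The Schema API PUT shown above could be prepared from Python roughly as follows. This is a sketch only: the base URL, the collection name, and the field name are assumptions, and the request is constructed but never sent.

```python
import json
from urllib.request import Request

BASE = "http://localhost:8983/solr"  # assumed host, port, and context path


def add_field_request(collection, name, field_type, stored=True, indexed=True):
    """Prepare (but do not send) a PUT that adds one field to the schema."""
    url = f"{BASE}/{collection}/schema/fields/{name}"
    body = json.dumps({"type": field_type, "stored": stored, "indexed": indexed})
    return Request(
        url,
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# "collection1" and "summary" are hypothetical names.
req = add_field_request("collection1", "summary", "text_general")
print(req.get_method(), req.full_url)
```

Passing the prepared request to urllib.request.urlopen would send it to a running server.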

• Document routing
– CompositeId router, e.g. id=tenant!docid
• Used by default when numShards is specified at collection creation.
• Restrict queries to shard(s): &_route_=tenant!
– Implicit router
• Online shard splitting
– Allows collections to scale, rather than having to
decide on how much to overshard up front.
– Split in two; with custom hash ranges; or using
split.key param to split to a dedicated shard
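To make the routing bullets above concrete, here is a small sketch that forms a compositeId, restricts a query with _route_, and prepares SPLITSHARD parameters that use split.key. The tenant, document, and collection names are hypothetical.

```python
from urllib.parse import urlencode, parse_qs


def composite_id(tenant, doc_id):
    """Prefix the document id with its routing key."""
    return f"{tenant}!{doc_id}"


def route_params(query, tenant):
    """Query parameters that restrict a search to one tenant's shard(s)."""
    return {"q": query, "_route_": f"{tenant}!"}


doc_id = composite_id("acme", "12345")
routed_query = urlencode(route_params("*:*", "acme"))

# Split the documents of one routing key out to a dedicated shard.
split_query = urlencode(
    {"action": "SPLITSHARD", "collection": "colln", "split.key": "acme!"}
)
print(doc_id, routed_query, split_query, sep="\n")
```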

• Nested documents, a.k.a. Block Join
– Nested doc to be added:
<add>
<doc>
<field name="id">1</field>
<field name="title">Solr adds block join support</field>
<field name="content_type">parentDocument</field>
<doc>
<field name="id">2</field>
<field name="comments">SolrCloud supports it too!</field>
</doc>
</doc>
</add>
– Queries:
• Child query parser, e.g.
q={!child of="content_type:parentDocument"}title:Solr
• Parent query parser, e.g.
q={!parent which="content_type:parentDocument"}comments:SolrCloud
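The nested document above can also be generated programmatically. The sketch below rebuilds the same XML with the standard library and keeps the two block join queries as plain strings; it does not post anything to Solr.

```python
import xml.etree.ElementTree as ET


def field(parent, name, value):
    """Append <field name="...">value</field> to a <doc> element."""
    element = ET.SubElement(parent, "field", name=name)
    element.text = value
    return element


add = ET.Element("add")
parent_doc = ET.SubElement(add, "doc")
field(parent_doc, "id", "1")
field(parent_doc, "title", "Solr adds block join support")
field(parent_doc, "content_type", "parentDocument")

# The child is nested inside the parent <doc>, so both are indexed as a block.
child_doc = ET.SubElement(parent_doc, "doc")
field(child_doc, "id", "2")
field(child_doc, "comments", "SolrCloud supports it too!")

xml_body = ET.tostring(add, encoding="unicode")

# Children of parents whose title matches "Solr".
child_query = '{!child of="content_type:parentDocument"}title:Solr'
# Parents that have a child whose comments match "SolrCloud".
parent_query = '{!parent which="content_type:parentDocument"}comments:SolrCloud'
print(xml_body)
```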

• solr.xml legacy & discovery modes
– Legacy mode (cores listed in solr.xml) is
deprecated; support will be removed in Solr 5.
– Discovery mode (new as of Solr 4.3):
• No cores are listed in solr.xml
• Cores are discovered by a recursive walk of the solr
home directory, marked by core.properties files
• Nested core directories are not allowed
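The discovery rules above can be pictured with a short directory walk. This is a sketch of the described behavior, not Solr's actual implementation: a directory containing core.properties counts as a core, and the walk does not descend below it.

```python
import os
import tempfile


def discover_cores(solr_home):
    """Return directories under solr_home that contain core.properties."""
    cores = []
    for dirpath, dirnames, filenames in os.walk(solr_home):
        if "core.properties" in filenames:
            cores.append(dirpath)
            # Nested cores are not allowed, so stop descending here.
            dirnames[:] = []
    return sorted(cores)


# Demonstration on a throwaway directory tree with hypothetical core names.
with tempfile.TemporaryDirectory() as home:
    for rel in ("core1", os.path.join("group", "core2"), os.path.join("core1", "nested")):
        os.makedirs(os.path.join(home, rel))
        open(os.path.join(home, rel, "core.properties"), "w").close()
    found = [os.path.relpath(path, home) for path in discover_cores(home)]

print(found)
```

In the demonstration, the core.properties file under core1/nested is never reached because the walk stops at core1.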

• New web admin UI with SolrCloud support

Solr 4.7 and 4.8: new features
• As of Solr 4.8, Java 7 is the minimum supported
JVM version. Recommended: Oracle 1.7.0_60
• <fields> and <types> tags are no longer necessary in
schema.xml
• Collections API improvements
– Working toward “ZooKeeper = Truth” mode
• legacyCloud=false cluster property
– New actions:
• CLUSTERSTATUS, LIST, ADDROLE, DELETEROLE,
ADDREPLICA, DELETEREPLICA, OVERSEERSTATUS,
MIGRATE, CLUSTERPROP
– Core properties can be specified with CREATE and
SPLITSHARD actions
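A few of the Collections API actions listed above, expressed as request URLs. This is a sketch that only builds strings; the base URL, collection name, and shard name are assumptions.

```python
from urllib.parse import urlencode

BASE = "http://localhost:8983/solr"  # assumed host, port, and context path


def collections_url(action, **params):
    """Build a Collections API URL for the given action and parameters."""
    query = {"action": action}
    query.update(params)
    return f"{BASE}/admin/collections?{urlencode(query)}"


status_url = collections_url("CLUSTERSTATUS")
# Switch on the "ZooKeeper = Truth" behavior via a cluster property.
prop_url = collections_url("CLUSTERPROP", name="legacyCloud", val="false")
# "colln" and "shard1" are hypothetical names.
replica_url = collections_url("ADDREPLICA", collection="colln", shard="shard1")

for url in (status_url, prop_url, replica_url):
    print(url)
```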

• Asynchronous execution of long-running
actions
– SolrCloud Collections API:
• CREATE, SPLITSHARD, MIGRATE
– CoreAdminHandler:
• CREATE, RENAME, UNLOAD, SWAP, MERGEINDEXES,
SPLIT
– Tracking request ID supplied via async param
– Track status via the new REQUESTSTATUS action,
using the tracking request ID
• Possible states: running, complete, failed, notfound
– Clear stored statuses with special request ID -1
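The submit-then-poll pattern described above might look like the following. It is a sketch under stated assumptions: the base URL, collection, and shard names are hypothetical, and get_state is a caller-supplied function standing in for a real REQUESTSTATUS call, so the example runs without a server.

```python
import time
from urllib.parse import urlencode


def async_urls(base, request_id, action, **params):
    """Return (submit_url, status_url) for an asynchronous Collections API call."""
    submit = dict(action=action, **params)
    submit["async"] = request_id  # tracking request ID chosen by the client
    status = {"action": "REQUESTSTATUS", "requestid": request_id}
    prefix = base + "/admin/collections?"
    return prefix + urlencode(submit), prefix + urlencode(status)


def wait_for(get_state, delay=0.0, max_polls=100):
    """Poll get_state() until it reports something other than "running"."""
    for _ in range(max_polls):
        state = get_state()
        if state != "running":
            return state
        time.sleep(delay)
    return "running"


submit_url, status_url = async_urls(
    "http://localhost:8983/solr", "1000", "SPLITSHARD",
    collection="colln", shard="shard1",
)

# Simulated sequence of states, in place of real status responses.
states = iter(["running", "running", "complete"])
final_state = wait_for(lambda: next(states))
print(submit_url, status_url, final_state, sep="\n")
```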

• Cursors: Efficient Deep Paging
– Request must include a sort, which must include
the uniqueKey, which must be defined
– First page: ?q=…&sort=id+asc&rows=N&cursorMark=*
• Response contains "nextCursorMark":"<base64encoded>"
– Following pages:
?q=…&sort=id+asc&rows=N&cursorMark=<from response>
– Repeat; when nextCursorMark=cursorMark from the
request, there are no more results
– No server-side state
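The cursor protocol above reduces to a simple client loop. In this sketch, fetch_page is a caller-supplied function (an assumption) that takes request parameters and returns the parsed JSON response; a fake implementation is used so the loop can run without a server.

```python
def iterate_with_cursor(fetch_page, query, sort="id asc", rows=100):
    """Yield every matching document by following nextCursorMark."""
    cursor = "*"  # the first page always starts from "*"
    while True:
        response = fetch_page(
            {"q": query, "sort": sort, "rows": rows, "cursorMark": cursor}
        )
        for doc in response["response"]["docs"]:
            yield doc
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:
            break  # same mark returned: no more results
        cursor = next_cursor


def fake_fetch(params):
    """Stand-in for an HTTP call; the marks here are made up."""
    pages = {
        "*": (["a", "b"], "mark1"),
        "mark1": (["c"], "mark2"),
        "mark2": ([], "mark2"),
    }
    ids, next_mark = pages[params["cursorMark"]]
    return {
        "response": {"docs": [{"id": doc_id} for doc_id in ids]},
        "nextCursorMark": next_mark,
    }


ids = [doc["id"] for doc in iterate_with_cursor(fake_fetch, "*:*", rows=2)]
print(ids)
```

Because the server keeps no state, the client can pause between pages for as long as it likes and resume with the last mark it received.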

• Document expiration and Time To Live (TTL)
– Auto-delete expired documents
• DocExpirationUpdateProcessorFactory can periodically
wake up and delete expired documents
– Compute expiration date from TTL
• Update request _ttl_ param, or
• Document _ttl_ field
• Both names are configurable, defaulting to _ttl_.
• _ttl_ values are interpreted as Date Math Expressions
relative to NOW, e.g. “+1YEAR”.
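To show how an expiration date follows from a TTL, the sketch below handles a deliberately small subset of Date Math (a leading "+", an integer, and a unit of seconds, minutes, hours, or days). Full Date Math also covers units such as YEAR and MONTH as well as rounding, none of which is attempted here.

```python
import re
from datetime import datetime, timedelta, timezone

# Simplified unit table; calendar units such as YEAR and MONTH are left out.
_UNITS = {"SECOND": "seconds", "MINUTE": "minutes", "HOUR": "hours", "DAY": "days"}


def expiration_from_ttl(ttl, now):
    """Compute now + ttl for the simplified "+<n><UNIT>[S]" form."""
    match = re.fullmatch(r"\+(\d+)(SECOND|MINUTE|HOUR|DAY)S?", ttl)
    if match is None:
        raise ValueError("unsupported TTL in this sketch: " + ttl)
    amount, unit = int(match.group(1)), match.group(2)
    return now + timedelta(**{_UNITS[unit]: amount})


now = datetime(2014, 6, 12, tzinfo=timezone.utc)
expires = expiration_from_ttl("+30DAYS", now)
print(expires.isoformat())
```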

• Dynamic synonyms and stopwords
– “Managed” resources: configuration and content for
synonyms and stopwords, persistence managed by Solr
– Specified as ManagedSynonymFilterFactory and
ManagedStopFilterFactory on analyzers in schema.xml
– CRUD operations are enabled via a REST endpoint per
managed resource.
– The “managed” attribute names the REST endpoint, e.g.
<filter class="solr.ManagedStopFilterFactory"
managed="french" />
– E.g. to delete stopword “le” from the “french” managed
stoplist:
curl -X DELETE "…/solr/colln/schema/analysis/stopwords/french/le"
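The same REST endpoint can be driven from Python. This sketch prepares, without sending, a PUT that adds words to the "french" list and the DELETE from the curl example; the base URL is an assumption, and sending the words as a JSON list is my understanding of the endpoint rather than something stated on the slide.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def stopwords_url(base, collection, managed_name, word=None):
    """URL of a managed stopword list, or of one word within it."""
    url = f"{base}/solr/{collection}/schema/analysis/stopwords/{quote(managed_name)}"
    if word is not None:
        url += "/" + quote(word, safe="")
    return url


def add_stopwords(base, collection, managed_name, words):
    """Prepare a PUT that adds words to the managed list."""
    return Request(
        stopwords_url(base, collection, managed_name),
        data=json.dumps(words).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def delete_stopword(base, collection, managed_name, word):
    """Prepare a DELETE that removes one word from the managed list."""
    return Request(stopwords_url(base, collection, managed_name, word), method="DELETE")


BASE = "http://localhost:8983"  # assumed host and port
put_req = add_stopwords(BASE, "colln", "french", ["le", "la"])
del_req = delete_stopword(BASE, "colln", "french", "le")
print(put_req.get_method(), put_req.full_url)
print(del_req.get_method(), del_req.full_url)
```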

• SSL support in SolrCloud
– URL scheme stored in ZooKeeper
– SSL certificates are specifiable via system properties, to
enable authentication
• Nested documents may be specified in JSON format
• Tri-level compositeId routing
– E.g. “tenant!group!docid”, 8/8/16 hash bits per component
• Build Solr indexes with Hadoop’s MapReduce
– Mark Miller’s blog: http://bit.ly/1oh0fWq
– GitHub solr-map-reduce-example: http://bit.ly/1pnDAao
• Named config sets in non-SolrCloud mode
– Default base directory is SOLR_HOME/configsets/
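The 8/8/16 bit split for tri-level compositeId routing mentioned above can be sketched as a bit-masking exercise. Two assumptions are made for illustration: CRC32 stands in for whatever hash function Solr actually uses, and the components are assumed to fill the 32-bit hash from the most significant bits down.

```python
import zlib


def _hash32(text):
    """Placeholder 32-bit hash (CRC32); not the hash Solr uses."""
    return zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF


def tri_level_hash(composite_id):
    """Combine tenant, group, and doc id hashes as 8/8/16 bits."""
    tenant, group, doc_id = composite_id.split("!", 2)
    return (
        (_hash32(tenant) & 0xFF000000)    # top 8 bits from the tenant
        | (_hash32(group) & 0x00FF0000)   # next 8 bits from the group
        | (_hash32(doc_id) & 0x0000FFFF)  # low 16 bits from the doc id
    )


# Hypothetical ids: same tenant and group, then same tenant only.
h1 = tri_level_hash("acme!sales!doc1")
h2 = tri_level_hash("acme!sales!doc2")
h3 = tri_level_hash("acme!support!doc1")
print(hex(h1), hex(h2), hex(h3))
```

Documents sharing a tenant and group agree on the top 16 bits of the hash, so they fall into a narrow hash range and tend to land on the same shard.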

• Suggester v2
– Added BlendedInfixSuggester
– Added FreeTextSuggester
– Queries can use multiple suggesters
• New query parsing features
– SimpleQParserPlugin: parser for human entered
queries with selectable operators.
– ComplexPhraseQParserPlugin: wildcards, ORs, etc.
inside Phrase Queries
• E.g. {!complexphrase inOrder=true}name:"Jo* Smith"

• CollapsingQParserPlugin
– Performant alternative grouping/field collapsing
implementation, for high distinct group cardinality.
• ExpandComponent
– Expands collapsed groups
– Can also expand nested documents
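Collapsing and expanding are driven by request parameters. This sketch builds a query string that collapses on a field and asks the ExpandComponent to return the collapsed group members; the field name group_id and the query term are hypothetical.

```python
from urllib.parse import urlencode, parse_qs

params = {
    "q": "ipod",
    # Collapse results to one document per distinct group_id value.
    "fq": "{!collapse field=group_id}",
    # Ask the ExpandComponent for the members of each collapsed group.
    "expand": "true",
}

collapse_query = urlencode(params)
print(collapse_query)
```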

Solr 4.9 and beyond
• ZooKeeper = Truth / legacyCloud=false
• MODIFYCOLLECTION collections API
– Modify maxShardsPerNode, replicationFactor for the
entire collection
• Incremental Field Updates on numeric
DocValues
– Binary DocValues IFUs also coming
• Multi-valued DocValues sort fields
• Legacy numeric/date field types deprecated,
removed in Solr 5 in favor of Trie field types
• In Solr 5, the .war will no longer be shipped
• Index integrity: checksums
– Integrity check on merge off by default
– solrconfig.xml option <indexConfig><checkIntegrityAtMerge>
• New update query param min_rf will allow clients
to set the minimum successful replicas for the
request
• Return Block Join child documents when parents
match, via a new DocTransformer
[child parentFilter="field:value"]
• AnalyticsQuery: support pluggable, pipeline-able
analytics, orderable via the “cost” parameter, like
PostFilters.
• ReRankingQParserPlugin
– Re-rank the top n results
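A re-ranking request might be assembled as below. This is a sketch based on my understanding of the rq parameter syntax; the query terms and the values for reRankDocs and reRankWeight are hypothetical.

```python
from urllib.parse import urlencode, parse_qs

params = {
    "q": "greetings",
    # Re-score the top 1000 results of q using the query stored in $rqq.
    "rq": "{!rerank reRankQuery=$rqq reRankDocs=1000 reRankWeight=3}",
    "rqq": "(hi hello hey)",
}

rerank_query = urlencode(params)
print(rerank_query)
```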

Platform
LucidWorks Open Source
• Effortless AWS deployment and monitoring:
http://www.github.com/lucidworks/solr-scale-tk
• Logstash for Solr:
https://github.com/LucidWorks/solrlogmanager
• Banana (Kibana for Solr):
https://github.com/LucidWorks/banana
• Data Quality Toolkit: https://github.com/LucidWorks/data-quality
• Coming Soon for Big Data: Hadoop, Pig, Hive 2-way
support w/ Lucene and Solr, different file formats, pipelines,
Logstash

Links
Solr website: http://lucene.apache.org/solr
Solr Reference Guide:
• Live (targeting next Solr release):
http://s.apache.org/SolrReferenceGuide
• Most recent released PDF: http://s.apache.org/Solr-Ref-Guide-PDF
• Previous release PDFs: http://s.apache.org/Older-Solr-Ref-Guide-PDFs
Lucene/Solr Revolution: http://www.LuceneRevolution.org
Q & A
