Lucene Roadmap
Steve Rowe
LucidWorks
Some Lucene (& Solr) History & Stats
• 1997: Doug Cutting creates Lucene
• 2000-2001: SourceForge hosts Lucene
• 2001-present: Lucene @ Apache Software Foundation
• 2006: Flexible indexing planning starts
• 2007: Solr graduates from the Apache Incubator to join the Lucene PMC as a sub-project
• 2008: Flexible indexing implementation begins
• 2010: Lucene and Solr development merge
• 2011: Lucene and Solr 3.1 and all further releases coordinated (13 joint releases so far)
• 2012: Lucene/Solr 4.0 released
Lucene 4.0 Highlights
• Flexible indexing: pluggable codecs: index format suites
• Flexible scoring: more index stats & similarities that use them
• Faster multithreaded indexing via concurrent flushing: DWPT
• Doc Values: typed single-valued fields: flexible sorting, scoring
• Norms are now doc values: you can have more than one byte!
• More RAM efficient data structures, e.g. terms dict/idx & fieldcache
• Faster search filtering
• Merge I/O can be rate-limited, to reduce I/O contention
• IndexReader is now per-segment
• Completely reworked spatial search
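To make the doc values bullet above concrete: the idea is one typed value per document, stored column-wise and addressed by doc id, so that sorting or scoring can read a value without loading the stored document. The sketch below is a toy model in Python, not Lucene's API; `NumericColumn` and `sort_hits` are hypothetical names introduced only for illustration.

```python
from array import array


class NumericColumn:
    """Toy column-stride field: one integer per document, indexed by doc id."""

    def __init__(self, num_docs, missing=0):
        # A packed array of 64-bit integers, one slot per document.
        self.values = array("q", [missing] * num_docs)

    def set(self, doc_id, value):
        self.values[doc_id] = value

    def get(self, doc_id):
        return self.values[doc_id]


def sort_hits(hits, column, reverse=False):
    """Order matching doc ids by their column value (stable for ties)."""
    return sorted(hits, key=column.get, reverse=reverse)


# Hypothetical data: five documents with a "price" value each.
prices = NumericColumn(num_docs=5)
for doc_id, price in enumerate([30, 10, 50, 20, 40]):
    prices.set(doc_id, price)

hits = [0, 2, 3, 4]  # doc ids that matched some query
print(sort_hits(hits, prices))  # ascending by price
```

The point of the layout is that a sort touches only one compact array rather than per-document stored fields.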
Lucene 4.1 & 4.2 Highlights
• Seeks on writing out index files eliminated
• Compressed stored fields and term vectors
• AnalyzingSuggester and FuzzySuggester
• Lucene facet module improvements: speedups, NRT support, DrillSideways
• PostingsHighlighter: uses postings offsets
• CommonTermsQuery: speed up queries with very highly frequent terms
• Doc Values API and performance improvements
• The FST package supports FSTs over 2GB in size
• LiveFieldValues: real-time get for Lucene
• New classification module
Lucene 4.3 Highlights
• minShouldMatch BooleanQuery: major performance improvement
• SortingAtomicReader and SortingMergePolicy
• DocIdSetIterator and Scorer now have a cost API
• Analyzing/FuzzySuggester now enable recording an arbitrary byte[] as a payload
• Spatial module: support for query relations Within, Contains, and Disjoint
• Facet module: new method computes facet counts using SortedSetDocValuesField, without a separate taxonomy index
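One way to see why a cost estimate on iterators is useful: when several clauses must all match, leading with the clause expected to return the fewest documents keeps the number of lookups small. The following is a toy Python sketch of that idea under stated assumptions; `PostingsIterator` and `conjunction` are hypothetical names, and this is not how Lucene's scorers are actually implemented.

```python
from bisect import bisect_left


class PostingsIterator:
    """Toy stand-in for a doc-id iterator over a sorted postings list."""

    def __init__(self, doc_ids):
        self.doc_ids = sorted(doc_ids)

    def cost(self):
        # Upper bound on how many documents this iterator can return.
        return len(self.doc_ids)

    def contains(self, doc_id):
        # Binary search plays the role of "advance to doc_id" in this sketch.
        i = bisect_left(self.doc_ids, doc_id)
        return i < len(self.doc_ids) and self.doc_ids[i] == doc_id


def conjunction(iterators):
    """AND the iterators together, leading with the cheapest one."""
    if not iterators:
        return []
    ordered = sorted(iterators, key=lambda it: it.cost())
    lead, others = ordered[0], ordered[1:]
    return [d for d in lead.doc_ids if all(o.contains(d) for o in others)]


rare = PostingsIterator([7, 42])
medium = PostingsIterator([3, 7, 42, 99])
common = PostingsIterator(range(1000))
print(conjunction([common, medium, rare]))
```

Here only the two documents of the rare clause are ever checked against the other clauses, regardless of the order in which the clauses were supplied.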
On the horizon
• More efficient positional queries
• Incremental field updates
• Korean Analyzer
Solr Dev/User Survey Results
Solr Developer/User survey, April 2013
• Survey invitation emailed to 4,136 people:
– LucidWorks training class attendees
– Revolution attendees
– LucidWorks webinar registrants
• 177 have responded so far
Please rank the following features by priority
Answered: 165 Skipped: 12
More questions
1. How many attendees are Eclipse developers?
2. How many attendees are running Solr Cloud in production?
Solr: Past, Present & Future
Yonik Seeley
LucidWorks
Origins of Solr
• CNET was driven to find alternatives to a discontinued commercial enterprise search product
• Plan A: ATOMICS (Apache TO MySQL In CNET Search)
– Standalone server speaking XML over HTTP
– Meet the majority of “search” needs
– http://conferences.oreillynet.com/cs/mysqluc2005/view/e_sess/7066
• Plan B: “Something based on Lucene”
– Started Summer 2004
– First prototype called “Fusion”, later renamed SOLAR (Search On Lucene And Resin)
Origins of the first Solr admin UI
New admin UI
Timeline (up to 1.4)
[Figure: timeline of Solr milestones and features across releases 1.0, 1.1, 1.2, 1.3, and 1.4, with 3.1 and 4.0 also marked]
• Milestones: initial prototype; CNET production; CNET contributes Solr to ASF; Solr graduates from Incubator
• Features shown: Simple faceting, replication, highlighting, dismax, Spellchecking, CSV, Luke, MLT, Update Request Processors, QParsers, Search Components, Multi-core, Distributed Search, Data Import Handler, JMX, Statistics Component, Java Replication, Terms and TermVector Components, Multi-select faceting, Dynamic Clustering
Solr 4
• Solr Cloud
– Distributed Indexing
– No single points of failure
– Near Real Time friendly (push replication)
• NoSQL feature set
– Update Durability
– Real-time get
– Atomic Updates
– Optimistic Concurrency
• Pseudo-join, Pivot Faceting, Pseudo-fields, etc
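The "Atomic Updates" and "Optimistic Concurrency" bullets above can be illustrated with a toy in-memory store: an update changes only the named fields, and a writer may pass the version it last read so that a concurrent change is detected instead of silently overwritten. This is a minimal Python sketch, not Solr's implementation; `ToyStore` and `VersionConflict` are hypothetical names, and Solr's real rules (for example around its `_version_` field) are richer than this.

```python
class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""


class ToyStore:
    def __init__(self):
        self.docs = {}   # doc id -> (version, fields)
        self.clock = 0   # monotonically increasing version source

    def get(self, doc_id):
        return self.docs.get(doc_id)

    def update(self, doc_id, fields, expected_version=None):
        current = self.docs.get(doc_id)
        current_version = current[0] if current else None
        # Optimistic concurrency: reject the write if someone else got in first.
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"expected {expected_version}, found {current_version}")
        self.clock += 1
        merged = dict(current[1]) if current else {}
        merged.update(fields)  # atomic-update style: only named fields change
        self.docs[doc_id] = (self.clock, merged)
        return self.clock


store = ToyStore()
v1 = store.update("doc5", {"price": 10})
v2 = store.update("doc5", {"stock": 3}, expected_version=v1)
try:
    store.update("doc5", {"price": 12}, expected_version=v1)  # stale version
except VersionConflict as err:
    print("conflict:", err)
```

A client that hits the conflict would typically re-read the document, reapply its change, and retry with the fresh version.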
What search solution/version are you currently using?
Recent Enhancements
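Document Routing
[Figure: hash ring for numShards=4, router=compositeId, divided among shard1, shard2, shard3, and shard4 into the ranges 00000000-3fffffff, 40000000-7fffffff, 80000000-bfffffff, and c0000000-ffffffff]
• id = BigCo!doc5 hashes (MurmurHash3) to 1f27 3c71
• q=my_query with shard.keys=BigCo! covers the hash range 1f27 0000 to 1f27 ffff (resolved to shard1 in the figure)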
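The routing example above suggests that, with the compositeId router, the part of the id before "!" determines the top 16 bits of the 32-bit hash and the rest of the id determines the bottom 16 bits, so every document sharing a prefix falls into one narrow hash range. The Python sketch below works through that arithmetic. It takes the two 32-bit hashes as plain integers rather than computing MurmurHash3, treats the hash space as unsigned for simplicity, and uses hypothetical helper names throughout; the raw input hashes in the demo are invented so that the result matches the 1f27 3c71 value on the slide.

```python
def composite_hash(prefix_hash, doc_hash):
    """Top 16 bits from the prefix hash, bottom 16 bits from the doc hash."""
    return (prefix_hash & 0xFFFF0000) | (doc_hash & 0x0000FFFF)


def prefix_range(prefix_hash):
    """Hash range covered by every id that shares this prefix."""
    top = prefix_hash & 0xFFFF0000
    return top, top | 0x0000FFFF


def shard_ranges(num_shards):
    """Split the unsigned 32-bit hash space into equal contiguous ranges."""
    size = 2**32 // num_shards
    return [
        (i * size, 0xFFFFFFFF if i == num_shards - 1 else (i + 1) * size - 1)
        for i in range(num_shards)
    ]


def range_for(hash32, ranges):
    """Return the (lo, hi) range that contains hash32."""
    for lo, hi in ranges:
        if lo <= hash32 <= hi:
            return lo, hi
    return None


# Hypothetical raw hashes for "BigCo" and "doc5"; only the masked bits matter.
h = composite_hash(0x1F27ABCD, 0x99993C71)
ranges = shard_ranges(4)
print(f"{h:08x}")
print([f"{lo:08x}-{hi:08x}" for lo, hi in ranges])
print("{:08x}-{:08x}".format(*range_for(h, ranges)))
print("{:08x}-{:08x}".format(*prefix_range(0x1F27ABCD)))
```

Because the whole prefix range sits inside a single shard range, a query restricted with shard.keys=BigCo! needs to visit only the shard that owns that range.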
Seamless Online Shard Splitting
[Figure: Shard1, Shard2, and Shard3, each with a leader and a replica; an incoming update reaches Shard2, which is being split into sub-shards Shard2_0 and Shard2_1]
1. New sub-shards created in “construction” state
2. Leader starts forwarding applicable updates, which are buffered by the sub-shards
3. Leader index is split and installed on the sub-shards
4. Sub-shards apply buffered updates then become “active” leaders and old shard becomes “inactive”
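The four splitting steps above can be traced with a small simulation. This is a toy Python model under loud assumptions: documents are keyed directly by their hash, the parent range is split at its midpoint, and `SubShard` and `split_shard` are hypothetical names. It only shows the ordering of buffering, splitting, and replay, not Solr's actual mechanics.

```python
class SubShard:
    def __init__(self, name, lo, hi):
        self.name, self.lo, self.hi = name, lo, hi
        self.state = "construction"  # step 1: created in "construction" state
        self.buffer = []             # updates held while the split runs
        self.index = {}              # hash -> document

    def owns(self, doc_hash):
        return self.lo <= doc_hash <= self.hi


def split_shard(parent_index, lo, hi, updates_during_split):
    mid = (lo + hi) // 2
    subs = [SubShard("Shard2_0", lo, mid), SubShard("Shard2_1", mid + 1, hi)]

    # Step 2: the leader forwards applicable updates; sub-shards buffer them.
    for doc_hash, doc in updates_during_split:
        for sub in subs:
            if sub.owns(doc_hash):
                sub.buffer.append((doc_hash, doc))

    # Step 3: the leader's index is split and installed on the sub-shards.
    for doc_hash, doc in parent_index.items():
        for sub in subs:
            if sub.owns(doc_hash):
                sub.index[doc_hash] = doc

    # Step 4: replay buffered updates, then switch states.
    for sub in subs:
        for doc_hash, doc in sub.buffer:
            sub.index[doc_hash] = doc
        sub.buffer.clear()
        sub.state = "active"
    return subs, "inactive"  # the old shard becomes "inactive"


parent = {10: "a", 60: "b"}
updates = [(20, "c"), (60, "b2")]  # arrive while the split is in progress
subs, parent_state = split_shard(parent, 0, 99, updates)
for sub in subs:
    print(sub.name, sub.state, sub.index)
print("parent:", parent_state)
```

Replaying the buffer after installing the split index is what lets an update that arrived mid-split (the new value for hash 60) win over the older copy taken from the parent.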
Cloud Enhancements
• Request forwarding
– In a multi-collection cluster, any node can
handle/forward requests for any collection
• Collection Aliases
http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=northeast&collections=NY,NJ,PA,CT,ME,MA,NH,RI,VT
• Coming Soon: Shard Aliases
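For scripting, the alias URL shown above can be assembled from its parts rather than typed by hand. A minimal Python sketch follows; `create_alias_url` is a hypothetical helper, and the base URL is the localhost address from the slide. The request itself is not sent here.

```python
from urllib.parse import urlencode


def create_alias_url(base, alias, collections):
    """Build a CREATEALIAS request URL for the Collections API."""
    params = {
        "action": "CREATEALIAS",
        "name": alias,
        "collections": ",".join(collections),
    }
    # safe="," keeps the comma-separated collection list readable.
    return base + "/admin/collections?" + urlencode(params, safe=",")


states = ["NY", "NJ", "PA", "CT", "ME", "MA", "NH", "RI", "VT"]
url = create_alias_url("http://localhost:8983/solr", "northeast", states)
print(url)
```

Once such an alias exists, clients can address "northeast" as if it were a single collection.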
Schema REST API
• Restlet is now integrated with Solr
• Get a specific field
curl http://localhost:8983/solr/schema/fields/price
{"field":{
"name":"price",
"type":"float",
"indexed":true,
"stored":true }}
• Get all fields
curl http://localhost:8983/solr/schema/fields
• Get Entire Schema!
curl http://localhost:8983/solr/schema
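A client would usually parse the JSON returned by these endpoints rather than read it by eye. The Python sketch below parses the sample response for the price field exactly as shown on the slide; `describe_field` is a hypothetical helper, and no HTTP request is made, so the response body is hard-coded.

```python
import json

# Response body copied from the "Get a specific field" example.
SAMPLE = '{"field":{"name":"price","type":"float","indexed":true,"stored":true}}'


def describe_field(body):
    """Summarize one field definition from a schema/fields/<name> response."""
    field = json.loads(body)["field"]
    flags = [key for key in ("indexed", "stored") if field.get(key)]
    return f'{field["name"]}: {field["type"]} ({", ".join(flags)})'


print(describe_field(SAMPLE))
```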
Dynamic Schema
• Add a new field (Solr 4.4)
curl -XPUT http://localhost:8983/solr/schema/fields/strength -d '{"type":"float", "indexed":"true"}'
• Works in distributed (cloud) mode too!
• Future: More schemaless
– Reality: there is no such thing for Lucene based systems
– Type guessing for fields we haven’t seen before
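To illustrate how "type guessing for fields we haven't seen before" could feed the add-field call above, the Python sketch below picks a field type from a sample value and builds (without sending) the corresponding PUT request. Everything here is an assumption for illustration: `guess_type` and `add_field_request` are hypothetical helpers, the type names "boolean", "long", "float", and "string" are assumed to exist in the target schema, and the body sends "indexed" as a JSON boolean where the slide sends the string "true".

```python
import json
from urllib.request import Request


def guess_type(value):
    """Pick a schema field type name from a sample Python value."""
    if isinstance(value, bool):   # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    return "string"


def add_field_request(base, name, sample_value):
    """Build a PUT request that would add a field of the guessed type."""
    body = json.dumps({"type": guess_type(sample_value), "indexed": True})
    return Request(
        f"{base}/schema/fields/{name}",
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


req = add_field_request("http://localhost:8983/solr", "strength", 7.5)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would actually send it; that step is left out so the sketch runs without a Solr server.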
Future
• Greater scalability
• More “NoSQL”
– More ways to update & manipulate documents
• Analytics
– More powerful faceting, functions, statistics
• Improved Relational queries
• More dynamic (settings & configuration)
• Continued focus on ease of use
Thank You!

KEYNOTE: Lucene / Solr road map
