What’s new in Lucene and Solr?
Grant Ingersoll
CTO, LucidWorks
Lucene/Solr Committer
Sink or Swim?
Search is good for…
• Traditional: Fast, fuzzy text matching across a large document
collection
• De-normalized data
– “light” relational
• Top N problems
– Key-value (top 1)
– Recommendations, “Good enough” classification, clustering
• Faceting, slicing and dicing of numerical/enumerated data
• Spatial, spell checking, record linkage, highlighting
• NoSQL
What’s New?
• Community
• Lucene
• Solr
Relax, You’re Among Friends
• Large, diverse search community with many non-traditional search
engine usages
– Object stores, Record linkage, Social, mobile -> web
• “The Apache Way”
– Meritocracy – Those who do, decide!
• Always Be Testing
– Randomized system tests are all the rage
– http://vimeo.com/32087114
• Patches Welcome!
Acceleration!
Coming Soon: Lucene and Solr 4.8
Java 1.7
Lucene: Speed and Memory
• Native Near Real Time (NRT) support
– Per segment
– FieldCache can be controlled to only load new segments
– Soft commit -- faster without fsync, allows quicker update visibility
• DWPT (Document Writer per Thread)
– Faster, more consistent indexing speed
• Faster fuzzy & wildcard query processing
• Automatic compression of stored fields and term vectors
• String -> BytesRef
– Much improved data structure
– … means less memory and less garbage collection effort
Lucene: Flexibility
• Flexible Index Formats
– New posting list codecs: Block, Simple Text, HDFS, etc.
– Pulsing codec: improves performance of primary key searches, inlining
docs, positions, and payloads, saves disk seeks
• Pluggable Scoring
– Decoupled from TF/IDF
– Built-in alternatives include BM25, DFR, and others
• http://en.wikipedia.org/wiki/Okapi_BM25
• http://terrier.org/docs/v3.5/dfr_description.html
– Add your own
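To make the pluggable-scoring idea concrete, here is a minimal sketch of textbook Okapi BM25 in plain Python. It is not Lucene's implementation: the function name `bm25_scores`, the toy documents, and the defaults `k1=1.2` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with textbook Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF; the "1 +" inside the log keeps it non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy corpus, already tokenized on whitespace.
docs = [
    "lucene is a search library".split(),
    "solr is a search server built on lucene".split(),
    "banana is a dashboard".split(),
]
scores = bm25_scores(["lucene", "search"], docs)
```

Unlike raw TF/IDF, the term-frequency contribution saturates (controlled by `k1`) and longer documents are penalized (controlled by `b`), which is why the shorter first document outscores the second here.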
FS(A|T)
• Keys:
– byte[] – write-once
– Linear time build of min. automata
– Compression, Reverse lookups
– Weights (used for auto-suggest)
– Pluggable Algebra
• Uses:
– Term Dictionary, TokenStreams, Japanese, synonyms, spelling, others
– FuzzyQuery is 100x faster -- http://bit.ly/hgO65c
• More:
– http://slidesha.re/vKtpVA, http://bit.ly/Pkjyu0
– “Smaller Representation of Finite State Automata”
• Proc. of the 16th Inter. Conf. on Implementation and Application of Automata, CIAA'2011, vol. 6807,
2011, pp. 118-192.
Grab Bag
• Lots of new suggesters
– Available in Solr
• Doc Values
– Column oriented store
– Numeric and binary variants are updatable (coming to Solr soon)
• Overhauled term vectors APIs
– Now look a lot like Terms
Solr 4: New Features
• Search/Faceting/Relevance
– New Relevance Function Queries (tf, df, others)
– Pivot Faceting
– Pseudo-join
– Improved Spatial (more later)
– Full support for Lucene Codecs, pluggable scoring
• Indexing
– New Update Processors, including scripting option
– Near real time
• Schema and Config APIs + Schemaless
• Cursors (aka Deep Paging)
• Admin UI
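Cursors (deep paging) can be sketched as a loop: start with `cursorMark=*`, sort on the unique key, and keep passing back `nextCursorMark` until it stops changing. In the Python sketch below, `make_fake_fetch` is a hypothetical in-memory stand-in for an HTTP call to Solr, and the `id` field and page size are assumptions.

```python
def iterate_with_cursor(fetch, rows=2):
    """Yield every document by following cursor marks until they stop changing."""
    cursor = "*"  # starting value for Solr's cursorMark parameter
    while True:
        response = fetch({"q": "*:*", "sort": "id asc", "rows": rows, "cursorMark": cursor})
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:  # an unchanged mark means no more results
            return
        cursor = next_cursor

def make_fake_fetch(ids):
    """Hypothetical in-memory stand-in for a request to Solr's select handler."""
    ordered = sorted(ids)

    def fetch(params):
        mark = params["cursorMark"]
        # Here the fake cursor is simply the last id already returned.
        remaining = ordered if mark == "*" else [i for i in ordered if i > mark]
        page = remaining[: params["rows"]]
        next_mark = page[-1] if page else mark
        return {"response": {"docs": [{"id": i} for i in page]}, "nextCursorMark": next_mark}

    return fetch

all_ids = [doc["id"] for doc in iterate_with_cursor(make_fake_fetch(["a", "b", "c", "d", "e"]))]
```

The point of the design is that the server never has to skip over `start` rows, so the cost of page 10,000 is about the same as page 1.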
Geospatial improvements
• Index shapes other than points (circles, polygons, etc)
• More complex interactions than point in a circle
• Indexing:
– "geo":"43.17614,-90.57341"
– "geo":"Circle(4.56,1.23 d=0.0710)"
– "geo":"POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))"
• Searching:
– fq=geo:"Intersects(-74.093 41.042 -69.347 44.558)"
– fq=geo:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30)))"
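The indexing and searching examples above can be assembled programmatically. This Python sketch only builds the JSON update body and the URL-encoded query string; it sends nothing. The `id` values and the helper name `intersects_filter` are hypothetical, and the `geo` field is assumed to be configured as a spatial field type that accepts shapes.

```python
import json
from urllib.parse import urlencode, parse_qs

# Documents carrying a point, a circle, and a polygon in the same "geo" field.
docs = [
    {"id": "point-1", "geo": "43.17614,-90.57341"},
    {"id": "circle-1", "geo": "Circle(4.56,1.23 d=0.0710)"},
    {"id": "poly-1", "geo": "POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))"},
]
update_body = json.dumps(docs)  # body for a JSON update request

def intersects_filter(field, shape):
    """Build a filter query matching documents whose shape intersects `shape`."""
    return f'{field}:"Intersects({shape})"'

# urlencode handles the quotes, spaces, and parentheses in the filter query.
params = urlencode({
    "q": "*:*",
    "fq": intersects_filter("geo", "-74.093 41.042 -69.347 44.558"),
})
```

Encoding the `fq` value rather than pasting it into a URL by hand avoids the quoting mistakes that shape strings invite.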
Scaling Solr
• Distributed/sharded indexing & search
– Auto distributes updates and queries to appropriate shards
– Near Real Time (NRT) indexing capable
– Document routing extensions
• Dynamically scalable
– New SolrCloud instances add indexing and query capacity
– Supports re-balancing (shard-splitting)
• Reliable
– No single point of failure
– Transactions logged
– Robust, automatic recovery
• http://wiki.apache.org/solr/SolrCloud
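As a rough illustration of how updates can be distributed to "appropriate shards", the Python sketch below hashes a routing key to a shard index. This is a deliberate simplification and not SolrCloud's router, which uses a different hash function and per-shard hash ranges; the function name, the MD5 choice, and the `tenantA!doc42` id layout are assumptions for the example.

```python
import hashlib

def route_to_shard(doc_id, num_shards):
    """Pick a shard index from a stable hash of the document's routing key."""
    # With an id like "tenantA!doc42", only the prefix before "!" is hashed,
    # so all documents sharing that prefix land on the same shard.
    routing_key = doc_id.split("!", 1)[0]
    digest = hashlib.md5(routing_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards

shard_a1 = route_to_shard("tenantA!doc1", 4)
shard_a2 = route_to_shard("tenantA!doc2", 4)
```

Because the hash is stable, the same rule can serve at query time: a query restricted to one routing key only needs to visit that key's shard.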
Solr as NoSQL
• Non-traditional data stores
• Not designed for SQL type queries
• Distributed fault tolerant architecture
• Document oriented, data format agnostic (JSON, XML, CSV, binary)
Go Deep!
APIs
• New APIs for Schema and Solr Config
– XML becoming more of an implementation detail
• Managed Schema mode
• Data-driven schema (aka schemaless)
• Synonyms, stopwords, request handlers
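A sketch of what "XML becoming an implementation detail" can look like in practice: the schema change is expressed as a JSON payload instead of an edit to schema.xml. The Python below only builds the request body. It assumes the `add-field` command form of the Schema API, whose payload shape depends on the Solr version, and the host, collection name, and field name are hypothetical.

```python
import json

def add_field_command(name, field_type, stored=True, indexed=True):
    """Build a JSON body asking the Schema API to add one field."""
    return json.dumps({
        "add-field": {
            "name": name,
            "type": field_type,
            "stored": stored,
            "indexed": indexed,
        }
    })

# Hypothetical endpoint; adjust host and collection to your installation.
schema_url = "http://localhost:8983/solr/collection1/schema"
body = add_field_command("geo_label", "text_general")
```

The body would be sent as an HTTP POST with a JSON content type to a collection running in managed schema mode.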
Beyond Solr: LucidWorks Open Source
• Effortless AWS deployment and monitoring:
http://www.github.com/lucidworks/solr-scale-tk
• Logstash for Solr: https://github.com/LucidWorks/solrlogmanager
• Banana (Kibana for Solr): https://github.com/LucidWorks/banana
• Data Quality Toolkit: https://github.com/LucidWorks/data-quality
• Coming Soon for Big Data: Hadoop, Pig, Hive 2-way support w/
Lucene and Solr, different file formats, pipelines, Logstash
Summary
• Lucene/Solr 4.x:
– Faster
– More Flexible
– Easier scaling than ever
– More reliable than ever
• Go forth and rank!
Resources
• Me
– grant@lucidworks.com
– @gsingers on Twitter
• LucidWorks
– http://www.lucidworks.com
– http://www.lucidworks.com/support-services/ask-the-experts/

What's New in Lucene/Solr, presented by Grant Ingersoll at SolrExchange DC
