What's New in Solr 3.x / 4.0


Published in: Technology

  1. 1. What’s New in Solr 3.x/4.0 Charlottesville Lucene/Solr Meetup August 15, 2011 Erik Hatcher Lucid Imagination
  2. 2. What is Solr?• Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites.
  3. 3. What is Lucene?• Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
  4. 4. Solr History• November 2009: Solr 1.4 (Lucene 2.9.1)• June 2010: Solr 1.4.1 (Lucene 2.9.3)• 2011 • March - Solr 3.1 • May - Solr 3.2 • July - Solr 3.3
  5. 5. Solr 3.1• Improved geospatial support • New autosuggest component• Sorting by function queries • Distributed support for more components• Range faceting on all numeric fields • JSON document indexing and CSV response format• Example Velocity-driven search UI at http://localhost:8983/solr/browse • Apache UIMA integration for metadata extraction• A new termvector-based highlighter• Improved spellchecking capabilities • Many other bugfixes, improvements and optimizations• Improved integration with Apache Lucene
  6. 6. Major components• Apache Lucene 3.1.0• Apache Tika 0.8• Carrot2 3.4.2• Velocity 1.6.1 and Velocity Tools 2.0-beta3• Apache UIMA 2.3.1-SNAPSHOT
  7. 7. Schema / Config• SOLR-1131: FieldTypes can now output multiple Fields per Type and still be searched. This can be handy for hiding the details of a particular implementation such as in the spatial case.• SOLR-1379: Add RAMDirectoryFactory for non-persistent in-memory index storage.• SOLR-2059: Add "types" attribute to WordDelimiterFilterFactory, which allows you to customize how WordDelimiterFilter tokenizes text with a configuration file.
  8. 8. Indexing• SOLR-945: JSON update handler that accepts add, delete, commit commands in JSON format.
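
As a rough illustration of the JSON update format, the sketch below builds one message that carries add, delete, and commit commands. The document fields and ids are hypothetical, and the target URL (commonly `/update/json` in the example configuration) is an assumption that depends on how the handler is registered in solrconfig.xml.

```python
import json

# Hypothetical document; the field names are illustrative only.
doc = {"id": "doc-1", "title": "What's New in Solr"}

# A single JSON message carrying the three commands named on the slide.
update_message = {
    "add": {"doc": doc},
    "delete": {"id": "doc-0"},
    "commit": {},
}

# Serialized body that would be POSTed with Content-Type: application/json.
body = json.dumps(update_message)
print(body)
```

The body is only constructed here, not sent, so the sketch runs without a Solr server.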
  9. 9. Geospatial• SOLR-1302: Added several new distance based functions, including Great Circle (haversine), Manhattan, Euclidean and String (using the StringDistance methods in the Lucene spellchecker). Also added geohash(), deg() and rad() convenience functions. See http://wiki.apache.org/solr/FunctionQuery• SOLR-1568: Added "native" filtering support for PointType, GeohashField. Added LatLonType with filtering support too. See http://wiki.apache.org/solr/SpatialSearch and the example. Refactored some items in Lucene spatial. Removed SpatialTileField as the underlying CartesianTier is broken beyond repair and is going to be moved.
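
To make the "Great Circle (haversine)" distance concrete, here is a minimal standalone sketch of the haversine formula. It is not Solr's implementation; the function name and the Earth radius constant are assumptions chosen for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius for this sketch

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# A quarter of the way around the equator.
quarter_equator = haversine_km(0.0, 0.0, 0.0, 90.0)
print(round(quarter_equator, 1))
```

Manhattan and Euclidean distances, by contrast, treat the coordinates as points in a flat vector space, so they ignore the curvature of the Earth.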
  10. 10. Query Parsing• SOLR-1553: New dismax parser implementation (accessible as "edismax") that supports full lucene syntax, improved reserved char escaping, fielded queries, improved proximity boosting, and improved stopword handling. Note: status is experimental for now.• SOLR-2015: Add a boolean attribute autoGeneratePhraseQueries to TextField. autoGeneratePhraseQueries="true" (the default) causes the query parser to generate phrase queries if multiple tokens are generated from a single non-quoted analysis string. For example WordDelimiterFilter splitting text:pdp-11 will cause the parser to generate text:"pdp 11" rather than (text:PDP OR text:11). Note that autoGeneratePhraseQueries="true" tends to not work well for non-whitespace-delimited languages.• SOLR-2128: Full parameter substitution for function queries. Example: q=add($v1,$v2)&v1=mul(popularity,5)&v2=20.0• SOLR-2133: Function query parser can now parse multiple comma separated value sources. It also now fails if there is extra unexpected text after parsing the functions, instead of silently ignoring it. This allows expressions like q=dist(2,vector(1,2),$pt)&pt=3,4
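
The parameter substitution in SOLR-2128 can be pictured as replacing each `$name` reference with the value of the request parameter of that name. The sketch below mimics that expansion locally with a regular expression; the `expand` helper is hypothetical and is not how Solr implements the feature.

```python
import re
from urllib.parse import urlencode

# Request parameters taken from the SOLR-2128 example on the slide.
params = {"q": "add($v1,$v2)", "v1": "mul(popularity,5)", "v2": "20.0"}

def expand(expr, params):
    """Recursively replace $name references with the matching parameter value."""
    return re.sub(r"\$(\w+)", lambda m: expand(params[m.group(1)], params), expr)

expanded = expand(params["q"], params)
query_string = urlencode(params)  # URL-encoded form of the same request

print(expanded)
print(query_string)
```

Keeping the sub-expressions in separate parameters means a client can change `v1` or `v2` without rewriting the main query.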
  11. 11. Functions• SOLR-1574: Add many new functions from Java's Math class (e.g. sin, cos)• SOLR-1569: Allow functions to take in literal strings by modifying the FunctionQParser and adding LiteralValueSource• SOLR-1297: Add sort by Function capability
  12. 12. Analysis• SOLR-1923: PhoneticFilterFactory now has support for the Caverphone algorithm.• SOLR-1571: Added Unicode collation support through Lucene's CollationKeyFilter• SOLR-1653: Add PatternReplaceCharFilter• SOLR-1677: Add support for choosing the Lucene Version for Lucene components within Solr.• SOLR-1984: Add HyphenationCompoundWordTokenFilterFactory.• SOLR-2188: Added "maxTokenLength" argument to the factories for ClassicTokenizer, StandardTokenizer, and UAX29URLEmailTokenizer.• ICU integration
  13. 13. Analysis (cont.)• SOLR-1857: Synced Solr analysis with Lucene 3.1. Added KeywordMarkerFilterFactory and StemmerOverrideFilterFactory, which can be used to tune stemming algorithms.• Added factories for Bulgarian, Czech, Hindi, Turkish, and Wikipedia analysis. Improved the performance of SnowballPorterFilterFactory.• SOLR-1657: Converted remaining TokenStreams to the Attributes-based API. All Solr TokenFilters now support custom Attributes, and some have improved performance: especially WordDelimiterFilter and CommonGramsFilter.• SOLR-1740: ShingleFilterFactory supports the "minShingleSize" and "tokenSeparator" parameters for controlling the minimum shingle size produced by the filter, and the separator string that it uses, respectively.• SOLR-744: ShingleFilterFactory supports the "outputUnigramsIfNoShingles" parameter, to output unigrams if the number of input tokens is fewer than minShingleSize, and no shingles can be generated.• SOLR-1974: Add LimitTokenCountFilterFactory.• SOLR-1057: Add PathHierarchyTokenizerFactory.
  14. 14. Faceting• SOLR-1240: "Range Faceting" has been added. This is a generalization of the existing "Date Faceting" logic so that it now supports all stock numeric field types that support range queries in addition to dates. facet.date is now deprecated in favor of this generalized mechanism.• SOLR-397: Date Faceting now supports a "facet.date.include" param for specifying when the upper & lower end points of computed date ranges should be included in the range. Legal values are: "all", "lower", "upper", "edge", and "outer". For backwards compatibility the default value is the set: [lower,upper,edge], so that all ranges between start and end are inclusive of their endpoints, but the "before" and "after" ranges are not.• SOLR-2325: Allow tagging and exclusion of main query for faceting.
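
A range facet request is described by a field plus start, end, and gap values. The sketch below assembles such a request and computes the lower bound of each resulting bucket; the field name `price` and the helper names are hypothetical, and the bucket arithmetic is a local illustration for integer ranges rather than Solr's own code.

```python
from urllib.parse import urlencode

def range_facet_params(field, start, end, gap):
    """Request parameters for a numeric range facet on one field."""
    return {
        "q": "*:*",
        "facet": "true",
        "facet.range": field,
        "facet.range.start": start,
        "facet.range.end": end,
        "facet.range.gap": gap,
    }

def bucket_starts(start, end, gap):
    """Lower bound of each bucket between start and end (integers only)."""
    return list(range(start, end, gap))

params = range_facet_params("price", 0, 100, 25)  # 'price' is an assumed field name
query_string = urlencode(params)
buckets = bucket_starts(0, 100, 25)

print(query_string)
print(buckets)
```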
  15. 15. SolrJ• SOLR-1139: Add TermsComponent Query and Response Support in SolrJ• SOLR-1815: SolrJ now preserves the order of facet queries.
  16. 16. Solr Components• SOLR-1316: Create autosuggest component• SOLR-2010: Added ability to verify that spell checking collations have actual results in the index.• SOLR-2157: Suggester should return alpha-sorted results when onlyMorePopular=false• SOLR-1625: Add regexp support for TermsComponent• SOLR-1556: TermVectorComponent now supports per field overrides. Also, it now throws an error if passed in fields do not exist, and emits warnings for fields whose term vector options (termVectors, offsets, positions) do not align with the schema declaration.• SOLR-860: Add debug output for MoreLikeThis.
  17. 17. Highlighting• SOLR-1268: Incorporate FastVectorHighlighter• SOLR-2021: Add SolrEncoder plugin to Highlighter.• SOLR-2030: Make FastVectorHighlighter use SolrEncoder.• SOLR-2053: Add support for custom comparators in Solr spellchecker, per LUCENE-2479• SOLR-2049: Add hl.multiValuedSeparatorChar for FastVectorHighlighter, per LUCENE-2603.
  18. 18. Distributed• SOLR-785: Distributed Search support for SpellCheckComponent• SOLR-1177: Distributed Search support for TermsComponent
  19. 19. Misc.• SOLR-1957: The VelocityResponseWriter contrib moved to core. Example search UI now available at http://localhost:8983/solr/browse• SOLR-1966: QueryElevationComponent can now return just the included results in the elevation file• SOLR-1925: Add CSVResponseWriter (use wt=csv) that returns the list of documents in CSV format.• SOLR-2263: Add ability for RawResponseWriter to stream binary files as well as text files.• SOLR-1750: SolrInfoMBeanHandler added for simpler programmatic access to info currently available from registry.jsp and stats.jsp• SOLR-2099: Add ability to throttle rsync-based replication using rsync option --bwlimit.
  20. 20. UIMA• UIMA - Unstructured Information Management Architecture - http://uima.apache.org/• Enables UIMA components to augment documents• Entity extraction, automated categorization, language detection, etc.• "contrib" plugin - SOLR-2129• http://wiki.apache.org/solr/SolrUIMA
  21. 21. Optimizations• SOLR-1679: Don't build up string messages in SolrCore.execute unless they are necessary for the current log level.• SOLR-1874: Optimize PatternReplaceFilter for better performance.• SOLR-1968: Speed up initial filter cache population for facet.method=enum and also big terms for multi-valued facet.method=fc. The resulting speedup for the first facet request is anywhere from 30% to 32x, depending on how many terms are in the field and how many documents match per term.• SOLR-2089: Speed up UnInvertedField faceting (facet.method=fc for multi-valued fields) when facet.limit is both high, and a high enough percentage of the number of unique terms in the field. Extreme cases yield speedups over 3x.• SOLR-2046: Add common functions to scripts-util.
  22. 22. Solr 3.2• Ability to specify overwrite and commitWithin as request parameters when using the JSON update format• TermQParserPlugin, useful when generating filter queries from terms returned from field faceting or the terms component.• DebugComponent now supports using a NamedList to model Explanation objects in its responses instead of Explanation.toString• Improvements to the UIMA and Carrot2 integrations• Bugfixes and improvements from Apache Lucene 3.2
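
As a hedged sketch of how the term query parser is typically used, the snippet below turns facet values into filter queries with the `{!term f=...}` local-params syntax, so the raw term needs no query-syntax escaping. The field name `cat`, the sample terms, and the `term_filter` helper are hypothetical; the update parameters simply reuse the `overwrite` and `commitWithin` names from the slide with made-up values.

```python
from urllib.parse import urlencode

def term_filter(field, value):
    """Build a filter query that matches one raw indexed term in the given field."""
    return "{!term f=%s}%s" % (field, value)

# Values as they might come back from field faceting (illustrative only).
facet_terms = ["electronics", "hard drive", "memory+storage"]
filters = [term_filter("cat", term) for term in facet_terms]

# Request parameters for a JSON update; the values are assumptions.
update_params = urlencode({"commitWithin": "1000", "overwrite": "false"})

print(filters)
print(update_params)
```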
  23. 23. Other 3.2 goodies• SOLR-2061: Pull base tests out into a new Solr Test Framework module, and publish binary, javadoc, and source test-framework jars.• Dependency update: Carrot2 3.5.0
  24. 24. Solr 3.3• Grouping / Field Collapsing• A new, automaton-based suggest/autocomplete implementation offering an order of magnitude smaller RAM consumption.• KStemFilterFactory, an optimized implementation of a less aggressive stemmer for English.• Solr defaults to a new, more efficient merge policy (TieredMergePolicy). See http://s.apache.org/merging for more information.• Important bugfixes, including extremely high RAM usage in spellchecking.• Bugfixes and improvements from Apache Lucene 3.3
  25. 25. Solr 3.3 details• SOLR-2378: A new, automaton-based implementation of the suggest (autocomplete) component, offering an order of magnitude smaller memory consumption compared to ternary trees and jaspell, and very fast lookups at runtime.• SOLR-2400: Field- and DocumentAnalysisRequestHandler now provide a position history for each token, so you can follow the token through all analysis stages. The output contains a separate int[] attribute containing all positions from previous Tokenizers/TokenFilters (called "positionHistory").• SOLR-2524: (SOLR-236, SOLR-237, SOLR-1773, SOLR-1311) Grouping / Field collapsing using the Lucene grouping contrib. The search result can be grouped by field and query.• SOLR-1331: Added a srcCore parameter to CoreAdminHandler's mergeindexes action to merge one or more cores' indexes to a target core.• SOLR-2610: Add an option to delete the index through the CoreAdmin UNLOAD action
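
Grouping by field can be pictured as collapsing a ranked result list so that each distinct field value keeps only its top few documents. The sketch below does that locally; the documents, the `manu` field, and the `group_by_field` helper are hypothetical, and the parameter dictionary shows the request form under the assumption that the `group`, `group.field`, and `group.limit` parameters are used.

```python
from collections import OrderedDict

def group_by_field(docs, field, limit=1):
    """Collapse a relevance-sorted list of docs, keeping at most `limit` docs per field value."""
    groups = OrderedDict()
    for doc in docs:
        bucket = groups.setdefault(doc[field], [])
        if len(bucket) < limit:
            bucket.append(doc)
    return groups

# Illustrative documents, assumed to be already sorted by relevance.
docs = [
    {"id": "1", "manu": "acme"},
    {"id": "2", "manu": "acme"},
    {"id": "3", "manu": "globex"},
]
grouped = group_by_field(docs, "manu")

# Assumed request parameters for the equivalent grouped search.
group_params = {"q": "*:*", "group": "true", "group.field": "manu", "group.limit": "1"}

print({key: [d["id"] for d in value] for key, value in grouped.items()})
```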
  26. 26. Solr 4.0• aka "trunk" at the moment• major changes! (for the better!) at both Lucene and Solr levels
  27. 27. Lucene 4.0• The postings APIs have been removed in favor of the new flexible indexing (flex) APIs.• With flexible indexing it is now possible for an application to create its own postings codec, to alter how fields, terms, docs and positions are encoded into the index.• String -> BytesRef• Per-segment everything
  28. 28. 4.0 details• Directory.copy/Directory.copyTo now copies all files (not just index files), since what is and isn't an index file is now dependent on the codecs used.• String to BytesRef• FuzzyQuery and WildcardQuery now operate on Unicode code points, not Unicode code units.• WildcardQuery and QueryParser now allow escaping with the '\' character.• Similarity can now be configured on a per-field basis
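
The difference between code points and code units matters for characters outside the Basic Multilingual Plane, which take two UTF-16 code units but are a single code point. The sketch below shows the two counts for one such character and uses Python's `re` module (which matches per code point) as a stand-in for a single-character wildcard; it illustrates the concept only and does not involve Lucene.

```python
import re

# 'a', MUSICAL SYMBOL G CLEF (U+1D11E, outside the BMP), 'b'
term = "a\U0001D11Eb"

code_points = len(term)                            # Python strings count code points
utf16_units = len(term.encode("utf-16-le")) // 2   # two bytes per UTF-16 code unit

# A one-character wildcard between 'a' and 'b', evaluated per code point.
single_char_match = re.fullmatch(r"a.b", term) is not None

print(code_points, utf16_units, single_char_match)
```

A matcher that counted UTF-16 code units instead would see two "characters" between `a` and `b`, so a single-character wildcard would not cover the clef.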
  29. 29. Relevancy• more flexible scoring
  30. 30. NRT• per-segment• IndexWriter#commit now doesn't block concurrent indexing while flushing all currently RAM resident documents to disk.
  31. 31. More Lucene 4.0 features• Added RegexpQuery support to QueryParser.• Adds AutomatonQuery, a MultiTermQuery that matches terms against a finite-state machine. Implement WildcardQuery and FuzzyQuery with finite-state methods. Adds RegexpQuery.• The QueryParser now accepts mixed inclusive and exclusive bounds for range queries. Example: "{3 TO 5]"
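
To spell out what a mixed-bound range such as `{3 TO 5]` means, the sketch below parses the bracket style and checks membership: a curly brace excludes the endpoint and a square bracket includes it. The regular expression and helper names are hypothetical and only handle simple numeric bounds; this is not the Lucene QueryParser.

```python
import re

# Opening bracket, lower bound, TO, upper bound, closing bracket.
RANGE_RE = re.compile(r"^([\[{])\s*(\S+)\s+TO\s+(\S+)\s*([\]}])$")

def parse_range(text):
    """Return (low, low_inclusive, high, high_inclusive) for a numeric range expression."""
    match = RANGE_RE.match(text)
    if match is None:
        raise ValueError("not a range expression: %r" % text)
    open_bracket, low, high, close_bracket = match.groups()
    return float(low), open_bracket == "[", float(high), close_bracket == "]"

def in_range(value, text):
    """True if value falls inside the range, honoring inclusive/exclusive bounds."""
    low, low_inclusive, high, high_inclusive = parse_range(text)
    above = value >= low if low_inclusive else value > low
    below = value <= high if high_inclusive else value < high
    return above and below

print([n for n in range(2, 7) if in_range(n, "{3 TO 5]")])
```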
  32. 32. Solr 4.0• Pivot faceting• Direct Solr spell checker• Increased response writing flexibility (e.g. function query results)• Distributed date/numeric range faceting• "join" query parser• NRT: You may now specify a soft commit when committing. This will use Lucene's NRT feature to avoid guaranteeing documents are on stable storage in exchange for faster reopen times. There is also a new soft autocommit tracker that can be configured.
  33. 33. About Lucid...• Lucid Imagination provides commercial-grade support, training, high-level consulting and value-added software for Lucene and Solr.• We make Lucene ‘enterprise-ready’ by offering: • Free, certified, distributions and downloads. • Support, training, and consulting. • LucidWorks Enterprise, a commercial search platform built on top of Solr.• http://www.lucidimagination.com
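
The following sketch shows how three of these features might appear as request parameters: a pivot facet, a join query, and a soft commit. Because 4.0 was still trunk at the time of the talk, treat the parameter names (`facet.pivot`, the `{!join from=... to=...}` local-params form, `softCommit`) as assumptions that may differ between builds; all field names are hypothetical.

```python
from urllib.parse import urlencode, parse_qs

# Field names (cat, popularity, manu_id_s, id) are hypothetical.
pivot_params = {"q": "*:*", "facet": "true", "facet.pivot": "cat,popularity"}
join_params = {"q": "{!join from=manu_id_s to=id}ipod"}
soft_commit_params = {"softCommit": "true"}

pivot_qs = urlencode(pivot_params)
join_qs = urlencode(join_params)
soft_commit_qs = urlencode(soft_commit_params)

print(pivot_qs)
print(join_qs)
print(soft_commit_qs)
```

The strings are only built here, so the sketch runs without a Solr instance.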
  34. 34. Lucid Offerings
  35. 35. LucidFind• http://www.lucidimagination.com/search/?q=charlottesville