What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchange DC

  1. What’s new in Lucene and Solr? Grant Ingersoll, CTO, LucidWorks, Lucene/Solr Committer
  2. Sink or Swim?
  3. Search is good for…
     • Traditional: Fast, fuzzy text matching across a large document collection
     • De-normalized data
       – “light” relational
     • Top N problems
       – Key-value (top 1)
       – Recommendations, “Good enough” classification, clustering
     • Faceting, slicing and dicing of numerical/enumerated data
     • Spatial, spell checking, record linkage, highlighting
     • NoSQL
  4. What’s New?
     • Community
     • Lucene
     • Solr
  5. Relax, You’re Among Friends
     • Large, diverse search community with many non-traditional uses of search engines
       – Object stores, record linkage, social, mobile -> web
     • “The Apache Way”
       – Meritocracy
       – Those who do, decide!
     • Always Be Testing
       – Randomized system tests are all the rage
       – http://vimeo.com/32087114
     • Patches Welcome!
  6. Acceleration!
  7. Coming Soon: Lucene and Solr 4.8 (Java 1.7)
  8. Lucene: Speed and Memory
     • Native Near Real Time (NRT) support
       – Per segment
       – FieldCache can be controlled to only load new segments
       – Soft commit: faster because it skips fsync, so updates become visible sooner
     • DWPT (DocumentsWriterPerThread)
       – Faster, more consistent indexing speed
     • Faster fuzzy & wildcard query processing
     • Automatic compression of stored fields and term vectors
     • String -> BytesRef
       – Much improved data structure
       – … means less memory and less garbage collection effort
  9. Lucene: Flexibility
     • Flexible Index Formats
       – New posting list codecs: Block, Simple Text, HDFS, etc.
       – Pulsing codec: improves performance of primary key searches by inlining docs, positions, and payloads, which saves disk seeks
     • Pluggable Scoring
       – Decoupled from TF/IDF
       – Built-in alternatives include BM25, DFR, and others
         • http://en.wikipedia.org/wiki/Okapi_BM25
         • http://terrier.org/docs/v3.5/dfr_description.html
       – Add your own
  10. FS(A|T)
      • Keys:
        – byte[]
        – write-once
        – Linear time build of minimal automata
        – Compression, reverse lookups
        – Weights (used for auto-suggest)
        – Pluggable algebra
      • Uses:
        – Term Dictionary, TokenStreams, Japanese, synonyms, spelling, others
        – FuzzyQuery is 100x faster: http://bit.ly/hgO65c
      • More:
        – http://slidesha.re/vKtpVA, http://bit.ly/Pkjyu0
        – “Smaller Representation of Finite State Automata”
          • Proc. of the 16th Inter. Conf. on Implementation and Application of Automata, CIAA'2011, vol. 6807, 2011, pp. 118–192.
  11. Grab Bag
      • Lots of new suggesters
        – Available in Solr
      • Doc Values
        – Column-oriented store
        – Numeric and binary variants are updatable (coming to Solr soon)
      • Overhauled term vectors APIs
        – Now look a lot like Terms
  12. Solr 4: New Features
      • Search/Faceting/Relevance
        – New Relevance Function Queries (tf, df, others)
        – Pivot Faceting
        – Pseudo-join
        – Improved Spatial (more later)
        – Full support for Lucene Codecs, pluggable scoring
      • Indexing
        – New Update Processors, including scripting option
        – Near real time
      • Schema and Config APIs + Schemaless
      • Cursors (aka Deep Paging)
      • Admin UI
  13. Geospatial improvements
      • Index shapes other than points (circles, polygons, etc.)
      • More complex interactions than point in a circle
      • Indexing:
        – "geo":"43.17614,-90.57341"
        – "geo":"Circle(4.56,1.23 d=0.0710)"
        – "geo":"POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))"
      • Searching:
        – fq=geo:"Intersects(-74.093 41.042 -69.347 44.558)"
        – fq=geo:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30)))"
  14. Scaling Solr
      • Distributed/sharded indexing & search
        – Automatically distributes updates and queries to the appropriate shards
        – Near Real Time (NRT) indexing capable
        – Document routing extensions
      • Dynamically scalable
        – New SolrCloud instances add indexing and query capacity
        – Supports re-balancing (shard-splitting)
      • Reliable
        – No single point of failure
        – Transactions logged
        – Robust, automatic recovery
      • http://wiki.apache.org/solr/SolrCloud
  15. Solr as NoSQL
      • Non-traditional data stores
      • Not designed for SQL-type queries
      • Distributed, fault-tolerant architecture
      • Document oriented, data format agnostic (JSON, XML, CSV, binary)
  16. Go Deep!
  17. APIs
      • New APIs for Schema and Solr Config
        – XML becoming more of an implementation detail
      • Managed Schema mode
      • Data-driven schema (aka schemaless)
      • Synonyms, stopwords, request handlers
  18. Beyond Solr: LucidWorks Open Source
      • Effortless AWS deployment and monitoring: http://www.github.com/lucidworks/solr-scale-tk
      • Logstash for Solr: https://github.com/LucidWorks/solrlogmanager
      • Banana (Kibana for Solr): https://github.com/LucidWorks/banana
      • Data Quality Toolkit: https://github.com/LucidWorks/data-quality
      • Coming Soon for Big Data: Hadoop, Pig, Hive 2-way support w/ Lucene and Solr, different file formats, pipelines, Logstash
  19. Summary
      • Lucene/Solr 4.x:
        – Faster
        – More flexible
        – Easier scaling than ever
        – More reliable than ever
      • Go forth and rank!
  20. Resources
      • Me
        – grant@lucidworks.com
        – @gsingers on Twitter
      • LucidWorks
        – http://www.lucidworks.com
        – http://www.lucidworks.com/support-services/ask-the-experts/
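
Slide 9 lists BM25 as one of the built-in alternatives to TF/IDF scoring. As a rough illustration of what such a scoring function computes, here is a minimal Python sketch of the textbook Okapi BM25 formula. It is not Lucene's implementation: the function name `bm25_scores`, the toy documents, and the defaults `k1=1.2` and `b=0.75` are assumptions made for this example.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`.

    Textbook Okapi BM25; `bm25_scores` is a hypothetical helper for
    illustration, not a Lucene or Solr API.
    """
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            freq = tf[term]
            if freq == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalization.
            norm = freq * (k1 + 1) / (freq + k1 * (1 - b + b * len(d) / avg_len))
            score += idf * norm
        scores.append(score)
    return scores


docs = [["solr", "search"], ["lucene", "search", "search"], ["hadoop"]]
scores = bm25_scores(["search"], docs)
```

In this toy collection the third document does not contain the query term and scores zero, while the second document ranks above the first because the extra occurrence of "search" outweighs its length penalty.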

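Slide 12 mentions cursors (aka deep paging). The sketch below shows the general client-side loop, assuming Solr's `cursorMark` request parameter and the `nextCursorMark` value returned in the response: start with `*`, sort on a unique key, and stop once the cursor no longer advances. The helper names `iterate_with_cursor` and `make_fake_solr` are hypothetical, and the fake backend stands in for a real HTTP call so the example runs on its own. Real cursor marks are opaque strings, not document ids.

```python
def iterate_with_cursor(fetch_page, rows=100):
    """Yield every document by following cursor marks page by page.

    `fetch_page` is any callable that takes a dict of query parameters
    and returns a Solr-style response dict (an assumption of this sketch).
    """
    cursor = "*"  # "*" asks for the first page
    while True:
        response = fetch_page({
            "q": "*:*",
            "sort": "id asc",  # the sort must include the unique key field
            "rows": rows,
            "cursorMark": cursor,
        })
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:  # cursor did not advance: no more results
            break
        cursor = next_cursor


def make_fake_solr(doc_ids):
    """Hypothetical in-memory stand-in for a Solr endpoint (illustration only)."""
    ordered = sorted(doc_ids)

    def fetch_page(params):
        mark = params["cursorMark"]
        remaining = ordered if mark == "*" else [d for d in ordered if d > mark]
        page = remaining[: params["rows"]]
        # Here the mark is simply the last id seen; real marks are opaque.
        next_mark = page[-1] if page else mark
        return {
            "response": {"docs": [{"id": d} for d in page]},
            "nextCursorMark": next_mark,
        }

    return fetch_page


fetch = make_fake_solr([f"doc{i:03d}" for i in range(25)])
ids = [doc["id"] for doc in iterate_with_cursor(fetch, rows=10)]
```

Unlike paging with a growing `start` offset, each request here only carries the position of the last document seen, so the cost per page does not grow with depth.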