Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC

1,087 views

Published on

Published in: Technology
  • Be the first to comment

What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC

  1. 1. What’s new in Lucene and Solr? Grant Ingersoll CTO, LucidWorks Lucene/Solr Committer
  2. 2. Sink or Swim?
  3. 3. Search is good for… • Traditional: Fast, fuzzy text matching across a large document collection • De-normalized data – “light” relational • Top N problems – Key-value (top 1) – Recommendations, “Good enough” classification, clustering • Faceting, slicing and dicing of numerical/enumerated data • Spatial, spell checking, record linkage, highlighting • NoSQL
  4. 4. What’s New? • Community • Lucene • Solr
  5. 5. Relax, You’re Among Friends • Large, diverse search community with many non-traditional search engine usages – Object stores, Record linkage, Social, mobile -> web • “The Apache Way” – Meritocracy – Those who do, decide! • Always Be Testing – Randomized system tests are all the rage – http://vimeo.com/32087114 • Patches Welcome!
  6. 6. Acceleration!
  7. 7. Coming Soon: Lucene and Solr 4.8 Java 1.7
  8. 8. Lucene: Speed and Memory • Native Near Real Time (NRT) support – Per segment – FieldCache can be controlled to only load new segments – Soft commit -- faster without fsync, allows quicker update visibility • DWPT (Document Writer per Thread) – Faster more consistent index speed • Faster fuzzy & wildcard query processing • Automatic compression of stored fields and term vectors • String -> BytesRef – Much improved data structure – … means less memory and less garbage collection effort
  9. 9. Lucene: Flexibility • Flexible Index Formats – New posting list codecs: Block, Simple Text, HDFS, etc. – Pulsing codec: improves performance of primary key searches, inlining docs, positions, and payloads, saves disk seeks • Pluggable Scoring – Decoupled from TF/IDF – Built in alternatives include BM25 & DFR, and others • http://en.wikipedia.org/wiki/Okapi_BM25 • http://terrier.org/docs/v3.5/dfr_description.html – Add your own
  10. 10. FS(A|T) • Keys: – byte[] – write-once – Linear time build of min. automata – Compression, Reverse lookups – Weights (used for auto-suggest) – Pluggable Algebra • Uses: – Term Dictionary, TokenStreams, Japanese, synonyms, spelling, others – FuzzyQuery is 100x faster -- http://bit.ly/hgO65c • More: – http://slidesha.re/vKtpVA, http://bit.ly/Pkjyu0 – “Smaller Representation of Finite State Automata” • Proc. of the 16th Inter. Conf. on Implementation and Application of Automata, CIAA'2011, vol. 6807, 2011, pp. 118—192.
  11. 11. Grab Bag • Lots of new suggesters – Available in Solr • Doc Values – Column oriented store – Numeric and binary variants are updatable (coming to Solr soon) • Overhauled term vectors APIs – Now look a lot like Terms
  12. 12. Solr 4: New Features • Search/Faceting/Relevance – New Relevance Function Queries (tf, df, others) – Pivot Faceting – Pseudo-join – Improved Spatial (more later) – Full support for Lucene Codecs, pluggable scoring • Indexing – New Update Processors, including scripting option – Near real time • Schema and Config APIs + Schemaless • Cursors (aka Deep Paging) • Admin UI
  13. 13. Geospatial improvements • Index shapes other than points (circles, polygons, etc) • More complex interactions than point in a circle • Indexing: – "geo”:”43.17614,-90.57341” – “geo”:”Circle(4.56,1.23 d=0.0710)” – “geo”:”POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))” • Searching: – fq=geo:"Intersects(-74.093 41.042 -69.347 44.558)" – fq=geo:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30)))”
  14. 14. Scaling Solr • Distributed/sharded indexing & search – Auto distributes updates and queries to appropriate shards – Near Real Time (NRT) indexing capable – Document routing extensions • Dynamically scalable – New SolrCloud instances add indexing and query capacity – Supports re-balancing (shard-splitting) • Reliable – No single point of failure – Transactions logged – Robust, automatic recover • http://wiki.apache.org/solr/SolrCloud
  15. 15. Solr as NoSQL • Non-traditional data stores • Not designed for SQL type queries • Distributed fault tolerant architecture • Document oriented, data format agnostic (JSON, XML, CSV, binary)
  16. 16. Go Deep!
  17. 17. APIs • New APIs for Schema and Solr Config – XML becoming more of an implementation detail • Managed Schema mode • Data-driven schema (aka schemaless) • Synonyms, stopwords, request handlers
  18. 18. Beyond Solr: LucidWorks Open Source • Effortless AWS deployment and monitoring: http://www.github.com/lucidworks/solr-scale-tk • Logstash for Solr: https://github.com/LucidWorks/solrlogmanager • Banana (Kibana for Solr): https://github.com/LucidWorks/banana • Data Quality Toolkit: https://github.com/LucidWorks/data-quality • Coming Soon for Big Data: Hadoop, Pig, Hive 2-way support w/ Lucene and Solr, different file formats, pipelines, Logstash
  19. 19. Summary • Lucene/Solr 4.x: – Faster – More Flexible – Easier than ever scaling – More reliable than ever • Go forth and rank!
  20. 20. Resources • Me – grant@lucidworks.com – @gsingers on Twitter • LucidWorks – http://www.lucidworks.com – http://www.lucidworks.com/support-services/ask-the-experts/

×