Solr 4

Solr 4.0 dramatically improves scalability, performance, and flexibility. An overhauled Lucene underneath sports near real-time (NRT) capabilities allowing indexed documents to be rapidly visible and searchable. Lucene’s improvements also include pluggable scoring, much faster fuzzy and wildcard querying, and vastly improved memory usage. These Lucene improvements automatically make Solr much better, and Solr magnifies these advances with “SolrCloud.” SolrCloud enables highly available and fault tolerant clusters for large scale distributed indexing and searching. There are many other changes that will be surveyed as well. This talk will cover these improvements in detail, comparing and contrasting to previous versions of Solr.

Published in: Technology

Transcript

  • 1. Solr 4
      Presented by Erik Hatcher
      © Copyright 2012
  • 2. About: Erik Hatcher
      • “Lucene in Action”, co-author
          - And also “Java Development with Ant”/“Ant in Action” co-author
      • Open Source
          - Apache Software Foundation: member, Lucene/Solr committer and PMC
          - Originator of “Blacklight”, a Solr-powered discovery interface
      • LucidWorks
          - Co-founder
          - Recently renamed from Lucid Imagination
          - Customer Support
      © 2012 LucidWorks
  • 3. Abstract
      Solr 4.0 dramatically improves scalability, performance, and flexibility. An overhauled Lucene underneath sports near real-time (NRT) capabilities allowing indexed documents to be rapidly visible and searchable. Lucene’s improvements also include pluggable scoring, much faster fuzzy and wildcard querying, and vastly improved memory usage. These Lucene improvements automatically make Solr much better, and Solr magnifies these advances with “SolrCloud.” SolrCloud enables highly available and fault tolerant clusters for large scale distributed indexing and searching. There are many other changes that will be surveyed as well. This talk will cover these improvements in detail, comparing and contrasting to previous versions of Solr.
  • 4. Lucene 4 Improvements
      • Flexible index formats
      • Pluggable scoring
      • String -> BytesRef
      • DWPT (DocumentsWriterPerThread)
          - faster, more consistent indexing speed
      • NRT (Near Real-Time)
      • Spatial overhaul
      • FST/FSA
          - FuzzyQuery over 100x faster
          - also reduces memory footprint for Terms index
      • DocValues: aka column-stride fields
  • 5. Flexible index formats
      • For terms, postings lists, stored fields, term vectors, etc.
      • Several new posting list codecs
          - Pulsing (inlines low doc freq)
          - Block (packed int blocks)
          - SimpleText (debugging, transparency)
          - Bloom (experimental, also inlines low doc freq)
          - Appending (for append-only filesystems such as HDFS)
          - Memory (terms as FST)
  • 6. Pluggable scoring
      • Decoupled from traditional vector space (TF/IDF)
      • Additional index statistics
          - number of tokens for a term or field
          - number of postings for a field
          - number of documents with a posting for a field
      • Several built-in alternatives:
          - BM25
          - DFR (divergence from randomness)
          - Information-based models
      • “norms” are no longer limited to a single byte
          - Similarity implementations can use any DocValues type to store norms
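To make the BM25 alternative on the slide above concrete, here is a minimal sketch of the textbook Okapi BM25 score for a single term. It is not Lucene's implementation; the function names are hypothetical, and the `k1=1.2`, `b=0.75` values are commonly used defaults assumed here for illustration. Note how it consumes the kind of index statistics the slide lists (document counts and average field length).

```python
import math


def bm25_idf(doc_count, doc_freq):
    """Inverse document frequency; this smoothed variant never goes negative."""
    return math.log(1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_term_score(tf, doc_len, avg_doc_len, doc_count, doc_freq,
                    k1=1.2, b=0.75):
    """Score one term in one document with the textbook BM25 formula.

    tf          -- occurrences of the term in the document
    doc_len     -- number of tokens in the document's field
    avg_doc_len -- average tokens per document for that field
    doc_count   -- documents in the index
    doc_freq    -- documents containing the term
    """
    # Longer-than-average documents are penalised, controlled by b.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # Term frequency saturates: the fraction approaches (k1 + 1) as tf grows.
    tf_part = (tf * (k1 + 1.0)) / (tf + k1 * length_norm)
    return bm25_idf(doc_count, doc_freq) * tf_part
```

Unlike raw TF/IDF, repeated occurrences of a term give diminishing returns here, which is the main behavioural difference a user would notice after switching similarities.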
  • 7. String -> BytesRef
      • How many bytes does a Java String require?
          - BytesRef is now used to avoid this overhead
          - Think of the internal structure as a big buffer with pointers
      • Garbage collection much more efficient
          - big blocks rather than zillions of small ones
      • How much reduction? 10%? 20%?
          - No. Way more than that
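The "big buffer with pointers" idea from the slide above can be sketched as follows. This is only an illustration of the layout, not Lucene's `BytesRef` class; `TermBuffer` and its methods are hypothetical names. All term bytes live in one growing byte array, and a term is referred to by an (offset, length) pair instead of a separate string object.

```python
class TermBuffer:
    """Append-only pool: every term's UTF-8 bytes live in one bytearray."""

    def __init__(self):
        self._bytes = bytearray()
        self._offsets = [0]  # term i occupies _offsets[i] .. _offsets[i + 1]

    def add(self, term):
        """Append a term and return its integer id."""
        self._bytes.extend(term.encode("utf-8"))
        self._offsets.append(len(self._bytes))
        return len(self._offsets) - 2

    def ref(self, term_id):
        """Return the (offset, length) 'pointer' into the shared buffer."""
        start = self._offsets[term_id]
        return start, self._offsets[term_id + 1] - start

    def text(self, term_id):
        """Decode a term back to a string only when it is actually needed."""
        offset, length = self.ref(term_id)
        return self._bytes[offset:offset + length].decode("utf-8")
```

With this layout the garbage collector sees one large block plus a list of integers, rather than one small object per term, which is the effect the slide describes.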
  • 8. NRT: Near Real-Time
      • Per-segment
          - FieldCache needs to only load from new segments
      • Soft commit
          - Faster: does not fsync
          - Can soft commit very rapidly, as low as every second
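As a small illustration of the soft commit mentioned above, the sketch below only builds the URL for an explicit commit request using Solr's `softCommit=true` and `commit=true` update parameters. The helper name and the host and core in the usage line are assumptions made for the example; nothing is sent over the network.

```python
from urllib.parse import urlencode


def commit_url(base_url, soft=True):
    """Build an /update URL that asks Solr for a soft or a hard commit.

    base_url is the core or collection URL (hypothetical in this sketch).
    """
    params = {"softCommit": "true"} if soft else {"commit": "true"}
    return f"{base_url.rstrip('/')}/update?{urlencode(params)}"


# Hypothetical local install; adjust host, port, and core name as needed.
print(commit_url("http://localhost:8983/solr/collection1"))
```

In practice soft commits are usually scheduled automatically in the Solr configuration rather than requested per update; the explicit form is mostly handy for tests.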
  • 9. Lucene 4: there’s more
      • AutomatonQuery
          - term matching a provided finite-state automaton
      • Term offsets
          - optionally encoded into the postings lists and can be retrieved per-position
      • DirectSpellChecker
          - finds possible corrections directly against the main search index without requiring a separate index
      • DWPT
          - Flushing new segment is now concurrent with indexing
  • 10. Indexing performance (Wikipedia 4KB docs)
      • http://people.apache.org/~mikemccand/lucenebench/indexing.html
  • 11. QPS (primary key lookup)
      • http://people.apache.org/~mikemccand/lucenebench/PKLookup.html
  • 12. FuzzyQuery
      • http://people.apache.org/~mikemccand/lucenebench/Fuzzy2.html
  • 13. Solr 4 Highlights
      • SolrJ streaming response
      • Pivot facets
      • New relevancy function queries
          - termfreq, tf, docfreq, idf, norm, maxdoc, numdocs, exists, if, and, or, xor, not, def, and true and false constants
      • DirectSpellChecker support
      • Improved document response: DocTransformer, function calculations
      • Pseudo-join
      • New admin UI, including SolrCloud cluster visualizations
      • Transaction log
      • Several new update processors, including a “script” one
      • Spatial overhaul
      • Content-type savvy /update handler
      • SolrCloud
  • 14. Per-segment faceting improvement
      • Field-cache, per segment
          - Test index: 10M documents, 18 segments, single valued field
      • facet.method=fcs
      • Result set=100 docs, 100,000 unique terms
          - static index: fc=3 ms, fcs=244 ms
          - quickly changing index: fc=1388 ms, fcs=267 ms
      • Result set=1,000,000 docs, 100 unique terms
          - static index: fc=26 ms, fcs=34 ms
          - quickly changing index: fc=741 ms, fcs=94 ms
      • Data from Yonik’s Lucene Revolution 2011 faceting talk
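The trade-off in the numbers above can be illustrated with a toy model of per-segment faceting. This is a sketch under simplifying assumptions, not Solr's `fcs` code: segments are plain lists of dicts, they are treated as immutable, and the class and parameter names are hypothetical. Field values are cached per segment, so a new segment costs one extra load instead of a full rebuild, at the price of merging counts across segments on every request.

```python
from collections import Counter


class PerSegmentFacetCache:
    """Cache one field's values per segment and merge counts at query time."""

    def __init__(self, field):
        self.field = field
        self._values = {}          # segment name -> list of values, one per doc
        self.segments_loaded = 0   # how many segments had to be read so far

    def _values_for(self, name, docs):
        # Segments never change once written, so a cached entry stays valid.
        if name not in self._values:
            self._values[name] = [doc[self.field] for doc in docs]
            self.segments_loaded += 1
        return self._values[name]

    def facet(self, segments, matches):
        """Count field values over the matching doc indexes of each segment."""
        total = Counter()
        for name, docs in segments.items():
            values = self._values_for(name, docs)
            total.update(values[i] for i in matches.get(name, ()))
        return total
```

On a static index the per-request merge is pure overhead, which fits the slower `fcs` figures there; on a quickly changing index the avoided cache rebuilds dominate.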
  • 15. Solr 3.x scalability
      • Capabilities:
          - Replication
          - Distributed search
      • Limitations:
          - Documents only available after (expensive) “hard” commit, replication, and warming delays
          - Configuration labor intensive, manually maintained and coordinated
          - Manual sharding: no automatic distributed indexing
          - Failure recovery difficult if master goes down
  • 16. SolrCloud: Solr 4’s scalability
      • Sharded leaders and replicas
      • ZooKeeper used for cluster management
      • Distributed indexing
          - Automatically distributes updates to appropriate shard
          - Facilitates Near Real-Time (NRT) searching
      • Distributed search
          - Automatically distributes to nodes of each shard
      • Robust, automatic update recovery
      • Real-time /get
          - Leverages transaction log
      • No single point of failure
      • Large scale NRT using soft commits
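One way to picture "automatically distributes updates to appropriate shard" is hash-range routing on the document id. The sketch below is an assumption-laden illustration: it uses `zlib.crc32` as a stand-in hash and equal-sized ranges, and does not reproduce Solr's actual hash function or range assignment. `shard_for` is a hypothetical helper.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Map a document id to a shard index via equal-sized hash ranges."""
    # crc32 returns an unsigned 32-bit value in Python 3 (stand-in hash).
    h = zlib.crc32(doc_id.encode("utf-8"))
    range_size = (1 << 32) // num_shards
    # Clamp so the few hashes past the last full range land in the last shard.
    return min(h // range_size, num_shards - 1)


# The same id always routes to the same shard, whichever node receives it.
for doc_id in ("doc-1", "doc-2", "doc-3"):
    print(doc_id, "->", shard_for(doc_id, 3))
```

Because routing depends only on the id and the shard count, any node can forward an update to the right shard leader without a central coordinator.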
  • 17. SolrCloud details
      • “Leaders” and “replicas”
          - Leaders are automatically elected
      • A leader is just a replica with some coordination responsibilities for the associated replicas
      • If a leader goes down, one of the associated replicas is elected as the new leader
      • New nodes are automatically assigned a shard and role, and replicate/recover as needed
      • CloudSolrServer
      • Replication in Solr 4
          - Used for new and recovering replicas
          - Or for traditional master/slave configuration
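The election behaviour described above can be sketched with the common ZooKeeper-style recipe, in which each participant gets an increasing sequence number and the lowest live number is the leader. This is an in-memory toy that assumes Solr's election behaves along these lines; `ShardElection` and its methods are hypothetical and no ZooKeeper client is involved.

```python
import itertools


class ShardElection:
    """Toy leader election for the replicas of one shard."""

    def __init__(self):
        self._seq = itertools.count()
        self._members = {}  # replica name -> sequence number at join time

    def join(self, replica):
        """A replica registers (or re-registers) and gets the next number."""
        self._members[replica] = next(self._seq)

    def leave(self, replica):
        """A replica goes down; its registration disappears."""
        self._members.pop(replica, None)

    def leader(self):
        """The live replica with the lowest sequence number leads."""
        if not self._members:
            return None
        return min(self._members, key=self._members.get)
```

A replica that comes back after a failure joins at the end of the queue, so it recovers as an ordinary replica instead of displacing the current leader.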
  • 18. NoSQL
      • Update durability
          - A transaction log ensures that even uncommitted documents are never lost.
      • Real-time Get
          - The ability to quickly retrieve the latest version of a document, without the need to commit or open a new searcher
      • Versioning and Optimistic Locking
          - Combined with real-time get, this allows read-update-write functionality that ensures no conflicting changes were made concurrently by other clients.
      • Atomic updates
          - The ability to add, remove, change, and increment fields of an existing document without having to send in the complete document again.
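The interplay of real-time get, versioning, and atomic updates above can be sketched with a small in-memory store. This is an illustration of the optimistic-locking pattern only, not Solr's API: `DocStore`, `VersionConflict`, and the `("set" | "inc" | "add", value)` operation tuples are hypothetical stand-ins chosen for the example.

```python
class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""


class DocStore:
    def __init__(self):
        self._docs = {}   # id -> (version, fields)
        self._clock = 0   # monotonically increasing version source

    def get(self, doc_id):
        """'Real-time get': always returns the latest version and fields."""
        version, fields = self._docs[doc_id]
        return version, dict(fields)

    def put(self, doc_id, fields, expected_version=None):
        """Write a document; fail if expected_version no longer matches."""
        current = self._docs.get(doc_id)
        if expected_version is not None:
            if current is None or current[0] != expected_version:
                raise VersionConflict(doc_id)
        self._clock += 1
        self._docs[doc_id] = (self._clock, dict(fields))
        return self._clock

    def atomic_update(self, doc_id, ops, expected_version=None):
        """Apply per-field operations without resending the whole document."""
        _, fields = self.get(doc_id)
        for name, (op, value) in ops.items():
            if op == "set":
                fields[name] = value
            elif op == "inc":
                fields[name] = fields.get(name, 0) + value
            elif op == "add":
                fields[name] = list(fields.get(name, [])) + [value]
            else:
                raise ValueError(f"unknown operation: {op}")
        return self.put(doc_id, fields, expected_version)
```

A client reads a document with its version, computes a change, and writes back with that version; if another client wrote in between, the write fails and the client retries from a fresh read.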
  • 19. Some numbers
      • On a Wikipedia index (11M documents)
          - Time to perform the first query with sorting (no warmup queries). Solr 3x: 13 seconds, Solr 4: 6 seconds.
          - Memory consumption. Solr 3x: 1,040M, Solr 4: 366M. Yes, almost a 2/3 reduction in memory use. And that’s the entire program size, not counting memory used to just start Solr and Jetty running.
          - Number of objects on the heap. Solr 3x: 19.4M, Solr 4: 80K. No, that’s not a typo. There are over two orders of magnitude fewer objects on the heap in trunk!
      • From an Erick Erickson blog entry (see Links slide)
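The ratios quoted on the slide above follow directly from its figures; a quick check of the arithmetic (variable names are just labels for the slide's numbers):

```python
import math

# Figures as quoted on the slide (Wikipedia index, 11M documents).
first_query_3x_s, first_query_4_s = 13, 6
mem_3x_mb, mem_4_mb = 1040, 366
objs_3x, objs_4 = 19_400_000, 80_000

first_query_speedup = first_query_3x_s / first_query_4_s   # about 2.2x faster
mem_reduction = 1 - mem_4_mb / mem_3x_mb                    # about 0.648
object_ratio = objs_3x / objs_4                             # 242.5x fewer
orders_of_magnitude = math.log10(object_ratio)              # about 2.38

print(f"first sorted query: {first_query_speedup:.1f}x faster")
print(f"memory reduction:   {mem_reduction:.1%}")
print(f"heap objects:       {object_ratio:.1f}x fewer "
      f"({orders_of_magnitude:.2f} orders of magnitude)")
```

So "almost a 2/3 reduction" corresponds to roughly 65% less memory, and "over two orders of magnitude" to a factor of about 240 in heap object count.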
  • 20. Links
      • Lucene/Solr: lucene.apache.org
      • “Lucene in Action”: www.manning.com/lucene
      • Blacklight
          - projectblacklight.org
          - Examples: search.lib.virginia.edu and searchworks.stanford.edu
      • SearchHub.org
          - Community/public content
          - http://searchhub.org/dev/2012/04/06/memory-comparisons-between-solr-3x-and-trunk/
  • 21. About LucidWorks
      • LucidWorks Search
          - Lucene/Solr 4 powered
          - On-premise or hosted (Amazon EC2 and Azure)
          - Rich connector framework for SharePoint, web crawling, etc.
          - Built-in security support
      • LucidWorks Big Data
          - Scalable classification, machine learning, analytics
      • Lucene/Solr commercial support
      • Consulting
      • Training
      • http://www.lucidworks.com
