Lucene Search Essentials: Scorers, Collectors and Custom Queries

Presented by Mikhail Khludnev, Principal Engineer, Grid Dynamics

My team is building a next-generation eCommerce search platform for a major online retailer with quite challenging business requirements. It turns out that the default Lucene toolbox is not an ideal fit for those challenges, so the team had to hack deep into the Lucene core to achieve our goals. We accumulated quite a deep understanding of Lucene search internals and want to share our experience. We will start with an API overview, and then look at essential search algorithms and their implementations in Lucene. Finally, we will review a few cases of query customization, pitfalls and common performance problems.

Presentation Transcript

  • Scorers, Collectors and Custom Queries Mikhail Khludnev
  • Custom Queries
  • Custom Queries http://nlp.stanford.edu/IR-book/
  • Custom Queries Match Spotting http://nlp.stanford.edu/IR-book/
  • Custom Queries ..hm what for ?
  • denim dress qf=STYLE TYPE
  • denim dress qf=STYLE TYPE DisjunctionMaxQuery(( (STYLE:denim OR TYPE:denim) | (STYLE:dress OR TYPE:dress) ))
  • denim dress qf=STYLE TYPE ( DisjunctionMaxQuery(( STYLE:denim | TYPE:denim )) )OR( DisjunctionMaxQuery(( STYLE:dress | TYPE:dress )) )
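The two query shapes above rank documents differently. The following is a minimal sketch of that difference, assuming every matching FIELD:term clause scores a flat 1, that OR sums its clauses, and that DisjunctionMaxQuery takes the plain maximum (no tie-breaker, no coord factor). The class and method names are hypothetical and do not come from Lucene.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class DismaxShapeSketch {
    /** Builds a toy document from "FIELD:term" clauses (hypothetical representation). */
    static Set<String> doc(String... clauses) {
        return new HashSet<>(Arrays.asList(clauses));
    }

    /** 1 if the document contains FIELD:term, else 0 (assumed flat clause score). */
    static int clause(Set<String> doc, String field, String term) {
        return doc.contains(field + ":" + term) ? 1 : 0;
    }

    /** First shape: a single max over per-term groups, each group summing across fields. */
    static int maxOverTerms(Set<String> doc, String[] terms, String[] fields) {
        int best = 0;
        for (String term : terms) {
            int sum = 0;
            for (String field : fields) {
                sum += clause(doc, field, term);
            }
            best = Math.max(best, sum);
        }
        return best;
    }

    /** Second shape: a sum over terms, each term taking the max across fields. */
    static int sumOverTerms(Set<String> doc, String[] terms, String[] fields) {
        int total = 0;
        for (String term : terms) {
            int best = 0;
            for (String field : fields) {
                best = Math.max(best, clause(doc, field, term));
            }
            total += best;
        }
        return total;
    }
}
```

Under these assumptions a document with STYLE:denim and TYPE:dress scores 1 in the first shape and 2 in the second, while a document with denim in both fields and no dress scores 2 and 1 respectively. Only the second shape rewards covering both query terms.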
  • Custom Queries
  • Inverted Index
  • T[0] = "it is what it is" T[1] = "what is it" T[2] = "it is a banana"
  • "a": "banana": "is": "it": "what": {2} {2} {0, 1, 2} {0, 1, 2} {0, 1} T[0] = "it is what it is" T[1] = "what is it" T[2] = "it is a banana"
  • "a": "banana": "is": "it": "what": {2} {2} {0, 1, 2} {0, 1, 2} {0, 1} term dictionary postings list
  • index/_1.tis "a" "banana" "is" →"t" "what" index/_1.frq {2} {2} {0, 1, 2} {0, 1, 2} {0, 1}
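The term dictionary and postings lists shown above can be reproduced with a few lines of plain Java. This is a sketch only: the class name is hypothetical, terms are split on whitespace, and nothing here reflects Lucene's on-disk .tis/.frq encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class InvertedIndexSketch {
    /** Maps each term to the ascending list of IDs of the documents that contain it. */
    static Map<String, List<Integer>> build(String[] docs) {
        // TreeMap keeps terms sorted, like a term dictionary
        Map<String, List<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            for (String term : docs[docId].split("\\s+")) {
                List<Integer> postings = index.computeIfAbsent(term, t -> new ArrayList<>());
                // docIds arrive in ascending order, so checking the tail is enough to dedupe
                if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                    postings.add(docId);
                }
            }
        }
        return index;
    }
}
```

Calling `InvertedIndexSketch.build` on the three T[] strings from the slides yields the same five postings lists.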
  • http://www.lib.rochester.edu/index.cfm?PAGE=489
  • What is a Scorer?
  • "a": "banana": "is": "it": "what": {2} {2} {0, 1, 2} {0, 1, 2} {0, 1}
  • while( (doc = nextDoc())!=NO_MORE_DOCS){ println("found "+ doc + " with score "+score()); }
  • 2783 issues
  • Note: Weight is omitted for sake of compactness
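To make the nextDoc()/score() loop above concrete, here is a small stand-in for a scorer that walks one postings list. It is a sketch, not Lucene's Scorer: `PostingsScorer` is a hypothetical class, the score is assumed to be a constant 1.0, and NO_MORE_DOCS is taken to be Integer.MAX_VALUE.

```java
import java.util.ArrayList;
import java.util.List;

final class ScorerLoopSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical stand-in for a term scorer: iterates one ascending postings list. */
    static final class PostingsScorer {
        private final int[] postings;
        private int pos = -1; // not positioned until the first nextDoc()

        PostingsScorer(int... postings) {
            this.postings = postings;
        }

        int nextDoc() {
            if (pos < postings.length) {
                pos++;
            }
            return docID();
        }

        int docID() {
            if (pos < 0) {
                return -1;
            }
            return pos < postings.length ? postings[pos] : NO_MORE_DOCS;
        }

        float score() {
            return 1.0f; // assumed flat score for the sketch
        }
    }

    /** Same loop as on the slide, collecting the lines instead of printing them. */
    static List<String> drain(PostingsScorer scorer) {
        List<String> found = new ArrayList<>();
        int doc;
        while ((doc = scorer.nextDoc()) != NO_MORE_DOCS) {
            found.add("found " + doc + " with score " + scorer.score());
        }
        return found;
    }
}
```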
  • Custom Queries http://nlp.stanford.edu/IR-book/
  • Doc-at-time search
  • "a": {2} "banana": {2} "is": {0, 1, 2} "it": {0, 1, 2} "what": {0, 1} what OR is OR a OR banana
  • "is": {0, 1, 2} "what": {0, 1} "a": {2} "banana": {2} "it": {0, 1, 2}
  • "is": {0, 1, 2} "what": {0, 1} "a": {2} "banana": {2} collect(0) score():2 Collector
  • "is": {0, 1, 2} "what": {0, 1} "a": {2} "banana": {2} docID×score 0×2
  • "is": {0, 1, 2} "what": {0, 1} "a": {2} "banana": {2} collect(1) score():2 Collector 0×2
  • "is": {0, 1, 2} "what": {0, 1} "a": {2} "banana": {2} Collector 0×2 1×2
  • "is": {0, 1, 2} "a": {2} "banana": {2} "what": {0, 1} collect(2) score():3 Collector 0×2 1×2
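The doc-at-time walk above can be sketched with a heap of cursors, one per postings list, ordered by the document each cursor currently points at. This is an illustration under simplifying assumptions (score = number of matching terms, hypothetical class names), not Lucene's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class DocAtATimeSketch {
    /** Cursor over one postings list, ordered by the docID it currently points at. */
    static final class Cursor implements Comparable<Cursor> {
        final int[] postings;
        int pos = 0;

        Cursor(int[] postings) {
            this.postings = postings;
        }

        int doc() {
            return postings[pos];
        }

        boolean advance() {
            return ++pos < postings.length;
        }

        @Override
        public int compareTo(Cursor other) {
            return Integer.compare(doc(), other.doc());
        }
    }

    /** Returns {docID, score} pairs in docID order; score = number of matching terms. */
    static List<int[]> search(int[][] postingsLists) {
        PriorityQueue<Cursor> heap = new PriorityQueue<>();
        for (int[] postings : postingsLists) {
            if (postings.length > 0) {
                heap.add(new Cursor(postings));
            }
        }
        List<int[]> collected = new ArrayList<>();
        while (!heap.isEmpty()) {
            int doc = heap.peek().doc();
            int score = 0;
            // pull every cursor sitting on the smallest docID, then move it forward
            while (!heap.isEmpty() && heap.peek().doc() == doc) {
                Cursor cursor = heap.poll();
                score++;
                if (cursor.advance()) {
                    heap.add(cursor);
                }
            }
            collected.add(new int[] {doc, score}); // each doc is collected exactly once
        }
        return collected;
    }
}
```

For `what OR is OR a OR banana` over the sample index this collects 0×2, 1×2 and 2×3, in that order.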
  • Term-at-time search "lorem" "ipsum" "dolor" "sit" "amet" "consectetur"
  • "a": {2} "banana": {2} "is": {0, 1, 2} "it": {0, 1, 2} "what": {0, 1} what OR is OR a OR banana
  • "a": {2} "banana": {2} "is": {0, 1, 2} "it": {0, 1, 2} "what": {0, 1} Accumulator ... 0×1 ... 1×1 ...
  • "a": {2} "banana": {2} "is": {0, 1, 2} "it": {0, 1, 2} "what": {0, 1} Accumulator ... 0×2 ... 1×2 ... 2×1 ...
  • "a": {2} "banana": {2} "is": {0, 1, 2} "it": {0, 1, 2} "what": {0, 1} Accumulator ... 0×2 ... 1×2 ... 2×2 ...
  • "a": {2} "banana": {2} "is": {0, 1, 2} "it": {0, 1, 2} "what": {0, 1} Accumulator ... 0×2 ... 1×2 ... 2×3 ...
  • "a": {2} "banana": {2} "is": {0, 1, 2} "it": {0, 1, 2} "what": {0, 1} Accumulator ... 0×2 ... 1×2 ... 2×3 ... Collector 2×3 0×2 1×2
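Term-at-time search, by contrast, drains one postings list completely before touching the next and keeps a per-document accumulator. A minimal sketch, assuming an accumulator array sized to the number of documents and a score equal to the number of matching terms (the class name is hypothetical):

```java
final class TermAtATimeSketch {
    /** Returns one accumulator per document; acc[doc] = number of query terms in doc. */
    static int[] accumulate(int[][] postingsLists, int maxDoc) {
        int[] acc = new int[maxDoc]; // memory grows with the index, not with the query
        for (int[] postings : postingsLists) {
            // each list is read start to finish before the next one is opened
            for (int doc : postings) {
                acc[doc]++;
            }
        }
        return acc;
    }
}
```

Only after the last list has been processed can the accumulators be handed to the collector, which is why the final slide shows the Collector step last.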
  • O(n) "lorem" "ipsum" "dolor" "sit" "amet" "consectetur" http://nlp.stanford.edu/IR-book/
  • k 1×9 7×9 2×7 2×5 9×5 6×4 ... ... ≤4 ... ... n
  • http://en.wikipedia.org/wiki/Binary_heap
  • 6×4 log k 9×5 2×4 2×7 7×9 1×9 n ... ... ≤4 ... ...
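The n log k term comes from keeping only the best k hits in a binary heap. The sketch below assumes hits are {docID, score} pairs with integer scores and ignores tie-breaking. `TopKSketch` is a hypothetical name, not a Lucene collector.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

final class TopKSketch {
    /** hits are {docID, score} pairs; returns the k best, highest score first. */
    static List<int[]> topK(List<int[]> hits, int k) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        // min-heap on score: the root is the weakest of the current top k
        PriorityQueue<int[]> heap = new PriorityQueue<>((a, b) -> Integer.compare(a[1], b[1]));
        for (int[] hit : hits) {
            if (heap.size() < k) {
                heap.add(hit); // O(log k)
            } else if (hit[1] > heap.peek()[1]) {
                heap.poll(); // evict the weakest entry, O(log k)
                heap.add(hit);
            }
            // otherwise the hit is not competitive and costs O(1)
        }
        List<int[]> best = new ArrayList<>(heap);
        Collections.sort(best, (a, b) -> Integer.compare(b[1], a[1]));
        return best;
    }
}
```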
  • "a": {2} "banana": {2} "is": {0, 1, 2} "it": {0, 1, 2} "what": {0, 1} p what OR is OR a OR banana q
  • Complexity and memory, doc-at-time vs term-at-time
    term-at-time: complexity O(p + n log k)
  • "a": {2} "banana": {2} q "is": 1 {0, 1, 2} 1 2 "what": {0, 1} 2
  • Complexity and memory, doc-at-time vs term-at-time
    doc-at-time: complexity O(p log q + n log k), memory q + k
    term-at-time: complexity O(p + n log k), memory n
  • BooleanScorer
  • org.apache.lucene.search.BooleanScorer "a": {2} "banana": {2} "is": {0, 1, 2} "it": {0, 1, 2} "what": {0, 1} chunk Hashtable[2] ×1 ×1 0 1
  • org.apache.lucene.search.BooleanScorer "a": {2} "banana": {2} "is": {0, 1, 2} "it": {0, 1, 2} "what": {0, 1} chunk ×2 ×2 0 1
  • org.apache.lucene.search "a": {2} "banana": {2} "is": {0, 1, 2} "it": {0, 1, 2} "what": {0, 1} ×2 ×2 0 1 Collector 0×2 1×2
  • org.apache.lucene.search "a": {2} "banana": {2} "is": {0, 1, 2} "it": {0, 1, 2} "what": {0, 1} Collector 0×2 1×2 ×1 0 1
  • org.apache.lucene.search "a": {2} "banana": {2} "is": {0, 1, 2} "it": {0, 1, 2} "what": {0, 1} Collector 0×2 1×2 ×2 0 1
  • org.apache.lucene.search "a": {2} "banana": {2} "is": {0, 1, 2} "it": {0, 1, 2} "what": {0, 1} Collector 0×2 1×2 ×3 0 1
  • org.apache.lucene.search "a": {2} "banana": {2} "is": {0, 1, 2} "it": {0, 1, 2} "what": {0, 1} ×3 0 1 Collector 2×3 0×2 1×2
  • Linked Open Hash [2K] ×2 0 ×3 ×1 ×1 ×5 ×2 1 2 3 4 5 6 7
  • if (collector.acceptsDocsOutOfOrder() && topScorer && required.size() == 0 && minNrShouldMatch == 1) {
        return new BooleanScorer(...);   // term-at-time
    } else {
        return new BooleanScorer2(...);  // doc-at-time
    }
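The chunk slides above suggest term-at-time processing restricted to a window of document IDs, so the accumulator stays small. Here is a rough sketch of that idea, assuming a fixed chunk size and buckets flushed in docID order. The real BooleanScorer may hand documents to the collector out of order, which this hypothetical class does not attempt to reproduce.

```java
import java.util.ArrayList;
import java.util.List;

final class ChunkedTermAtATimeSketch {
    /** Returns {docID, score} pairs; score = number of matching terms. */
    static List<int[]> search(int[][] postingsLists, int maxDoc, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<int[]> collected = new ArrayList<>();
        int[] pos = new int[postingsLists.length]; // read position per postings list
        int[] buckets = new int[chunkSize];        // accumulator for the current chunk only
        for (int start = 0; start < maxDoc; start += chunkSize) {
            int end = Math.min(start + chunkSize, maxDoc);
            // term-at-time inside the chunk: drain each list up to the chunk boundary
            for (int i = 0; i < postingsLists.length; i++) {
                int[] postings = postingsLists[i];
                while (pos[i] < postings.length && postings[pos[i]] < end) {
                    buckets[postings[pos[i]] - start]++;
                    pos[i]++;
                }
            }
            // flush the chunk to the "collector" and reset the buckets
            for (int b = 0; b < end - start; b++) {
                if (buckets[b] > 0) {
                    collected.add(new int[] {start + b, buckets[b]});
                    buckets[b] = 0;
                }
            }
        }
        return collected;
    }
}
```

With a chunk size of 2 on the sample index, documents 0 and 1 are flushed from the first chunk and document 2 from the second, matching the sequence on the slides.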
  • q=village operations years disaster visit
  • q=village operations years disaster visit etc map seventieth peneplains tussock sir memory character campaign author public wonder forker middy vocalize enable race object signal symptom deputy where typhous rectifiable polygamous originally look generation ultimately reasonably ratio numb apposing enroll manhood problem suddenly definitely corp event material affair diploma would dimout speech notion engine artist hotel text field hashed rottener impeding i cricket virtually valley sunday rock come observes gallnuts vibrantly prize involve
  • q=+village +operations +years +disaster +visit
  • Conjunction (+, MUST)
  • "a": {2,3} "banana": {2,3} "is": {0, 1, 2, 3} "it": {0, 1, 3} "what": {0, 1, 3} what AND is AND a AND it
  • "a": {2,3} "banana": {2,3} "is": {0, 1, 2, 3} "it": {0, 1, 3} "what": {0, 1, 3}
  • "a": {2,3} "banana": {2,3} "is": {0, 1, 2, 3} "it": {0, 1, 3} "what": {0, 1, 3} Collector 3×4
  • http://www.flickr.com/photos/fatniu/184615348/
  • Ω(n q + n log k)
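A conjunction can be evaluated by leapfrogging: each list in turn is advanced to the current candidate document, and any list that overshoots supplies the new candidate. The following is a minimal sketch, assuming ascending postings and a linear advance. The names are hypothetical and this is not Lucene's ConjunctionScorer.

```java
import java.util.ArrayList;
import java.util.List;

final class LeapfrogSketch {
    /** Returns the docIDs present in every postings list, in ascending order. */
    static List<Integer> intersect(int[][] lists) {
        List<Integer> result = new ArrayList<>();
        if (lists.length == 0) {
            return result;
        }
        int[] pos = new int[lists.length];
        int candidate = 0; // smallest docID that could still match
        int agreed = 0;    // how many consecutive lists contain the candidate
        int i = 0;
        while (true) {
            // advance list i to its first docID >= candidate
            while (pos[i] < lists[i].length && lists[i][pos[i]] < candidate) {
                pos[i]++;
            }
            if (pos[i] == lists[i].length) {
                return result; // one list is exhausted, no further matches are possible
            }
            int doc = lists[i][pos[i]];
            if (doc == candidate) {
                agreed++;
            } else {
                candidate = doc; // this list leaps ahead and sets the new candidate
                agreed = 1;
            }
            if (agreed == lists.length) {
                result.add(candidate);
                candidate++;
                agreed = 0;
            }
            i = (i + 1) % lists.length;
        }
    }
}
```

For `what AND is AND a AND it` over the four-document example, only document 3 survives, which is the 3×4 entry the collector receives.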
  • Wrap-up ● doc-at-time vs term-at-time ● conjunction and leapfrog
  • complexity O(n) memory O(const)
  • Custom Queries http://nlp.stanford.edu/IR-book/
  • Custom Queries ● Sample Coverage Query ● Deeply Branched vs Flat ● minShouldMatch ● Filtering ● Performance Problem
  • silver jeans dress "silver" "jeans" Note: "foo bar" is not a phrase query, just a string "dress"
  • silver jeans dress "silver" "jeans" "dress" "silver jeans dress"
  • silver jeans dress "silver" "jeans" "dress" "silver jeans dress" "silver jeans" "dress" "silver" "jeans dress"
  • silver jeans dress "silver" "jeans" "dress" "silver jeans dress" "silver jeans" "dress" "silver" "jeans dress" "silver" "dress" "silver jeans" "jeans" "silver jeans" "jeans" "dress" Note: "foo bar" is not a phrase query, just a string
  • boolean verifyMatch() {
      int sumLength = 0;
      for (Scorer child : getChildren()) {
        if (child.docID() == docID()) {  // child is positioned on the current document
          TermQuery tq = (TermQuery) child.getWeight().getQuery();
          sumLength += tq.getTerm().text().length();
        }
      }
      return sumLength >= expectedLength;
    }
  • Deeply Branched vs Flat
  • (+"silver jeans" +"dress") ORmax (+"silver jeans dress") ORmax (+"silver" +( (+"jeans" +"dress") ORmax +"jeans dress" ) ) ORmax is DisjunctionMaxQuery
  • ("silver jeans" "dress") ORmax ("silver jeans dress") ORmax ("silver" ( ("jeans" "dress") ORmax "jeans dress" ) ) ORmax is DisjunctionMaxQuery
  • + B:"silver jeans" ORmax T:"silver jeans" ORmax S:"silver jeans" + B:"dress" ORmax T:"dress" ORmax S:"dress" B - BRAND T - TYPE S - STYLE ORmax B:"silver jeans dress" ORmax T:"silver jeans dress" ORmax S:"silver jeans dress" ORmax + B:"silver" ORmax T:"silver" ORmax S:"silver" + + B:"jeans" ORmax T:"jeans" ORmax S:"jeans" + B:"dress" ORmax T:"dress" ORmax S:"dress" ORmax B:"jeans dress" ORmax T:"jeans dress" ORmax S:"jeans dress"
  • B:"silver" T:"silver" S:"silver" B:"jeans" T:"jeans" S:"jeans" B:"dress" T:"dress" S:"dress" B:"silver jeans" T:"silver jeans" S:"silver jeans" B:"silver jeans dress" T:"silver jeans dress" S:"silver jeans dress" B:"jeans dress" T:"jeans dress" S:"jeans dress"
  • Steadiness problem AFAIK 3.x only.
  • {1, 3, 7, 10, 27,30,..} {3, 5, 10, 27,32,..} {2,3, 27,31,..} {..., 20, 27,32,..} {..., 30, 31,32,..} {..., 30,37,..} 3 3 20 3 30 30
  • {3, 5, 10, 27,32,..} {1, 3, 7, 10, 27,30,..} {2,3, 27,31,..} {..., 20, 27,32,..} {..., 30, 31,32,..} {..., 30,37,..} docID= 3 5 7 20 27 30 30 3.x
  • minShouldMatch
  • straight silver jeans minShouldMatch=2 straight jeans silver jeans silver jeans straight jeans silver
  • org.apache.lucene.search.DisjunctionSumScorer
    int nextDoc() {
      while (true) {
        while (subScorers[0].docID() == doc) {
          if (subScorers[0].nextDoc() != NO_MORE_DOCS) {
            heapAdjust(0);
          } else {
            ....
          }
        }
        ...
        if (nrMatchers >= minimumNrMatchers) {
          break;
        }
      }
      return doc;
    }
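The loop above counts how many sub-scorers sit on the same document and only stops on documents that reach the threshold. A self-contained sketch of the same idea follows. It is not the DisjunctionSumScorer source: the heap entry layout and the class name are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class MinShouldMatchSketch {
    /** DocID that heap entry {listIndex, position} currently points at. */
    private static int docAt(int[][] lists, int[] entry) {
        return lists[entry[0]][entry[1]];
    }

    /** Returns docIDs matched by at least minShouldMatch of the postings lists. */
    static List<Integer> search(int[][] postingsLists, int minShouldMatch) {
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                (a, b) -> Integer.compare(docAt(postingsLists, a), docAt(postingsLists, b)));
        for (int i = 0; i < postingsLists.length; i++) {
            if (postingsLists[i].length > 0) {
                heap.add(new int[] {i, 0});
            }
        }
        List<Integer> matches = new ArrayList<>();
        while (!heap.isEmpty()) {
            int doc = docAt(postingsLists, heap.peek());
            int nrMatchers = 0;
            // every list positioned on this doc is counted and moved forward
            while (!heap.isEmpty() && docAt(postingsLists, heap.peek()) == doc) {
                int[] top = heap.poll();
                nrMatchers++;
                if (++top[1] < postingsLists[top[0]].length) {
                    heap.add(top);
                }
            }
            if (nrMatchers >= minShouldMatch) {
                matches.add(doc);
            }
        }
        return matches;
    }
}
```

Note that every document in every list is still visited, even those that cannot reach the threshold. This sketch does nothing to skip them.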
  • Let’s filter! btw, what is it?
  • RANDOM_ACCESS_FILTER_STRATEGY LEAP_FROG_FILTER_FIRST_STRATEGY LEAP_FROG_QUERY_FIRST_STRATEGY QUERY_FIRST_FILTER_STRATEGY
  • http://localhost:8983/solr/collection1/select ?q=village operations years disaster visit etc map seventieth peneplains tussock sir memory character campaign author public wonder forker middy vocalize enable race object signal symptom deputy where generation ultimately reasonably ratio numb apposing enroll manhood problem suddenly definitely corp event gallnuts vibrantly prize involve explanation module& qf=text_all&defType=edismax&
  • http://localhost:8983/solr/collection1/select ?q=village operations years disaster visit etc map seventieth peneplains tussock sir memory character campaign author public wonder forker middy vocalize enable race object signal symptom deputy where generation ultimately reasonably ratio numb apposing enroll manhood problem suddenly definitely corp event gallnuts vibrantly prize involve explanation module& qf=text_all&defType=edismax& fq= id:yes_49912894 id:nurse_30134968&
  • http://localhost:8983/solr/collection1/select ?q=village operations years disaster visit etc map seventieth peneplains tussock sir memory character campaign author public wonder forker middy vocalize enable race object signal symptom deputy where generation ultimately reasonably ratio numb apposing enroll manhood problem suddenly definitely corp event gallnuts vibrantly prize involve explanation module& qf=text_all&defType=edismax& fq= id:yes_49912894 id:nurse_30134968& mm=32&
  • {1, 3, 7, 10, 27,30,..} {3, 5, 10, 27,32,..} { 20,27,31,..} mm=3 { 30,37,..}
  • Custom Queries Match Spotting http://nlp.stanford.edu/IR-book/
  • BRAND:"silver jeans" BRAND:"alfani" TYPE:"dress" TYPE:"dress" BRAND:"chaloree" TYPE:"dress" STYLE:"white" STYLE:"silver","jeans" STYLE:"silver" BRAND:"style&co" TYPE:"jeans dress" STYLE:"silver" BRAND:"silver jeans" TYPE:"dress" STYLE:"black" BRAND:"silver jeans" TYPE:"dress" STYLE:"white" BRAND:"silver jeans" TYPE:"jacket" STYLE: "black" BRAND:"angie" TYPE:"dress" STYLE:"silver","jeans" BRAND:"chaloree" TYPE:"jeans dress" STYLE:"silver" BRAND:"silver jeans" BRAND:"dotty" BRAND:"chaloree" TYPE:"dress" TYPE:"dress" STYLE:"blue" STYLE:"silver","jeans" STYLE:"jeans" "dress"
  • BRAND:"silver jeans" BRAND:"alfani" TYPE:"dress" TYPE:"dress" BRAND:"chaloree" TYPE:"dress" STYLE:"white" STYLE:"silver","jeans" STYLE:"silver" BRAND:"style&co" TYPE:"jeans dress" STYLE:"silver" BRAND:"silver jeans" TYPE:"dress" STYLE:"black" BRAND:"silver jeans" TYPE:"dress" silver jeans dress STYLE:"white" BRAND:"silver jeans" STYLE: "black" BRAND:"angie" TYPE:"jacket" TYPE:"dress" STYLE:"silver","jeans" BRAND:"chaloree" TYPE:"jeans dress" STYLE:"silver" BRAND:"silver jeans" BRAND:"dotty" BRAND:"chaloree" TYPE:"dress" TYPE:"dress" STYLE:"blue" STYLE:"silver","jeans" STYLE:"jeans" "dress"
  • BRAND:"silver jeans" BRAND:"alfani" TYPE:"dress" STYLE:"white" TYPE:"dress" BRAND:"chaloree" TYPE:"dress" STYLE:"silver","jeans" STYLE:"silver" BRAND:"style&co" TYPE:"jeans dress" STYLE:"silver" BRAND:"silver jeans" TYPE:"dress" STYLE:"black" BRAND:"silver jeans" TYPE:"dress" STYLE:"white" BRAND:"silver jeans" BRAND:"angie" TYPE:"jacket" TYPE:"dress" STYLE: "black" STYLE:"silver","jeans" BRAND:"chaloree" TYPE:"jeans dress" BRAND:"silver jeans" BRAND:"dotty" BRAND:"chaloree" STYLE:"silver" TYPE:"dress" STYLE:"blue" TYPE:"dress" STYLE:"silver","jeans" STYLE:"jeans" "dress"
  • BRAND:"silver jeans" TYPE:"dress" TYPE:"dress" STYLE:"silver","jeans" TYPE:"jeans dress" BRAND:"silver jeans" TYPE:"dress" BRAND:"silver jeans" STYLE:"silver" TYPE:"dress" TYPE:"dress" STYLE:"silver","jeans" TYPE:"jeans dress" BRAND:"silver jeans" STYLE:"silver" TYPE:"dress" TYPE:"dress" STYLE:"silver","jeans"
  • BRAND:"silver jeans" TYPE:"dress" (4) TYPE:"dress" STYLE:"silver","jeans" TYPE:"jeans dress" TYPE:"dress" STYLE:"silver","jeans" TYPE:"jeans dress" TYPE:"dress" STYLE:"silver" STYLE:"silver" STYLE:"silver","jeans"
  • BRAND:"silver jeans" TYPE:"dress" (4) TYPE:"dress" STYLE:"silver","jeans" (3) TYPE:"jeans dress" STYLE:"silver" TYPE:"jeans dress" STYLE:"silver"
  • BRAND:"silver jeans" TYPE:"dress" (4) TYPE:"dress" STYLE:"silver","jeans" (3) TYPE:"jeans dress" STYLE:"silver" (2)
  • silver jeans dress BRAND:"silver jeans" TYPE:"dress" (4) TYPE:"dress" STYLE:"silver","jeans" (3) TYPE:"jeans dress" STYLE:"silver" (2)
  • http://goo.gl/7LJFi Scorers, Collectors and Custom Queries http://google.com/+MikhailKhludnev
  • Appendixes ● Drill Sideways Facets ● Collectors
  • Appendix D Drill Sideways Facets
  • +CATEGORY: Denim +FIT: Straight +WASH: Dark&B
  • +CATEGORY: Denim +WASH: Dark&B +CATEGORY: Denim +FIT: Straight +WASH: Dark&B
  • +CATEGORY: Denim +WASH: Dark&B +CATEGORY: Denim +FIT: Straight +WASH: Dark&B +CATEGORY: Denim +FIT: Straight
  • +CATEGORY: Denim FIT: Straight WASH: Dark&Black ... /minShouldMatch=Ndrilldowns-1
  • FIT: Straight +CAT: Denim WASH: Dark
  • FIT: Straight near miss 2 totalHits 3 near miss 2 WASH: Dark +CAT: Denim
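The drill-sideways idea above (required base clause, optional drilldowns, minShouldMatch = Ndrilldowns-1) can be sketched as follows: a document matching every drilldown is a hit, and a document failing exactly one drilldown is a near miss that counts toward that dimension's facet. This is an illustration with hypothetical names and plain arrays, not Lucene's DrillSideways class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DrillSidewaysSketch {
    /** Hits, plus for each dimension the docs that failed only that dimension. */
    static final class Result {
        final List<Integer> hits = new ArrayList<>();
        final Map<String, List<Integer>> nearMisses = new LinkedHashMap<>();
    }

    /** base and every drilldown postings array must be sorted ascending. */
    static Result search(int[] base, Map<String, int[]> drilldowns) {
        Result result = new Result();
        for (String dim : drilldowns.keySet()) {
            result.nearMisses.put(dim, new ArrayList<>());
        }
        for (int doc : base) { // the base clause is required
            int misses = 0;
            String missedDim = null;
            for (Map.Entry<String, int[]> dim : drilldowns.entrySet()) {
                if (Arrays.binarySearch(dim.getValue(), doc) < 0) {
                    misses++;
                    missedDim = dim.getKey();
                }
            }
            if (misses == 0) {
                result.hits.add(doc);
            } else if (misses == 1) {
                result.nearMisses.get(missedDim).add(doc);
            }
            // two or more misses: the doc fails minShouldMatch and is dropped
        }
        return result;
    }
}
```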
  • Doc at time base query is highly selective
  • +CAT:D..{1, 7, 9, 15 } FIT:S.. {2, 7, 8, 9, 10,12} WASH:D..{2, 7, 11,13,15} ...
  • +CAT:D..{1, 7, 9, 15 } FIT:S.. {2, 7, 8, 9, 10,12} WASH:D..{2, 7, 11,13,15} ... TopDocsCollector
  • Term at time drilldown queries are highly selective
  • +CAT:D..{1, 7, 9, 15 } FIT:S.. {2, 7, 8, 9, 10,12} WASH:D..{2, 7, 11,13,15} ... hits 1 miss Fit 1 2 ... hits 1 miss Fit 7 hits 1 miss Fit 8 9 10 11 hits hits 1 1 miss miss Fit Fit 12 13 15
  • +CAT:D..{1, 7, 9, 15 } FIT:S.. {2, 7, 8, 9, 10,12} WASH:D..{2, 7, 11,13,15} ... hits 2 miss no 1 2 ... hits hits hits hits hits hits hits hits 2 1 1 1 1 1 1 1 miss miss miss miss miss miss miss miss no Wash Wash Wash Fit Wash Fit Fit 7 8 9 10 11 12 13 15
  • +CAT:D..{1, 7, 9, 15 } FIT:S.. {2, 7, 8, 9, 10,12} WASH:D..{2, 7, 11,13,15} ... hits 2 miss Cat 1 2 ... hits hits hits hits hits hits hits hits 3 1 1 1 1 2 2 1 miss miss miss miss miss miss miss miss Wash Wash Fit Wash Wash Fit Cat Fit Cat Cat Cat Cat 7 8 9 10 11 12 13 15
  • hits 2 miss Cat 1 2 ... hits hits hits hits hits hits hits hits 3 1 1 1 1 2 2 1 miss miss miss miss miss miss miss miss Fit no Wash Wash Wash Cat Wash Fit Cat Fit Cat Cat Cat 7 8 9 10 11 12 13 15
  • TopDocsCollector hits 3 miss ... 1 2 no 7 hits 2 miss Fit hits 2 miss Wash 8 9 10 11 12 13 15
  • Collector DocSetCollector TopDocsCollector TopFieldCollector TopScoreDocsCollector
  • DocSet or DocList? long [952045] = { 0, 0, 0, 0, 2050, 0, 0, 8, 0, 0, 0,... } int [2079] = {4, 12, 45, 67, 103, 673, 5890, 34103,...} int [100] = {8947, 7498,1, 230, 2356, 9812, 167,....}
  • DocList/TopDoc vs DocSet
    Size: k (numHits or rows) vs N (maxDocs)
    Ordered by: score or field vs docID
    Out-of-order collecting: allows* vs almost could allow (No)
  • ?×4 6×4 9×5 2×4 2×7 7×9 1×9
  • http://www.flickr.com/photos/jbagley/4303976811/sizes/o/
  • class OutOfOrderTopScoreDocCollector {
      boolean acceptsDocsOutOfOrder() { return true; }
      ..
      void collect(int doc) {
        float score = scorer.score();
        ...
        if (score == pqTop.score && doc > pqTop.doc) { ... }
      }
    }
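The extra `doc > pqTop.doc` comparison matters because, when documents arrive out of order, ties on score have to be broken by docID to keep the result stable. Below is a minimal sketch of a top-k queue with that tie-break, assuming integer scores and {docID, score} pairs. It is a hypothetical class, not the Lucene collector.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

final class OutOfOrderTopKSketch {
    /** hits are {docID, score}; returns top k by score, ties won by the smaller docID. */
    static List<int[]> topK(int[][] hits, int k) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        // root = weakest entry: lowest score, and among equal scores the largest docID
        PriorityQueue<int[]> pq = new PriorityQueue<>((a, b) ->
                a[1] != b[1] ? Integer.compare(a[1], b[1]) : Integer.compare(b[0], a[0]));
        for (int[] hit : hits) {
            if (pq.size() < k) {
                pq.add(hit);
                continue;
            }
            int[] top = pq.peek();
            // not competitive: lower score, or same score but a larger docID
            if (hit[1] < top[1] || (hit[1] == top[1] && hit[0] > top[0])) {
                continue;
            }
            pq.poll();
            pq.add(hit);
        }
        List<int[]> best = new ArrayList<>(pq);
        Collections.sort(best, (a, b) ->
                a[1] != b[1] ? Integer.compare(b[1], a[1]) : Integer.compare(a[0], b[0]));
        return best;
    }
}
```

Feeding the same equal-scored hits in any arrival order produces the same top k, which an in-order collector gets for free by rejecting every tie.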
  • UML http://www.flickr.com/photos/kristykay/2922670979/lightbox/