Writing custom queries scorers’ diversity and traps (lucene internals)(2)

Presented by Mikhail Khludnev, Grid Dynamics

Lucene has a number of built-in queries, but sometimes a developer needs to write custom queries, which can be challenging. We'll start from the basics: learn how Lucene searches, look into a few built-in query implementations, and learn two basic approaches to query evaluation. Then I'll share the experience my team gained while building an eCommerce search platform; we'll look at a sample custom query, or even a few, and talk about potential problems and caveats along the way.


  1. 1. WRITING CUSTOM QUERIES: SCORERS' DIVERSITY AND TRAPS. Mikhail Khludnev, Principal Engineer, eCommerce Search Team. mkhludnev@griddynamics.com, http://goo.gl/7LJFi
  2. 2. Custom Queries
  3. 3. Custom Queries
  4. 4. Custom Queries (http://nlp.stanford.edu/IR-book/)
  5. 5. Custom Queries (http://nlp.stanford.edu/IR-book/)
  6. 6. Custom Queries, Match Spotting (http://nlp.stanford.edu/IR-book/)
  7. 7. Custom Queries
  8. 8. Inverted Index
  9. 9. T[0] = "it is what it is"; T[1] = "what is it"; T[2] = "it is a banana"
  10. 10. "a": {2}; "banana": {2}; "is": {0, 1, 2}; "it": {0, 1, 2}; "what": {0, 1}. T[0] = "it is what it is"; T[1] = "what is it"; T[2] = "it is a banana"
  11. 11. "a": {2}; "banana": {2}; "is": {0, 1, 2}; "it": {0, 1, 2}; "what": {0, 1} (postings list)
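A minimal sketch of how the postings lists on the slides above could be built from the three sample texts. The class and method names are hypothetical, and whitespace tokenization is an assumption; this is not Lucene's indexing code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class InvertedIndexSketch {
    /** Maps each term to the sorted list of IDs of the documents that contain it. */
    static Map<String, List<Integer>> build(String... docs) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            for (String term : docs[docId].split("\\s+")) {
                List<Integer> postings = index.computeIfAbsent(term, t -> new ArrayList<>());
                // Doc IDs arrive in increasing order, so checking the tail avoids duplicates.
                if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                    postings.add(docId);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        build("it is what it is", "what is it", "it is a banana")
            .forEach((term, postings) -> System.out.println(term + ": " + postings));
    }
}
```

Each list is kept in increasing docID order, which is what the iteration patterns in the later slides rely on.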
  12. 12. What is a Scorer?
  13. 13. "a": {2}; "banana": {2}; "is": {0, 1, 2}; "it": {0, 1, 2}; "what": {0, 1}
  16. 16. while ((doc = nextDoc()) != NO_MORE_DOCS) { println("found " + doc + " with score " + score()); }
  17. 17. Note: Weight is omitted for the sake of compactness
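The loop on slide 16 can be tried against a toy scorer over a single postings list. This is a sketch under assumptions: the class name is hypothetical, the constant score of 1 is made up, and only the `nextDoc()`/`score()`/`NO_MORE_DOCS` shape is taken from the slide.

```java
final class PostingsScorerSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE; // sentinel used by the slide's loop

    private final int[] postings; // sorted doc IDs of one term
    private int pos = -1;         // -1 means iteration has not started

    PostingsScorerSketch(int... postings) {
        this.postings = postings;
    }

    /** Moves to the next document and returns its ID, or NO_MORE_DOCS when exhausted. */
    int nextDoc() {
        if (pos < postings.length) {
            pos++;
        }
        return pos < postings.length ? postings[pos] : NO_MORE_DOCS;
    }

    /** Hypothetical scoring: every match of this single term counts as 1. */
    float score() {
        return 1.0f;
    }

    public static void main(String[] args) {
        PostingsScorerSketch what = new PostingsScorerSketch(0, 1); // postings of "what"
        int doc;
        while ((doc = what.nextDoc()) != NO_MORE_DOCS) {
            System.out.println("found " + doc + " with score " + what.score());
        }
    }
}
```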
  18. 18. Custom Queries (http://nlp.stanford.edu/IR-book/)
  19. 19. Doc-at-time search
  20. 20. "a": {2}; "banana": {2}; "is": {0, 1, 2}; "it": {0, 1, 2}; "what": {0, 1}. Query: what OR is OR a OR banana
  22. 22. "is": {0, 1, 2}; "what": {0, 1}; "a": {2}; "banana": {2}; "it": {0, 1, 2}
  23. 23. "is": {0, 1, 2}; "what": {0, 1}; "a": {2}; "banana": {2}. collect(0), score(): 2. Collector: (empty)
  24. 24. "is": {0, 1, 2}; "what": {0, 1}; "a": {2}; "banana": {2}. Collector (docID×score): 0×2
  25. 25. "is": {0, 1, 2}; "what": {0, 1}; "a": {2}; "banana": {2}. collect(1), score(): 2. Collector: 0×2
  26. 26. "is": {0, 1, 2}; "what": {0, 1}; "a": {2}; "banana": {2}. Collector: 0×2, 1×2
  27. 27. "is": {0, 1, 2}; "a": {2}; "banana": {2}; "what": {0, 1}. collect(2), score(): 3. Collector: 0×2, 1×2
  28. 28. "is": {0, 1, 2}; "a": {2}; "banana": {2}; "what": {0, 1}. Collector: 2×3, 0×2, 1×2
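The doc-at-time walk-through above can be sketched as a merge of postings cursors kept in a heap ordered by current docID. The names are hypothetical and the score is simply the number of matching terms, as on the slides; it is an illustration, not Lucene's scorer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class DocAtATimeSketch {
    /** Cursor over one term's sorted postings. */
    static final class Cursor {
        final int[] postings;
        int pos = 0;

        Cursor(int[] postings) {
            this.postings = postings;
        }

        int doc() {
            return postings[pos];
        }

        boolean advance() {
            return ++pos < postings.length;
        }
    }

    /** Returns {docID, score} pairs in docID order; score = number of matching terms. */
    static List<int[]> search(int[]... termPostings) {
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(a.doc(), b.doc()));
        for (int[] postings : termPostings) {
            if (postings.length > 0) {
                heap.add(new Cursor(postings));
            }
        }
        List<int[]> collected = new ArrayList<>();
        while (!heap.isEmpty()) {
            int doc = heap.peek().doc();
            int score = 0;
            // Pop every cursor positioned on the current doc, then move it forward.
            while (!heap.isEmpty() && heap.peek().doc() == doc) {
                Cursor cursor = heap.poll();
                score++;
                if (cursor.advance()) {
                    heap.add(cursor);
                }
            }
            collected.add(new int[] {doc, score});
        }
        return collected;
    }

    public static void main(String[] args) {
        // what OR is OR a OR banana
        for (int[] hit : search(new int[] {0, 1}, new int[] {0, 1, 2}, new int[] {2}, new int[] {2})) {
            System.out.println(hit[0] + "×" + hit[1]);
        }
    }
}
```

Every posting causes one heap operation over at most q cursors, which is where the p log q term in the complexity table comes from.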
  29. 29. Term-at-time search (see Appendix)
  30. 30. Doc-at-time vs term-at-time. Complexity: O(p log q + n log k) vs O(p + n log k). Memory: q + k vs n
  31. 31. q=village operations years disaster visit etcmap seventieth peneplains tussock sir memory character campaign author public wonder forker middy vocalize enable race object signal symptom deputy where typhous rectifiable polygamous originally look generation ultimately reasonably ratio numb apposing enroll manhood problem suddenly definitely corp event material affair diploma would dimout speech notion engine artist hotel text field hashed rottener impeding i cricket virtually valley sunday rock come observes gallnuts vibrantly prize involve
  32. 32. q=village operations years disaster visit
  33. 33. q=+village +operations +years +disaster +visit
  34. 34. Conjunction (+, MUST)
  35. 35. "a": {2, 3}; "banana": {2, 3}; "is": {0, 1, 2, 3}; "it": {0, 1, 3}; "what": {0, 1, 3}. Query: what AND is AND a AND it
  36. 36. "a": {2, 3}; "banana": {2, 3}; "is": {0, 1, 2, 3}; "it": {0, 1, 3}; "what": {0, 1, 3}
  40. 40. "a": {2, 3}; "banana": {2, 3}; "is": {0, 1, 2, 3}; "it": {0, 1, 3}; "what": {0, 1, 3}. Collector: 3×4
  41. 41. http://www.flickr.com/photos/fatniu/184615348/
  42. 42. Ω(n q + n log k)
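The conjunction slides above can be illustrated with a leapfrog intersection: each list in turn is advanced to the current candidate docID, and a document matches once all lists agree on it. This is a sketch with hypothetical names; it advances by linear scan, whereas a real index would use skip data.

```java
import java.util.ArrayList;
import java.util.List;

final class LeapfrogSketch {
    /** Returns the doc IDs present in every one of the sorted postings lists. */
    static List<Integer> intersect(int[]... lists) {
        List<Integer> hits = new ArrayList<>();
        int n = lists.length;
        if (n == 0) {
            return hits;
        }
        int[] pos = new int[n];
        int candidate = 0; // smallest docID that may still match
        int agreed = 0;    // consecutive lists currently positioned on the candidate
        int i = 0;
        while (true) {
            // Advance list i to its first doc >= candidate.
            while (pos[i] < lists[i].length && lists[i][pos[i]] < candidate) {
                pos[i]++;
            }
            if (pos[i] == lists[i].length) {
                return hits; // one list is exhausted, so no further matches exist
            }
            int doc = lists[i][pos[i]];
            if (doc != candidate) {
                candidate = doc; // this list leaps ahead; the others must catch up
                agreed = 0;
            }
            agreed++;
            if (agreed == n) {
                hits.add(candidate);
                candidate++;
                agreed = 0;
            }
            i = (i + 1) % n;
        }
    }

    public static void main(String[] args) {
        // what AND is AND a AND it
        System.out.println(intersect(
            new int[] {0, 1, 3}, new int[] {0, 1, 2, 3}, new int[] {2, 3}, new int[] {0, 1, 3}));
    }
}
```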
  43. 43. Wrap-up: doc-at-time vs term-at-time; leapfrog
  44. 44. Custom Queries (http://nlp.stanford.edu/IR-book/)
  45. 45. Custom Queries: HelloWorld; Deeply Branched vs Flat; Steadiness Problem; minShouldMatch Performance Problem; Filtering Performance Problem
  46. 46. "silver" "jeans" "dress"; silver jeans dress. Note: "foo bar" is not a phrase query, just a string
  47. 47. "silver" "jeans" "dress"; "silver jeans dress"; silver jeans dress
  48. 48. "silver" "jeans" "dress"; "silver jeans dress"; "silver jeans" "dress"; "silver" "jeans dress"; silver jeans dress
  49. 49. "silver" "jeans" "dress"; "silver jeans dress"; "silver jeans" "dress"; "silver" "jeans dress"; "silver" "dress"; "silver jeans" "jeans"; "silver jeans"; "jeans" "dress"; silver jeans dress. Note: "foo bar" is not a phrase query, just a string
  50. 50. boolean verifyMatch() { int sumLength = 0; for (Scorer child : getChildren()) { if (child.docID() == docID()) { TermQuery tq = child.weight.query; sumLength += tq.term.text.length; } } return sumLength >= expectedLength; }
  51. 51. Deeply Branched vs Flat
  52. 52. (+"silver jeans" +"dress") ORmax (+"silver jeans dress") ORmax (+"silver" +((+"jeans" +"dress") ORmax +"jeans dress")). ORmax is DisjunctionMaxQuery
  55. 55. ("silver jeans" "dress") ORmax ("silver jeans dress") ORmax ("silver" (("jeans" "dress") ORmax "jeans dress")). ORmax is DisjunctionMaxQuery
  56. 56. B:"silver jeans dress" ORmax T:"silver jeans dress" ORmax S:"silver jeans dress"; B:"silver" ORmax T:"silver" ORmax S:"silver" +; B:"jeans dress" ORmax T:"jeans dress" ORmax S:"jeans dress" +; ORmax ORmax ORmax; B:"silver jeans" ORmax T:"silver jeans" ORmax S:"silver jeans" +; B:"dress" ORmax T:"dress" ORmax S:"dress" +; B:"jeans" ORmax T:"jeans" ORmax S:"jeans" +; B:"dress" ORmax T:"dress" ORmax S:"dress" +. Legend: B = BRAND, T = TYPE, S = STYLE
  57. 57. B:"silver" T:"silver" S:"silver"; B:"jeans" T:"jeans" S:"jeans"; B:"dress" T:"dress" S:"dress"; B:"silver jeans" T:"silver jeans" S:"silver jeans"; B:"silver jeans dress" T:"silver jeans dress" S:"silver jeans dress"; B:"jeans dress" T:"jeans dress" S:"jeans dress"
  58. 58. Steadiness problem (AFAIK 3.x only)
  59. 59. {1, 3, 7, 10, 27, 30, ..}; {3, 5, 10, 27, 32, ..}; {2, 3, 27, 31, ..}; {..., 30, 37, ..}; 33 203 30 30; {..., 30, 31, 32, ..}; {..., 20, 27, 32, ..}
  60. 60. {1, 3, 7, 10, 27, 30, ..}; {3, 5, 10, 27, 32, ..}; {2, 3, 27, 31, ..}; {..., 30, 37, ..}; 57 2027 30 30; {..., 30, 31, 32, ..}; {..., 20, 27, 32, ..}; 3 docID=3.x
  61. 61. straight jeans; silver jeans; silver jeans straight; jeans; silver; minShouldMatch=2; straight silver jeans
  62. 62. int nextDoc() { while (true) { while (subScorers[0].docID() == doc) { if (subScorers[0].nextDoc() != NO_MORE_DOCS) { heapAdjust(0); } else { .... } } ... if (nrMatchers >= minimumNrMatchers) { break; } } return doc; } (org.apache.lucene.search.DisjunctionSumScorer)
  63. 63. {1, 3, 7, 10, 27, 30, ..}; {3, 5, 10, 27, 32, ..}; {20, 27, 31, ..}; {30, 37, ..}; mm=3
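A sketch of a heap-based disjunction with a minShouldMatch threshold, in the spirit of the `nextDoc()` excerpt on slide 62. The names are hypothetical, and the postings are the finite prefixes shown on slide 63 (the elided tails are simply left out). Note that the sketch visits every posting no matter how high the threshold is, which appears to be the performance concern the slides raise.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class MinShouldMatchSketch {
    /** Returns doc IDs that occur in at least minShouldMatch of the sorted lists. */
    static List<Integer> search(int minShouldMatch, int[]... lists) {
        int[] pos = new int[lists.length];
        // Heap of list indexes, ordered by the docID each list currently points at.
        PriorityQueue<Integer> heap = new PriorityQueue<>(
            (a, b) -> Integer.compare(lists[a][pos[a]], lists[b][pos[b]]));
        for (int i = 0; i < lists.length; i++) {
            if (lists[i].length > 0) {
                heap.add(i);
            }
        }
        List<Integer> hits = new ArrayList<>();
        while (!heap.isEmpty()) {
            int top = heap.peek();
            int doc = lists[top][pos[top]];
            int matchers = 0;
            // Count and advance every list positioned on the current doc.
            while (!heap.isEmpty() && lists[heap.peek()][pos[heap.peek()]] == doc) {
                int i = heap.poll();
                matchers++;
                if (++pos[i] < lists[i].length) {
                    heap.add(i);
                }
            }
            if (matchers >= minShouldMatch) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(search(3,
            new int[] {1, 3, 7, 10, 27, 30},
            new int[] {3, 5, 10, 27, 32},
            new int[] {20, 27, 31},
            new int[] {30, 37}));
    }
}
```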
  68. 68. Filtering
  69. 69. RANDOM_ACCESS_FILTER_STRATEGY; LEAP_FROG_FILTER_FIRST_STRATEGY; LEAP_FROG_QUERY_FIRST_STRATEGY; QUERY_FIRST_FILTER_STRATEGY
  70. 70. minShouldMatch meets Filters
  71. 71. http://localhost:8983/solr/collection1/select?q={!cache=false}village AND village operations years disaster visit etcmap seventieth peneplains tussock sir memory character campaign author public wonder forker middy vocalize enable race object signal symptom deputy where typhous rectifiable polygamous originally look generation ultimately reasonably ratio numb apposing enroll manhood problem suddenly definitely corp event material affair diploma would dimout speech notion engine artist hotel text field hashed rottener impeding i cricket virtually valley sunday rock come observes gallnuts vibrantly prize involve explanation module&qf=text_all&defType=edismax&mm=32&fq= id:yes_49912894 id:nurse_30134968
  72. 72. CONFERENCE PARTY. The Tipsy Crow: 770 5th Ave. Starts after Stump The Chump. Your conference badge gets you in the door. TOMORROW: Breakfast starts at 7:30. Keynotes start at 8:30. CONTACT: Mikhail Khludnev, mkhludnev@griddynamics.com, http://goo.gl/7LJFi
  73. 73. Appendixes: Term-at-time search in Lucene/Solr; Derivation of the search complexity; Match Spotting; Drill Sideways Facets
  74. 74. Appendix A: Term-at-time Search in Lucene
  75. 75. "a": {2}; "banana": {2}; "is": {0, 1, 2}; "it": {0, 1, 2}; "what": {0, 1}. Query: what OR is OR a OR banana
  76. 76. Accumulator: ... 0×1 ... 1×1 ... "a": {2}; "banana": {2}; "is": {0, 1, 2}; "it": {0, 1, 2}; "what": {0, 1}
  77. 77. Accumulator: ... 0×2 ... 1×2 ... 2×1 ... "a": {2}; "banana": {2}; "is": {0, 1, 2}; "it": {0, 1, 2}; "what": {0, 1}
  78. 78. "a": {2}; "banana": {2}; "is": {0, 1, 2}; "it": {0, 1, 2}; "what": {0, 1}. Accumulator: ... 0×2 ... 1×2 ... 2×2 ...
  79. 79. "a": {2}; "banana": {2}; "is": {0, 1, 2}; "it": {0, 1, 2}; "what": {0, 1}. Accumulator: ... 0×2 ... 1×2 ... 2×3 ...
  80. 80. Accumulator: ... 0×2 ... 1×2 ... 2×3 ... Collector: 2×3, 0×2, 1×2. "a": {2}; "banana": {2}; "is": {0, 1, 2}; "it": {0, 1, 2}; "what": {0, 1}
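The accumulator frames above can be reproduced with a short term-at-time sketch: each term's postings list is consumed in full before the next one, and per-document scores build up in an array indexed by docID. The names are hypothetical, and sizing the array by the number of documents is an assumption for illustration.

```java
final class TermAtATimeSketch {
    /** Returns an array where element d is the number of query terms matching doc d. */
    static int[] accumulate(int maxDoc, int[]... termPostings) {
        int[] accumulator = new int[maxDoc];
        for (int[] postings : termPostings) { // one whole term at a time
            for (int doc : postings) {
                accumulator[doc]++;
            }
        }
        return accumulator;
    }

    public static void main(String[] args) {
        // what OR is OR a OR banana, processed in that order as on the slides
        int[] scores = accumulate(3,
            new int[] {0, 1}, new int[] {0, 1, 2}, new int[] {2}, new int[] {2});
        for (int doc = 0; doc < scores.length; doc++) {
            System.out.println(doc + "×" + scores[doc]);
        }
    }
}
```

The accumulator holds a slot per candidate document rather than per query term, which matches the memory column of the comparison table.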
  81. 81. BooleanScorer2
  82. 82. ×1. "a": {2}; "banana": {2}; "is": {0, 1, 2}; "it": {0, 1, 2}; "what": {0, 1}. Hashtable[2]; org.apache.lucene.search.BooleanScorer; ×1; 0 1; chunk
  83. 83. ×2. "a": {2}; "banana": {2}; "is": {0, 1, 2}; "it": {0, 1, 2}; "what": {0, 1}. org.apache.lucene.search.BooleanScorer; ×2; 0 1; chunk
  84. 84. org.apache.lucene.search; Collector: 0×2, 1×2; ×2 ×2; 0 1. "a": {2}; "banana": {2}; "is": {0, 1, 2}; "it": {0, 1, 2}; "what": {0, 1}
  85. 85. org.apache.lucene.search; Collector: 0×2, 1×2; ×1; 0 1. "a": {2}; "banana": {2}; "is": {0, 1, 2}; "it": {0, 1, 2}; "what": {0, 1}
  86. 86. org.apache.lucene.search; Collector: 0×2, 1×2; ×2; 0 1. "a": {2}; "banana": {2}; "is": {0, 1, 2}; "it": {0, 1, 2}; "what": {0, 1}
  87. 87. org.apache.lucene.search; Collector: 0×2, 1×2; ×3; 0 1. "a": {2}; "banana": {2}; "is": {0, 1, 2}; "it": {0, 1, 2}; "what": {0, 1}
  88. 88. org.apache.lucene.search; Collector: 2×3, 0×2, 1×2; ×3; 0 1. "a": {2}; "banana": {2}; "is": {0, 1, 2}; "it": {0, 1, 2}; "what": {0, 1}
  89. 89. if (collector.acceptsDocsOutOfOrder() && topScorer && required.size() == 0 && minNrShouldMatch == 1) { new BooleanScorer // term-at-time } else { new BooleanScorer2 // doc-at-time }
  90. 90. Linked Open Hash [2K]; ×1 ×1 ×5 ×2 ×2; 0 1 2 3 4 5 6 7; ×3
  91. 91. Collector: DocSetCollector, TopDocsCollector (TopFieldCollector, TopScoreDocCollector)
  92. 92. long[952045] = { 0, 0, 0, 0, 2050, 0, 0, 8, 0, 0, 0, ... }; int[2079] = {4, 12, 45, 67, 103, 673, 5890, 34103, ...}; int[100] = {8947, 7498, 1, 230, 2356, 9812, 167, ....}. DocSet or DocList?
  93. 93. DocList/TopDocs vs DocSet. Size: k (numHits/rows) vs N (maxDocs). Ordered by: score or field vs docID. Out-of-order collecting: allows* vs almost could allow (No)
  94. 94. ?×4; 6×4, 9×5, 2×4, 2×7, 7×9, 1×9
  95. 95. http://www.flickr.com/photos/jbagley/4303976811/sizes/o/
  96. 96. class OutOfOrderTopScoreDocCollector { boolean acceptsDocsOutOfOrder() { return true; } .. void collect(int doc) { float score = scorer.score(); ... if (score == pqTop.score && doc > pqTop.doc) { ... } } }
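The collector excerpt above hints at why an out-of-order collector needs an explicit docID tie-break. A minimal sketch of a top-k collector built on a min-heap follows. The class and method names are hypothetical, and the rule "on equal scores the smaller docID wins" is an assumption read off the `doc > pqTop.doc` check on the slide.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class TopKCollectorSketch {
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Weakest hit first: the lower score loses; on equal scores the larger docID loses.
    private static final Comparator<Hit> WEAKEST_FIRST = (a, b) ->
        a.score != b.score ? Float.compare(a.score, b.score) : Integer.compare(b.doc, a.doc);

    private final int k;
    private final PriorityQueue<Hit> heap = new PriorityQueue<>(WEAKEST_FIRST);

    TopKCollectorSketch(int k) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be positive");
        }
        this.k = k;
    }

    /** Accepts hits in any docID order and keeps only the k strongest. */
    void collect(int doc, float score) {
        Hit hit = new Hit(doc, score);
        if (heap.size() < k) {
            heap.add(hit);
        } else if (WEAKEST_FIRST.compare(hit, heap.peek()) > 0) {
            heap.poll(); // evict the current weakest hit
            heap.add(hit);
        }
    }

    /** Returns the retained hits, strongest first. */
    List<Hit> topDocs() {
        List<Hit> result = new ArrayList<>(heap);
        result.sort(WEAKEST_FIRST.reversed());
        return result;
    }
}
```

With the tie-break in place, feeding the same hits in docID order or in reverse order yields the same top-k list.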
  97. 97. Appendix B: Derivation of the Search Complexity
  98. 98. 1×9, 7×9, 2×7, 2×5, 9×5, 6×4; ... ≤4 ...; k; n
  99. 99. http://en.wikipedia.org/wiki/Binary_heap
  100. 100. 6×4; log k; 9×5, 2×4, 2×7, 7×9, 1×9; ... ≤4 ...; n
  101. 101. q; p. "a": {2}; "banana": {2}; "is": {0, 1, 2}; "it": {0, 1, 2}; "what": {0, 1}. Query: what OR is OR a OR banana
  102. 102. Doc-at-time vs term-at-time. Complexity: O(p + n log k) (term-at-time). Memory:
  103. 103. q; p. "a": {2}; "banana": {2}; "is": {0, 1, 2}; "it": {0, 1, 2}; "what": {0, 1}. Query: what OR is OR a OR banana; 11 22
  104. 104. Doc-at-time vs term-at-time. Complexity: O(p log q + n log k) vs O(p + n log k). Memory:
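One hedged reading of the two complexity bounds above. The slides label the quantities but do not define them in text, so the meanings here are assumptions: p is the total number of postings across the query terms, q the number of query terms, n the number of matching documents, and k the number of top hits kept by the collector.

```latex
\begin{aligned}
T_{\text{doc-at-time}}  &= O(p \log q) + O(n \log k)
  && \text{each of the } p \text{ postings adjusts a heap of } q \text{ cursors;}\\
  &&& \text{each of the } n \text{ hits is offered to a heap of size } k \\
T_{\text{term-at-time}} &= O(p) + O(n \log k)
  && \text{each posting is one constant-time accumulator update;}\\
  &&& \text{the } n \text{ accumulated hits then pass through the same heap}
\end{aligned}
```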
  105. 105. Appendix C: Custom Queries, Match Spotting (http://nlp.stanford.edu/IR-book/)
  106. 106. BRAND:"silver jeans" TYPE:"dress" STYLE:"white"; BRAND:"alfani" TYPE:"dress" STYLE:"silver","jeans"; BRAND:"chaloree" TYPE:"dress" STYLE:"silver"; BRAND:"style&co" TYPE:"jeans dress" STYLE:"silver"; BRAND:"silver jeans" TYPE:"dress" STYLE:"black"; BRAND:"silver jeans" TYPE:"dress" STYLE:"white"; BRAND:"silver jeans" TYPE:"jacket" STYLE:"black"; BRAND:"angie" TYPE:"dress" STYLE:"silver","jeans"; BRAND:"chaloree" TYPE:"jeans dress" STYLE:"silver"; BRAND:"silver jeans" TYPE:"dress" STYLE:"blue"; BRAND:"dotty" TYPE:"dress" STYLE:"silver","jeans"; BRAND:"chaloree" STYLE:"jeans" "dress"
  107. 107. BRAND:"silver jeans" TYPE:"dress" STYLE:"white"; BRAND:"alfani" TYPE:"dress" STYLE:"silver","jeans"; BRAND:"chaloree" TYPE:"dress" STYLE:"silver"; BRAND:"style&co" TYPE:"jeans dress" STYLE:"silver"; BRAND:"silver jeans" TYPE:"dress" STYLE:"black"; BRAND:"silver jeans" TYPE:"dress" STYLE:"white"; BRAND:"silver jeans" TYPE:"jacket" STYLE:"black"; BRAND:"angie" TYPE:"dress" STYLE:"silver","jeans"; BRAND:"chaloree" TYPE:"jeans dress" STYLE:"silver"; BRAND:"silver jeans" TYPE:"dress" STYLE:"blue"; BRAND:"dotty" TYPE:"dress" STYLE:"silver","jeans"; BRAND:"chaloree" STYLE:"jeans" "dress". Query: silver jeans dress
  109. 109. BRAND:"silver jeans" TYPE:"dress"; TYPE:"dress" STYLE:"silver","jeans"; TYPE:"jeans dress" STYLE:"silver"; BRAND:"silver jeans" TYPE:"dress"; BRAND:"silver jeans" TYPE:"dress"; TYPE:"dress" STYLE:"silver","jeans"; TYPE:"jeans dress" STYLE:"silver"; BRAND:"silver jeans" TYPE:"dress"; TYPE:"dress" STYLE:"silver","jeans"
  111. 111. BRAND:"silver jeans" TYPE:"dress" (4); TYPE:"dress" STYLE:"silver","jeans"; TYPE:"jeans dress" STYLE:"silver"; TYPE:"dress" STYLE:"silver","jeans"; TYPE:"jeans dress" STYLE:"silver"; TYPE:"dress" STYLE:"silver","jeans"
  113. 113. BRAND:"silver jeans" TYPE:"dress" (4); TYPE:"dress" STYLE:"silver","jeans" (3); TYPE:"jeans dress" STYLE:"silver"; TYPE:"jeans dress" STYLE:"silver"
  114. 114. BRAND:"silver jeans" TYPE:"dress" (4); TYPE:"dress" STYLE:"silver","jeans" (3); TYPE:"jeans dress" STYLE:"silver" (2)
  115. 115. BRAND:"silver jeans" TYPE:"dress" (4); TYPE:"dress" STYLE:"silver","jeans" (3); TYPE:"jeans dress" STYLE:"silver" (2). Query: silver jeans dress
  117. 117. Appendix D: Drill Sideways Facets
  118. 118. +CATEGORY: Denim +FIT: Straight +WASH: Dark&B
  119. 119. +CATEGORY: Denim +FIT: Straight +WASH: Dark&B; +CATEGORY: Denim +WASH: Dark&B
  120. 120. +CATEGORY: Denim +FIT: Straight +WASH: Dark&B; +CATEGORY: Denim +WASH: Dark&B; +CATEGORY: Denim +FIT: Straight
  121. 121. +CATEGORY: Denim FIT: Straight WASH: Dark&Black ... / minShouldMatch = Ndrilldowns - 1
  122. 122. +CAT: Denim FIT: Straight WASH: Dark
  123. 123. +CAT: Denim FIT: Straight WASH: Dark; totalHits: 3; near miss: 2; near miss: 2
  126. 126. Doc at time: base query is highly selective
  127. 127. +CAT:D.. {1, 7, 9, 15}; FIT:S.. {2, 7, 8, 9, 10, 12}; WASH:D.. {2, 7, 11, 13, 15}; ...
  130. 130. +CAT:D.. {1, 7, 9, 15}; FIT:S.. {2, 7, 8, 9, 10, 12}; WASH:D.. {2, 7, 11, 13, 15}; ... TopDocsCollector
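The drill-sideways idea on the slides above (required base clause, optional drill-down clauses, minShouldMatch of N-1) can be sketched by classifying each base document as a full hit or as a near miss for the one drill-down it fails. The names are hypothetical; the postings are the ones shown on the slide with the elided tails left out.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class DrillSidewaysSketch {
    /**
     * Returns counts where element 0 is the number of full hits and element 1 + d
     * is the number of base docs that fail only drill-down d (near misses).
     */
    static int[] count(int[] base, int[]... drillDowns) {
        List<Set<Integer>> sets = new ArrayList<>();
        for (int[] postings : drillDowns) {
            Set<Integer> set = new HashSet<>();
            for (int doc : postings) {
                set.add(doc);
            }
            sets.add(set);
        }
        int[] counts = new int[1 + drillDowns.length];
        for (int doc : base) { // the base clause is required, so only base docs qualify
            int misses = 0;
            int missed = -1;
            for (int d = 0; d < sets.size(); d++) {
                if (!sets.get(d).contains(doc)) {
                    misses++;
                    missed = d;
                }
            }
            if (misses == 0) {
                counts[0]++;          // matches every drill-down: a real hit
            } else if (misses == 1) {
                counts[1 + missed]++; // near miss: counts toward the missed facet only
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = count(
            new int[] {1, 7, 9, 15},            // +CAT
            new int[] {2, 7, 8, 9, 10, 12},     // FIT
            new int[] {2, 7, 11, 13, 15});      // WASH
        System.out.println("hits=" + counts[0]
            + " nearMissFit=" + counts[1] + " nearMissWash=" + counts[2]);
    }
}
```

Driving the loop from the base postings corresponds to the case named on slide 126, where the base query is the selective part.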
  136. 136. Term at time: drilldown queries are highly selective
  137. 137. +CAT:D.. {1, 7, 9, 15}; FIT:S.. {2, 7, 8, 9, 10, 12}; WASH:D.. {2, 7, 11, 13, 15}; ... hits:1 miss:Fit; hits:1 miss:Fit; hits:1 miss:Fit; hits:1 miss:Fit; hits:1 miss:Fit; 1 2 7 11 12 13 15; 10; 8 9 ...
  138. 138. +CAT:D.. {1, 7, 9, 15}; FIT:S.. {2, 7, 8, 9, 10, 12}; WASH:D.. {2, 7, 11, 13, 15}; ... hits:1 miss:Fit; hits:1 miss:Fit; hits:1 miss:Fit; hits:2 miss:no; 1 2 7 11 12 13 15; 10; hits:1 miss:Wash; hits:1 miss:Wash; 8 9 ...; hits:1 miss:Wash; hits:2 miss:no; hits:1 miss:Wash
  139. 139. +CAT:D.. {1, 7, 9, 15}; FIT:S.. {2, 7, 8, 9, 10, 12}; WASH:D.. {2, 7, 11, 13, 15}; ... hits:1 miss:Wash,Cat; hits:1 miss:Fit,Cat; hits:1 miss:Wash,Cat; hits:1 miss:Fit,Cat; hits:2 miss:Fit; hits:2 miss:Cat; 1 2 7 11 12 13 15; 10; hits:1 miss:Wash,Cat; 8 9 ...; hits:3 miss:; hits:2 miss:Wash
  140. 140. hits:1 miss:Wash,Cat; hits:1 miss:Fit,Cat; hits:1 miss:Wash,Cat; hits:1 miss:Fit,Cat; hits:2 miss:Fit; hits:2 miss:Cat; 1 2 7 11 12 13 15; 10; hits:1 miss:Wash,Cat; 8 9 ...; hits:3 miss:no; hits:2 miss:Wash
  141. 141. hits:2 miss:Fit; 1 2 7 11 12 13 15; 10; 8 9 ...; hits:3 miss:no; hits:2 miss:Wash; TopDocsCollector
  142. 142. TopDocsCollector; hits:2 miss:Fit; 1 2 7 11 12 13 15; 10; 8 9 ...; hits:3 miss:no; hits:2 miss:Wash
  144. 144. Overflow
  145. 145. UML (http://www.flickr.com/photos/kristykay/2922670979/lightbox/)
  146. 146. Custom Queries... hm, what for?
  147. 147. qf=STYLE TYPE; denim dress
  148. 148. qf=STYLE TYPE; denim dress; DisjunctionMaxQuery(((STYLE:denim OR TYPE:denim) | (STYLE:dress OR TYPE:dress)))
  149. 149. qf=STYLE TYPE; denim dress; (DisjunctionMaxQuery((STYLE:denim | TYPE:denim))) OR (DisjunctionMaxQuery((STYLE:dress | TYPE:dress)))
