Dawid Weiss- Finite state automata in lucene

5,684 views

Published on

Finite state automata and transducers made it into Lucene fairly recently, but already show a very promising impact on search performance. This data structure is rarely exploited because it is commonly (and unfairly) associated with high complexity. During the talk, I will try to show that automata and transducers are in fact very simple, their construction can be very efficient (memory and time-wise) and their field of applications very broad.

Dawid Weiss- Finite state automata in lucene

  1. 1. Finite State Automata in Dawid WEISS
  2. 2. . Dawid Weiss . 20+ years of coding 10 years assembly only . Academia & Research PhD in Information Retrieval, PUT Open source Carrot2 , HPPC, Lucene, … Industry & Business Carrot Search s.c.. .
  3. 3. Talk outlineState machines (automata)FSAs, DFAs, FSTs and other XXXs.Use cases in Lucene and SolrSuggester. FuzzySearch. Index.No API detailsStill @experimental.
  4. 4. (Non)? Deterministic FiniteState (Automata|Machines)
  5. 5. HashSethash → slot → value0x29384d34 → lucene0xde3e3354 → lucid0x00000666 → lucifer
  6. 6. HashSethash → slot → value0x29384d34 → lucene0xde3e3354 → lucid0x00000666 → luciferFSA (deterministic) l u c e n e i d r f e
  7. 7. HashSethash → slot → value0x29384d34 → lucene0xde3e3354 → lucid0x00000666 → luciferFSA (deterministic) l u c e n e i dexists(sequence) r oor(pre x) fceil(pre x) e
  8. 8. k i l lb l deterministic, non-minimal i l
  9. 9. k i l lb l deterministic, non-minimal i lk i l l deterministic, minimalb
  10. 10. k i l lb l deterministic, non-minimal i lk i l l deterministic, minimalbk i l l non-deterministic, i l non-minimalb
  11. 11. (Sorted)Maplucene → 1lucid → 2lucifer → 666
  12. 12. (Sorted)Maplucene → 1lucid → 2lucifer → 666FST (transducer) l u c e n e|1 i d|2 r|666 f e
  13. 13. (Sorted)Maplucene → 1lucid → 2lucifer → 666FST (transducer) l|1 u c e n e i|1 d r f|664 e
  14. 14. NFSAs andRegular expressions a a e1e2 e1 e1Determinization e+ estates explosion, not always possibleBacktrackingrecursion explosion e* e e? e
  15. 15. a?nan
  16. 16. a?nann=3 → a?a?a?aaa
  17. 17. a?nan n=3 → a?a?a?aaaSource: Russ Cox, Regular Expression Matching Can Be Simple And Fast (re2).
  18. 18. 35000 30000 25000Time [ms] 20000 15000 10000 5000 0 0 5 10 15 20 25 30 Time of matching an for pattern a?n an , depending on n. Java 1.6, modern hardware.
  19. 19. Linear-time, minimal, deterministicFSA constructionLinear algorithm from sorted inputby Daciuk, Mihov, et al.Active pathstates that still can changeStates dictionarynodes that will never change
  20. 20. 1) common AP pre x2) freeze the rest of AP3) add suffix → new APlucene
  21. 21. 1) common AP pre x2) freeze the rest of AP3) add suffix → new AP l u c e n elucid
  22. 22. 1) common AP pre x2) freeze the rest of AP3) add suffix → new AP l u c e n e i d
  23. 23. 1) common AP pre x2) freeze the rest of AP3) add suffix → new AP l u c e n e i dlucifer
  24. 24. 1) common AP pre x2) freeze the rest of AP3) add suffix → new AP l u c e n e i d f e r
  25. 25. 1) common AP pre x2) freeze the rest of AP3) add suffix → new AP l u c e n e i d r f e
  26. 26. FS(A|T)s in (Lucene|Solr)
  27. 27. Automata inLucene|Solrorg.apache.lucene.util.automaton.*partial port of brics, FuzzyQuery, AutomatonTermsEnumorg.apache.lucene.util.automaton.fst.FSTFSA and FSTs from sorted data, suggester, indexes
  28. 28. org.apache.lucene.util.automaton.fst.*FSA representationArc-based, not state-basedMoore vs. Mealy. Compact vs. intuitive Input: abc, bd, bde. a b c a b c b d b e d e d
  29. 29. org.apache.lucene.util.automaton.fst.*FSA representation a b cArc-based, not state-based s3 s2 s1Moore vs. Mealy. Compact vs. intuitive b d e s4 s5Next-state chainingrequires unusual tricks during construction s1 s2 s5 s4 s3 cFL bL eFL dL a bL s1 s2 s5 s4 s3 cFL bL eFL dL bLN a
  30. 30. org.apache.lucene.util.automaton.fst.*FSA representation a b cArc-based, not state-based s3 s2 s1Moore vs. Mealy. Compact vs. intuitive b d e s4 s5Next-state chainingrequires unusual tricks during construction s1 s2 s5 s4 s3Everything in a byte[] cFL bL eFL dL a bLtraversals-ready, memory-efficient s1 s2 s5 s4 s3 cFL bL eFL dL bLN a
  31. 31. org.apache.lucene.util.automaton.fst.*FSA representation a b cArc-based, not state-based s3 s2 s1Moore vs. Mealy. Compact vs. intuitive b d e s4 s5Next-state chainingrequires unusual tricks during construction s1 s2 s5 s4 s3Everything in a byte[] cFL bL eFL dL a bLtraversals-ready, memory-efficientDual transition storage formatlookup: bsearch or linear scan s1 s2 s5 s4 s3 cFL bL eFL dL bLN a
  32. 32. Input size Compressed size (MB)Input MB Terms Lucene morf. gzipWikipedia t.index 481 38 092 045 258 164 149Polish in . 162 3 672 200 3.1 1.7 15.4 .
  33. 33. Use Cases:Solrs Autocomplete
  34. 34. SolrsSuggestersDesign choicessort order (alpha, score), pre x vs. spelling, boost exact matches?Weightsterm→weight, lookup(term, onlyMorePopular)org.apache.solr.spelling.suggest.LookupJaspellLookup, TSTLookup, FSTLookup
  35. 35. flour|3 four|4 fourier|3 furious|2 . . Take 1 ..
  36. 36. flour|3 four|4 →fou* fourier|3 furious|2 o u l i e r | f o u r | 3 u 4 r 2 i o u s | Find pre x. Depth-in traversal for completions. PQ on score|alpha . . Take 1 ..
  37. 37. 2furious 3flour 3fourier 4four . . Take 2 ..
  38. 38. 2furious 3flour →fou* 3fourier 4four u f r i o u 2 s l 3 f u r i e r o 4 u f o From score roots, until N collected. Find pre x. Depth-in traversal for completions, stop if N collected. Find/boost exact match. . . Take 2 ..
  39. 39. 2furious 5urious|furious 5rious|furious 5ious|furious 5ous|furious 5us|furious 5s|furious 3flour … . Take 3 (in xes) . ..
  40. 40. 2 i o o u i r s r s | f 5 u s u 4 o u 6 r u r r | . i o u s f 7 u r o l f o i e 3 l u r e f o | f r | r i l o u e i o | | r i u r u.
  41. 41. Constant time lookups! Regardless of the terms dictionary size. Regardless of pre x length.
  42. 42. Constant time lookups! Regardless of the terms dictionary size. Regardless of pre x length. Exact matches only. Static snapshot (not incremental). Discretized weights.
  43. 43. Top50KWiki.utf8, 676 KB, 50 000 terms Jaspell TST FST . RAM [B] . 7 869 415 . 7 914 524 300 .175 queries per second,. . . tpq .PREFIX [100-200] . 458 . 966 . 742 . PREFIX [6-9] . 330 . 228 . 659 . PREFIX [2-4] . 126 . 29 . . 501
  44. 44. Summary
  45. 45. Summary and ConclusionsAutomatacompact, powerful, efficient data structureLucene/Solr bene tsbehind the scenes, but spreading: index, queries, suggestersAPI in Lucene…is shaped right now, still @experimental
  46. 46. AcknowledgementMichael McCandlessRobert Muircommitter: .+
  47. 47. dawid.weiss@carrotsearch.com

×