Your SlideShare is downloading. ×
0
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Dawid Weiss- Finite state automata in lucene

3,166

Published on

Finite state automata and transducers made it into Lucene fairly recently, but already show a very promising impact on search performance. This data structure is rarely exploited because it is …

Finite state automata and transducers made it into Lucene fairly recently, but already show a very promising impact on search performance. This data structure is rarely exploited because it is commonly (and unfairly) associated with high complexity. During the talk, I will try to show that automata and transducers are in fact very simple, their construction can be very efficient (memory and time-wise) and their field of applications very broad.

0 Comments
11 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
3,166
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
64
Comments
0
Likes
11
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. Finite State Automata in Dawid WEISS
  • 2. . Dawid Weiss . 20+ years of coding 10 years assembly only . Academia & Research PhD in Information Retrieval, PUT Open source Carrot2 , HPPC, Lucene, … Industry & Business Carrot Search s.c.. .
  • 3. Talk outlineState machines (automata)FSAs, DFAs, FSTs and other XXXs.Use cases in Lucene and SolrSuggester. FuzzySearch. Index.No API detailsStill @experimental.
  • 4. (Non)? Deterministic FiniteState (Automata|Machines)
  • 5. HashSethash → slot → value0x29384d34 → lucene0xde3e3354 → lucid0x00000666 → lucifer
  • 6. HashSethash → slot → value0x29384d34 → lucene0xde3e3354 → lucid0x00000666 → luciferFSA (deterministic) l u c e n e i d r f e
  • 7. HashSethash → slot → value0x29384d34 → lucene0xde3e3354 → lucid0x00000666 → luciferFSA (deterministic) l u c e n e i dexists(sequence) r oor(pre x) fceil(pre x) e
  • 8. k i l lb l deterministic, non-minimal i l
  • 9. k i l lb l deterministic, non-minimal i lk i l l deterministic, minimalb
  • 10. k i l lb l deterministic, non-minimal i lk i l l deterministic, minimalbk i l l non-deterministic, i l non-minimalb
  • 11. (Sorted)Maplucene → 1lucid → 2lucifer → 666
  • 12. (Sorted)Maplucene → 1lucid → 2lucifer → 666FST (transducer) l u c e n e|1 i d|2 r|666 f e
  • 13. (Sorted)Maplucene → 1lucid → 2lucifer → 666FST (transducer) l|1 u c e n e i|1 d r f|664 e
  • 14. NFSAs andRegular expressions a a e1e2 e1 e1Determinization e+ estates explosion, not always possibleBacktrackingrecursion explosion e* e e? e
  • 15. a?nan
  • 16. a?nann=3 → a?a?a?aaa
  • 17. a?nan n=3 → a?a?a?aaaSource: Russ Cox, Regular Expression Matching Can Be Simple And Fast (re2).
  • 18. 35000 30000 25000Time [ms] 20000 15000 10000 5000 0 0 5 10 15 20 25 30 Time of matching an for pattern a?n an , depending on n. Java 1.6, modern hardware.
  • 19. Linear-time, minimal, deterministicFSA constructionLinear algorithm from sorted inputby Daciuk, Mihov, et al.Active pathstates that still can changeStates dictionarynodes that will never change
  • 20. 1) common AP pre x2) freeze the rest of AP3) add suffix → new APlucene
  • 21. 1) common AP pre x2) freeze the rest of AP3) add suffix → new AP l u c e n elucid
  • 22. 1) common AP pre x2) freeze the rest of AP3) add suffix → new AP l u c e n e i d
  • 23. 1) common AP pre x2) freeze the rest of AP3) add suffix → new AP l u c e n e i dlucifer
  • 24. 1) common AP pre x2) freeze the rest of AP3) add suffix → new AP l u c e n e i d f e r
  • 25. 1) common AP pre x2) freeze the rest of AP3) add suffix → new AP l u c e n e i d r f e
  • 26. FS(A|T)s in (Lucene|Solr)
  • 27. Automata inLucene|Solrorg.apache.lucene.util.automaton.*partial port of brics, FuzzyQuery, AutomatonTermsEnumorg.apache.lucene.util.automaton.fst.FSTFSA and FSTs from sorted data, suggester, indexes
  • 28. org.apache.lucene.util.automaton.fst.*FSA representationArc-based, not state-basedMoore vs. Mealy. Compact vs. intuitive Input: abc, bd, bde. a b c a b c b d b e d e d
  • 29. org.apache.lucene.util.automaton.fst.*FSA representation a b cArc-based, not state-based s3 s2 s1Moore vs. Mealy. Compact vs. intuitive b d e s4 s5Next-state chainingrequires unusual tricks during construction s1 s2 s5 s4 s3 cFL bL eFL dL a bL s1 s2 s5 s4 s3 cFL bL eFL dL bLN a
  • 30. org.apache.lucene.util.automaton.fst.*FSA representation a b cArc-based, not state-based s3 s2 s1Moore vs. Mealy. Compact vs. intuitive b d e s4 s5Next-state chainingrequires unusual tricks during construction s1 s2 s5 s4 s3Everything in a byte[] cFL bL eFL dL a bLtraversals-ready, memory-efficient s1 s2 s5 s4 s3 cFL bL eFL dL bLN a
  • 31. org.apache.lucene.util.automaton.fst.*FSA representation a b cArc-based, not state-based s3 s2 s1Moore vs. Mealy. Compact vs. intuitive b d e s4 s5Next-state chainingrequires unusual tricks during construction s1 s2 s5 s4 s3Everything in a byte[] cFL bL eFL dL a bLtraversals-ready, memory-efficientDual transition storage formatlookup: bsearch or linear scan s1 s2 s5 s4 s3 cFL bL eFL dL bLN a
  • 32. Input size Compressed size (MB)Input MB Terms Lucene morf. gzipWikipedia t.index 481 38 092 045 258 164 149Polish in . 162 3 672 200 3.1 1.7 15.4 .
  • 33. Use Cases:Solrs Autocomplete
  • 34. SolrsSuggestersDesign choicessort order (alpha, score), pre x vs. spelling, boost exact matches?Weightsterm→weight, lookup(term, onlyMorePopular)org.apache.solr.spelling.suggest.LookupJaspellLookup, TSTLookup, FSTLookup
  • 35. flour|3 four|4 fourier|3 furious|2 . . Take 1 ..
  • 36. flour|3 four|4 →fou* fourier|3 furious|2 o u l i e r | f o u r | 3 u 4 r 2 i o u s | Find pre x. Depth-in traversal for completions. PQ on score|alpha . . Take 1 ..
  • 37. 2furious 3flour 3fourier 4four . . Take 2 ..
  • 38. 2furious 3flour →fou* 3fourier 4four u f r i o u 2 s l 3 f u r i e r o 4 u f o From score roots, until N collected. Find pre x. Depth-in traversal for completions, stop if N collected. Find/boost exact match. . . Take 2 ..
  • 39. 2furious 5urious|furious 5rious|furious 5ious|furious 5ous|furious 5us|furious 5s|furious 3flour … . Take 3 (in xes) . ..
  • 40. 2 i o o u i r s r s | f 5 u s u 4 o u 6 r u r r | . i o u s f 7 u r o l f o i e 3 l u r e f o | f r | r i l o u e i o | | r i u r u.
  • 41. Constant time lookups! Regardless of the terms dictionary size. Regardless of pre x length.
  • 42. Constant time lookups! Regardless of the terms dictionary size. Regardless of pre x length. Exact matches only. Static snapshot (not incremental). Discretized weights.
  • 43. Top50KWiki.utf8, 676 KB, 50 000 terms Jaspell TST FST . RAM [B] . 7 869 415 . 7 914 524 300 .175 queries per second,. . . tpq .PREFIX [100-200] . 458 . 966 . 742 . PREFIX [6-9] . 330 . 228 . 659 . PREFIX [2-4] . 126 . 29 . . 501
  • 44. Summary
  • 45. Summary and ConclusionsAutomatacompact, powerful, efficient data structureLucene/Solr bene tsbehind the scenes, but spreading: index, queries, suggestersAPI in Lucene…is shaped right now, still @experimental
  • 46. AcknowledgementMichael McCandlessRobert Muircommitter: .+
  • 47. dawid.weiss@carrotsearch.com

×