Finite State Automata in Lucene/Solr
Dawid Weiss



Dawid Weiss

20+ years of coding
10 years assembly only

Academia & Research
PhD in Information Retrieval, PUT

Open source
Carrot2, HPPC, Lucene, …

Industry & Business
Carrot Search s.c.

Talk outline

State machines (automata)
FSAs, DFAs, FSTs and other XXXs.

Use cases in Lucene and Solr
Suggester. FuzzySearch. Index.

No API details
Still @experimental.
(Non)? Deterministic Finite
State (Automata|Machines)
HashSet
hash        → slot → value
0x29384d34         → lucene
0xde3e3354         → lucid
0x00000666         → lucifer

FSA (deterministic)
l → u → c → e → n → e
l → u → c → i → d
l → u → c → i → f → e → r
(shared prefixes are stored once)

exists(sequence)
floor(prefix)
ceil(prefix)
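
A minimal sketch of the idea, assuming a plain trie as the deterministic automaton (not minimal, and not Lucene's implementation). The class TrieFsa and its methods are hypothetical names used only for illustration. exists() follows one arc per input symbol; firstWithPrefix() is a simplified relative of ceil(prefix) that returns the smallest stored sequence starting with the prefix.

```java
import java.util.TreeMap;

/** A deterministic automaton stored as a plain (non-minimal) trie. */
final class TrieFsa {
    private static final class State {
        // Sorted arcs keep traversal order equal to the lexicographic order of sequences.
        final TreeMap<Character, State> arcs = new TreeMap<>();
        boolean isFinal;
    }

    private final State root = new State();

    void add(String sequence) {
        State s = root;
        for (char c : sequence.toCharArray()) {
            s = s.arcs.computeIfAbsent(c, k -> new State());
        }
        s.isFinal = true;
    }

    /** Follows one arc per input symbol; returns null when the path breaks. */
    private State walk(String input) {
        State s = root;
        for (int i = 0; i < input.length() && s != null; i++) {
            s = s.arcs.get(input.charAt(i));
        }
        return s;
    }

    boolean exists(String sequence) {
        State s = walk(sequence);
        return s != null && s.isFinal;
    }

    boolean hasPrefix(String prefix) {
        return walk(prefix) != null;
    }

    /** Smallest stored sequence that starts with the prefix, or null if there is none. */
    String firstWithPrefix(String prefix) {
        State s = walk(prefix);
        if (s == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(prefix);
        while (!s.isFinal) {
            // Every leaf of the trie is final, so this loop always terminates.
            out.append(s.arcs.firstKey());
            s = s.arcs.firstEntry().getValue();
        }
        return out.toString();
    }
}
```

Unlike the HashSet, the cost of a lookup depends only on the length of the input sequence, and prefix questions fall out of the same traversal.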
k   i   l    l

b            l   deterministic, non-minimal
    i   l



k
    i   l    l
                 deterministic, minimal
b


k
    i    l
             l   non-deterministic,
    i    l
                 non-minimal
b
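
To make the minimal vs. non-minimal distinction concrete, here is a small sketch that counts states. It assumes the non-minimal automaton is a plain trie and finds the minimal state count by grouping states with identical right languages (same finality, same arcs to the same groups). TrieMinimizer and its methods are hypothetical names, and this is a post-hoc count, not the construction algorithm discussed later in the talk.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class TrieMinimizer {
    static final class State {
        final TreeMap<Character, State> arcs = new TreeMap<>();
        boolean isFinal;
    }

    /** Builds a deterministic, non-minimal automaton: one branch per distinct prefix. */
    static State trieOf(String... words) {
        State root = new State();
        for (String word : words) {
            State s = root;
            for (char c : word.toCharArray()) {
                s = s.arcs.computeIfAbsent(c, k -> new State());
            }
            s.isFinal = true;
        }
        return root;
    }

    static int countTrieStates(State s) {
        int n = 1;
        for (State target : s.arcs.values()) {
            n += countTrieStates(target);
        }
        return n;
    }

    /** Number of states of the minimal DFA: one per distinct right language. */
    static int countMinimalStates(State root) {
        Map<String, Integer> classes = new HashMap<>();
        classOf(root, classes);
        return classes.size();
    }

    /** Assigns equivalent states the same class id, children first. */
    private static int classOf(State s, Map<String, Integer> classes) {
        StringBuilder signature = new StringBuilder(s.isFinal ? "F" : "-");
        for (Map.Entry<Character, State> e : s.arcs.entrySet()) {
            signature.append(e.getKey()).append(':')
                     .append(classOf(e.getValue(), classes)).append(';');
        }
        String key = signature.toString();
        Integer id = classes.get(key);
        if (id == null) {
            id = classes.size();
            classes.put(key, id);
        }
        return id;
    }
}
```

For kill and bill the trie has 9 states, while the minimal automaton needs 5: the states reached after k and after b accept the same suffix (ill), so they collapse into one.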
(Sorted)Map

lucene  → 1
lucid   → 2
lucifer → 666

FST (transducer), outputs on the last arc of each path
l → u → c → e → n → e|1
l → u → c → i → d|2
l → u → c → i → f → e → r|666

FST (transducer), outputs moved toward the root
l|1 → u → c → e → n → e
l|1 → u → c → i|1 → d
l|1 → u → c → i|1 → f|664 → e → r
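
The arc outputs above can be reproduced with a small sketch. It assumes non-negative numeric outputs that are summed along the path, and it keeps the shared part of an output as close to the root as possible, pushing the excess down when a new key arrives. OutputTrie is a hypothetical name; the structure is a trie, so suffixes are not shared as they would be in a real minimal FST.

```java
import java.util.TreeMap;

/** A trie whose arcs carry numeric outputs; the value of a key is the sum along its path. */
final class OutputTrie {
    private static final class Arc {
        final State target = new State();
        long output;
        Arc(long output) { this.output = output; }
    }

    private static final class State {
        final TreeMap<Character, Arc> arcs = new TreeMap<>();
        boolean isFinal;
        long finalOutput;
    }

    private final State root = new State();

    /** Stores key → value (value must be non-negative). */
    void put(String key, long value) {
        State state = root;
        long remaining = value;
        for (char c : key.toCharArray()) {
            Arc arc = state.arcs.get(c);
            if (arc == null) {
                // New arc: it can carry everything that is still unassigned.
                arc = new Arc(remaining);
                state.arcs.put(c, arc);
                remaining = 0;
            } else {
                // Shared arc: keep only the common part, push the excess one level down.
                long common = Math.min(arc.output, remaining);
                long excess = arc.output - common;
                if (excess > 0) {
                    arc.output = common;
                    for (Arc next : arc.target.arcs.values()) {
                        next.output += excess;
                    }
                    if (arc.target.isFinal) {
                        arc.target.finalOutput += excess;
                    }
                }
                remaining -= common;
            }
            state = arc.target;
        }
        state.isFinal = true;
        state.finalOutput = remaining;
    }

    /** Returns the stored value, or null if the key is not accepted. */
    Long get(String key) {
        State state = root;
        long sum = 0;
        for (char c : key.toCharArray()) {
            Arc arc = state.arcs.get(c);
            if (arc == null) {
                return null;
            }
            sum += arc.output;
            state = arc.target;
        }
        return state.isFinal ? sum + state.finalOutput : null;
    }

    /** Renders the path of a stored key, e.g. "l|1 u c i|1 d". */
    String describe(String key) {
        StringBuilder out = new StringBuilder();
        State state = root;
        for (char c : key.toCharArray()) {
            Arc arc = state.arcs.get(c);
            if (arc == null) {
                throw new IllegalArgumentException("No path for: " + key);
            }
            if (out.length() > 0) out.append(' ');
            out.append(c);
            if (arc.output != 0) out.append('|').append(arc.output);
            state = arc.target;
        }
        return out.toString();
    }
}
```

Inserting lucene → 1, lucid → 2 and lucifer → 666 in that order leaves 1 on the l arc, 1 on the i arc and 664 on the f arc, so 1 + 1 + 664 = 666 for lucifer.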
NFSAs and
Regular expressions

Determinization
states explosion, not always possible

Backtracking
recursion explosion

[Diagrams: NFA fragments for a, e1e2, e+, e* and e?]
a?ⁿaⁿ
n=3 → a?a?a?aaa

Source: Russ Cox, Regular Expression Matching Can Be Simple And Fast (re2).
[Plot: time of matching aⁿ for pattern a?ⁿaⁿ, depending on n. X axis: n, 0 to 30. Y axis: time [ms], 0 to 35000. Java 1.6, modern hardware.]
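
The experiment can be approximated with the sketch below. It builds the pattern a?ⁿaⁿ, matches it against aⁿ with java.util.regex (a backtracking matcher), and contrasts that with a hand-written state-set simulation of the same pattern, which does work proportional to input length times pattern length. OptionalRepeatDemo and its methods are hypothetical names, and actual timings depend on the JVM version and hardware.

```java
import java.util.regex.Pattern;

final class OptionalRepeatDemo {
    static String repeat(String s, int n) {
        StringBuilder b = new StringBuilder();
        for (int i = 0; i < n; i++) {
            b.append(s);
        }
        return b.toString();
    }

    /** Matches a^n against a?^n a^n with the JDK's backtracking regex engine. */
    static boolean matchesWithBacktracking(int n) {
        String pattern = repeat("a?", n) + repeat("a", n);
        return Pattern.matches(pattern, repeat("a", n));
    }

    /**
     * Simulates the NFA for a?^n a^n on the input by tracking every reachable
     * pattern position at once. Positions 0..n-1 are the optional a's.
     */
    static boolean matchesWithStateSet(int n, String input) {
        int length = 2 * n;
        boolean[] active = new boolean[length + 1];
        active[0] = true;
        skipOptional(active, n);
        for (int i = 0; i < input.length(); i++) {
            boolean[] next = new boolean[length + 1];
            if (input.charAt(i) == 'a') {
                for (int p = 0; p < length; p++) {
                    if (active[p]) next[p + 1] = true;
                }
            }
            skipOptional(next, n);
            active = next;
        }
        return active[length];
    }

    /** An optional element may be skipped without consuming input. */
    private static void skipOptional(boolean[] active, int n) {
        for (int p = 0; p < n; p++) {
            if (active[p]) active[p + 1] = true;
        }
    }

    public static void main(String[] args) {
        for (int n = 5; n <= 20; n += 5) {
            long start = System.nanoTime();
            boolean matched = matchesWithBacktracking(n);
            long micros = (System.nanoTime() - start) / 1000;
            System.out.println("n=" + n + " matched=" + matched + " time=" + micros + "us");
        }
    }
}
```

The state-set version never revisits an input character, which is the property Russ Cox's article argues for.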
Linear-time, minimal, deterministic
FSA construction

Linear algorithm from sorted input
by Daciuk, Mihov, et al.

Active path
states that still can change

States dictionary
nodes that will never change
1) common AP prefix
2) freeze the rest of AP
3) add suffix → new AP

Input: lucene
new AP: l → u → c → e → n → e

Input: lucid
1) common AP prefix: l → u → c
2) frozen: e → n → e
3) new AP: l → u → c → i → d

Input: lucifer
1) common AP prefix: l → u → c → i
2) frozen: d
3) new AP: l → u → c → i → f → e → r

End of input
the remaining AP (f → e → r) is frozen; lucene, lucid and lucifer end in the same final state
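
The three steps can be sketched as follows. This is a simplified illustration of incremental construction from sorted input, assuming strictly increasing input strings; it is not the Lucene code. MinimalFsaBuilder, its register map and its helper methods are hypothetical names. The register plays the role of the states dictionary: a frozen state is replaced by an equivalent registered one when such a state already exists.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class MinimalFsaBuilder {
    static final class State {
        final TreeMap<Character, State> arcs = new TreeMap<>();
        boolean isFinal;

        // Structural equality; targets are compared by identity because
        // the targets of a frozen state are already canonical.
        @Override public boolean equals(Object o) {
            if (!(o instanceof State)) return false;
            State other = (State) o;
            if (isFinal != other.isFinal || arcs.size() != other.arcs.size()) return false;
            for (Map.Entry<Character, State> e : arcs.entrySet()) {
                if (other.arcs.get(e.getKey()) != e.getValue()) return false;
            }
            return true;
        }

        @Override public int hashCode() {
            int h = isFinal ? 1 : 0;
            for (Map.Entry<Character, State> e : arcs.entrySet()) {
                h = 31 * h + 131 * e.getKey() + System.identityHashCode(e.getValue());
            }
            return h;
        }
    }

    private final Map<State, State> register = new HashMap<>(); // states dictionary
    private final List<State> activePath = new ArrayList<>();   // states that can still change
    private String previous;

    MinimalFsaBuilder() {
        activePath.add(new State()); // root
    }

    void add(String word) {
        if (previous != null && word.compareTo(previous) <= 0) {
            throw new IllegalArgumentException("Input must be sorted and unique: " + word);
        }
        String prev = previous == null ? "" : previous;
        // 1) common prefix with the active path
        int common = 0;
        int max = Math.min(word.length(), prev.length());
        while (common < max && word.charAt(common) == prev.charAt(common)) common++;
        // 2) freeze the rest of the active path
        freeze(common);
        // 3) add the suffix, which extends the active path
        for (int i = common; i < word.length(); i++) {
            State next = new State();
            activePath.get(i).arcs.put(word.charAt(i), next);
            activePath.add(next);
        }
        activePath.get(word.length()).isFinal = true;
        previous = word;
    }

    /** Freezes active states deeper than 'keep', deepest first. */
    private void freeze(int keep) {
        for (int i = activePath.size() - 1; i > keep; i--) {
            State state = activePath.remove(i);
            State canonical = register.putIfAbsent(state, state);
            if (canonical == null) canonical = state;
            activePath.get(i - 1).arcs.put(previous.charAt(i - 1), canonical);
        }
    }

    /** Freezes what is left of the active path and returns the root. No add() after this. */
    State finish() {
        freeze(0);
        return activePath.get(0);
    }

    static boolean accepts(State root, String word) {
        State s = root;
        for (int i = 0; i < word.length() && s != null; i++) s = s.arcs.get(word.charAt(i));
        return s != null && s.isFinal;
    }

    static int countStates(State root) {
        Set<State> seen = Collections.newSetFromMap(new IdentityHashMap<State, Boolean>());
        Deque<State> todo = new ArrayDeque<>();
        todo.push(root);
        while (!todo.isEmpty()) {
            State s = todo.pop();
            if (seen.add(s)) todo.addAll(s.arcs.values());
        }
        return seen.size();
    }
}
```

Each input symbol is visited a bounded number of times (once when added, once when frozen), which is where the linear behaviour comes from, assuming constant-time register lookups.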
FS(A|T)s in (Lucene|Solr)
Automata in
Lucene|Solr

org.apache.lucene.util.automaton.*
partial port of brics, FuzzyQuery, AutomatonTermsEnum

org.apache.lucene.util.automaton.fst.FST
FSA and FSTs from sorted data, suggester, indexes
org.apache.lucene.util.automaton.fst.*
FSA representation

Arc-based, not state-based
Moore vs. Mealy. Compact vs. intuitive

Next-state chaining
requires unusual tricks during construction

Everything in a byte[]
traversals-ready, memory-efficient

Dual transition storage format
lookup: bsearch or linear scan

Input: abc, bd, bde.
s3 → a → s2 → b → s1 → c
s3 → b → s4 → d → s5 → e

Arc layout (F = final, L = last arc of a state, N = target is the next state in the array)
s1    s2    s5    s4    s3
cFL   bL    eFL   dL    a  bL

With next-state chaining
s1    s2    s5    s4    s3
cFL   bL    eFL   dL    bLN  a
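
A rough sketch of an arc-based automaton stored in a byte[] follows. The fixed three-byte arc layout (label, flags, target offset), the flag values and the class name FlatArcFsa are assumptions made for illustration and do not match Lucene's actual encoding. Finality lives on arcs, so the arc for d is marked final here because bd is part of the input. Labels are assumed to be ASCII, next-state chaining is left out, and lookup uses only the linear scan.

```java
/** Arcs of an automaton for abc, bd, bde, flattened into a byte array. */
final class FlatArcFsa {
    static final int FINAL = 1;     // the input may end after this arc
    static final int LAST = 2;      // last arc of its state
    static final int ARC_SIZE = 3;  // label, flags, target offset
    static final byte NONE = -1;    // the arc leads to a state without outgoing arcs

    static final byte[] ARCS = {
        'c', FINAL | LAST, NONE,    // offset 0:  s1
        'b', LAST, 0,               // offset 3:  s2 → s1
        'e', FINAL | LAST, NONE,    // offset 6:  s5
        'd', FINAL | LAST, 6,       // offset 9:  s4 → s5
        'a', 0, 3,                  // offset 12: s3 (root) → s2
        'b', LAST, 9,               // offset 15: s3 (root) → s4
    };
    static final int ROOT = 12;

    /** Linear scan over the arcs of one state; returns the arc offset or -1. */
    static int findArc(byte[] arcs, int state, char label) {
        for (int arc = state; ; arc += ARC_SIZE) {
            if (arcs[arc] == label) return arc;
            if ((arcs[arc + 1] & LAST) != 0) return -1;
        }
    }

    static boolean accepts(byte[] arcs, int root, String word) {
        int state = root;
        boolean accepted = false;
        for (int i = 0; i < word.length(); i++) {
            if (state < 0) return false;          // no outgoing arcs left
            int arc = findArc(arcs, state, word.charAt(i));
            if (arc < 0) return false;
            accepted = (arcs[arc + 1] & FINAL) != 0;
            state = arcs[arc + 2];
        }
        return accepted;
    }
}
```

A state is nothing more than the offset of its first arc, which is what makes the structure traversal-ready without any deserialization step.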
                     Input size            Compressed size (MB)
Input                 MB        Terms      Lucene    morf.    gzip
Wikipedia t.index    481   38 092 045         258      164     149
Polish infl.         162    3 672 200         3.1      1.7    15.4

Use Cases:
Solr's Autocomplete
Solr's
Suggesters

Design choices
sort order (alpha, score), prefix vs. spelling, boost exact matches?

Weights
term→weight, lookup(term, onlyMorePopular)

org.apache.solr.spelling.suggest.Lookup
JaspellLookup, TSTLookup, FSTLookup
Take 1

flour|3
four|4
fourier|3
furious|2

Query: fou*

Automaton paths (term, then | and the weight)
f → l → o → u → r → | → 3
f → o → u → r → | → 4
f → o → u → r → i → e → r → | → 3
f → u → r → i → o → u → s → | → 2

Find prefix.
Depth-first traversal for completions.
PQ on score|alpha
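
The Take 1 steps (find the prefix, traverse depth-first for completions, rank with a priority queue on score, then alphabetically) might look like the sketch below. It assumes a plain trie with the weight stored on the state where a term ends; PrefixSuggester and Suggestion are hypothetical names, not Solr's Lookup API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

final class PrefixSuggester {
    private static final class State {
        final TreeMap<Character, State> arcs = new TreeMap<>();
        int weight = -1; // -1: no term ends in this state
    }

    private static final class Suggestion {
        final String term;
        final int weight;
        Suggestion(String term, int weight) { this.term = term; this.weight = weight; }
    }

    // Higher weight first, ties broken alphabetically.
    private static final Comparator<Suggestion> BEST_FIRST = (a, b) ->
        a.weight != b.weight ? Integer.compare(b.weight, a.weight) : a.term.compareTo(b.term);

    private final State root = new State();

    void add(String term, int weight) {
        State s = root;
        for (char c : term.toCharArray()) {
            s = s.arcs.computeIfAbsent(c, k -> new State());
        }
        s.weight = weight;
    }

    List<String> suggest(String prefix, int n) {
        // Find prefix.
        State s = root;
        for (char c : prefix.toCharArray()) {
            s = s.arcs.get(c);
            if (s == null) return Collections.emptyList();
        }
        // Depth-first traversal; the queue keeps the n best, worst on top.
        PriorityQueue<Suggestion> best = new PriorityQueue<>(BEST_FIRST.reversed());
        collect(s, new StringBuilder(prefix), best, n);
        List<Suggestion> sorted = new ArrayList<>(best);
        sorted.sort(BEST_FIRST);
        List<String> terms = new ArrayList<>();
        for (Suggestion suggestion : sorted) terms.add(suggestion.term);
        return terms;
    }

    private void collect(State s, StringBuilder term, PriorityQueue<Suggestion> best, int n) {
        if (s.weight >= 0) {
            best.add(new Suggestion(term.toString(), s.weight));
            if (best.size() > n) best.poll(); // drop the current worst
        }
        for (Map.Entry<Character, State> e : s.arcs.entrySet()) {
            term.append(e.getKey());
            collect(e.getValue(), term, best, n);
            term.setLength(term.length() - 1);
        }
    }
}
```

The drawback this sketch shares with Take 1 is that every completion under the prefix is visited, so short prefixes over a large dictionary are expensive.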
Take 2

2furious
3flour
3fourier
4four

Query: fou*

Automaton paths (weight bucket first)
2 → f → u → r → i → o → u → s
3 → f → l → o → u → r
3 → f → o → u → r → i → e → r
4 → f → o → u → r

From score roots, until N collected.
Find prefix.
Depth-first traversal for completions, stop if N collected.
Find/boost exact match.
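
A sketch of the Take 2 lookup, with a sorted set standing in for the automaton (a swapped-in structure, chosen only to keep the example short). It assumes single-digit weight buckets 0 to 9, higher meaning better, and one bucket per term. BucketedSuggester is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

final class BucketedSuggester {
    private static final int MAX_BUCKET = 9; // assumption: one digit per bucket

    // Entries are stored as bucket digit + term, e.g. "4four".
    private final NavigableSet<String> entries = new TreeSet<>();

    void add(String term, int bucket) {
        if (bucket < 0 || bucket > MAX_BUCKET) {
            throw new IllegalArgumentException("Bucket out of range: " + bucket);
        }
        entries.add(bucket + term);
    }

    List<String> suggest(String prefix, int n) {
        List<String> out = new ArrayList<>();
        // Find/boost exact match: it goes first regardless of its bucket.
        for (int bucket = MAX_BUCKET; bucket >= 0 && n > 0; bucket--) {
            if (entries.contains(bucket + prefix)) {
                out.add(prefix);
                break;
            }
        }
        // From the best score root down, until n collected.
        for (int bucket = MAX_BUCKET; bucket >= 0 && out.size() < n; bucket--) {
            String key = bucket + prefix;
            // Entries sharing the key prefix are contiguous in sorted order.
            for (String entry : entries.tailSet(key, true)) {
                if (!entry.startsWith(key) || out.size() >= n) break;
                String term = entry.substring(1);
                if (!term.equals(prefix)) out.add(term);
            }
        }
        return out;
    }
}
```

Because traversal starts in the highest bucket and stops as soon as n terms are collected, low-weight completions are never visited when enough high-weight ones exist.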
Take 3 (infixes)

2furious
5urious|furious
5rious|furious
5ious|furious
5ous|furious
5us|furious
5s|furious
3flour
…

[Diagram: the automaton built from the entries above, with roots 2, 3, 4, 5, 6 and 7]
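
The infix entries listed for furious can be generated mechanically, as in this sketch. The bucket used for infix entries is passed in by the caller (5 for furious above); how that bucket is derived from the term's weight is not stated in the slides, so it is left as a parameter. InfixEntries is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

final class InfixEntries {
    /**
     * Returns one entry per proper suffix of the term, in the form
     * bucket + suffix + "|" + term, e.g. "5rious|furious".
     */
    static List<String> infixEntries(String term, int infixBucket) {
        List<String> entries = new ArrayList<>();
        for (int start = 1; start < term.length(); start++) {
            entries.add(infixBucket + term.substring(start) + "|" + term);
        }
        return entries;
    }
}
```

A prefix lookup on such entries finds terms that contain the typed text anywhere, at the cost of one extra entry per character of each term.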
Constant time lookups!
Regardless of the terms dictionary size.
Regardless of prefix length.

Exact matches only.
Static snapshot (not incremental).
Discretized weights.
Top50KWiki.utf8, 676 KB, 50 000 terms

                      Jaspell          TST        FST
RAM [B]             7 869 415    7 914 524    300 175

queries per second, ... tpq
PREFIX [100-200]          458          966        742
PREFIX [6-9]              330          228        659
PREFIX [2-4]              126           29        501
Summary
Summary and Conclusions
Automata
compact, powerful, efficient data structure

Lucene/Solr benefits
behind the scenes, but spreading: index, queries, suggesters

API in Lucene
…is being shaped right now, still @experimental
Acknowledgement

Michael McCandless

Robert Muir

committer: .+
dawid.weiss@carrotsearch.com

 
What's new in solr june 2014
What's new in solr june 2014What's new in solr june 2014
What's new in solr june 2014
 
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache SolrMinneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
 
Minneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com SearchMinneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com Search
 
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
 
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
 
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
 
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DCBig Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
 
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DCIntro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
 
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DCTest Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
 
Building a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLKBuilding a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLK
 
Introducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinarIntroducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinar
 
Solr4 nosql search_server_2013
Solr4 nosql search_server_2013Solr4 nosql search_server_2013
Solr4 nosql search_server_2013
 
Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks
Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks
Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks
 

Dawid Weiss - Finite State Automata in Lucene

  • 1. Finite State Automata in Lucene. Dawid Weiss
  • 2. Dawid Weiss. 20+ years of coding, 10 years assembly only. Academia & Research: PhD in Information Retrieval, PUT. Open source: Carrot2, HPPC, Lucene, … Industry & Business: Carrot Search s.c.
  • 3. Talk outline. State machines (automata): FSAs, DFAs, FSTs and other XXXs. Use cases in Lucene and Solr: Suggester, FuzzySearch, Index. No API details: still @experimental.
  • 4. (Non)? Deterministic Finite State (Automata|Machines)
  • 5. HashSet: hash → slot → value. 0x29384d34 → lucene, 0xde3e3354 → lucid, 0x00000666 → lucifer.
  • 6. HashSet: hash → slot → value. 0x29384d34 → lucene, 0xde3e3354 → lucid, 0x00000666 → lucifer. FSA (deterministic), arcs: l u c e n e; i d; f e r.
  • 7. HashSet: hash → slot → value. 0x29384d34 → lucene, 0xde3e3354 → lucid, 0x00000666 → lucifer. FSA (deterministic), arcs: l u c e n e; i d; f e r. Operations: exists(sequence), floor(prefix), ceil(prefix).
  • 8. Arcs k i l l; b i l l: deterministic, non-minimal.
  • 9. Arcs k i l l; b i l l: deterministic, non-minimal. Arcs k and b leading into a shared i l l: deterministic, minimal.
  • 10. Arcs k i l l; b i l l: deterministic, non-minimal. Arcs k and b leading into a shared i l l: deterministic, minimal. Third diagram over the same arcs: non-deterministic, non-minimal.
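
The gap between the non-minimal and the minimal deterministic automaton can be sketched in a few lines of Python. This is only an illustration: it assumes the diagrams encode the two words kill and bill (the flattened slide text does not say so explicitly), and the helper names `build_trie`, `minimize`, `count_states` and `accepts` are made up for this example. The non-minimal automaton is a plain trie; the minimal one is obtained by merging states that have the same finality and the same outgoing arcs, working from the leaves upward.

```python
def build_trie(words):
    """Deterministic but non-minimal automaton: only prefixes are shared."""
    root = {"final": False, "next": {}}
    for word in words:
        node = root
        for ch in word:
            node = node["next"].setdefault(ch, {"final": False, "next": {}})
        node["final"] = True
    return root


def minimize(node, registry):
    """Merge equivalent states bottom-up (valid for acyclic automata only)."""
    for ch, child in list(node["next"].items()):
        node["next"][ch] = minimize(child, registry)
    # Two states are equivalent when they agree on finality and on
    # every (label, canonical target) pair.
    signature = (
        node["final"],
        tuple(sorted((ch, id(child)) for ch, child in node["next"].items())),
    )
    return registry.setdefault(signature, node)


def count_states(root):
    """Count distinct states reachable from the root."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if id(node) not in seen:
            seen.add(id(node))
            stack.extend(node["next"].values())
    return len(seen)


def accepts(root, word):
    """Follow one arc per character; accept only in a final state."""
    node = root
    for ch in word:
        node = node["next"].get(ch)
        if node is None:
            return False
    return node["final"]
```

Under that assumption, the trie for kill and bill keeps two separate i-l-l tails, while the minimized automaton shares a single tail behind both k and b.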
  • 11. (Sorted)Map: lucene → 1, lucid → 2, lucifer → 666.
  • 12. (Sorted)Map: lucene → 1, lucid → 2, lucifer → 666. FST (transducer), arcs: l u c e n e|1; i d|2; f e r|666.
  • 13. (Sorted)Map: lucene → 1, lucid → 2, lucifer → 666. FST (transducer), arcs: l|1 u c e n e; i|1 d; f|664 e r.
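
A small Python sketch may help to read the outputs on the transducer arcs. It assumes that the value of a key is the sum of the outputs along its path, and that outputs are moved as close to the root as possible by putting on each arc the smallest value reachable through it minus what was already emitted. The names `build_fst`, `lookup` and the dictionary layout are hypothetical, and the structure is a plain trie without suffix sharing, so it is not how Lucene stores an FST.

```python
def _new_node():
    return {"final": False, "value": None, "min": float("inf"),
            "final_out": 0, "arcs": {}}


def build_fst(pairs):
    """Trie-shaped transducer; each arc is a two-element list [output, target]."""
    root = _new_node()
    for word, value in pairs:
        node = root
        node["min"] = min(node["min"], value)
        for ch in word:
            arc = node["arcs"].setdefault(ch, [0, _new_node()])
            node = arc[1]
            node["min"] = min(node["min"], value)  # smallest value below this state
        node["final"], node["value"] = True, value
    _push_outputs(root, 0)
    return root


def _push_outputs(node, carried):
    """Give every arc the part of the subtree minimum not yet emitted."""
    if node["final"]:
        node["final_out"] = node["value"] - carried
    for arc in node["arcs"].values():
        target = arc[1]
        arc[0] = target["min"] - carried
        _push_outputs(target, target["min"])


def lookup(fst, word):
    """Sum the outputs along the path; None if the word is not a key."""
    node, total = fst, 0
    for ch in word:
        arc = node["arcs"].get(ch)
        if arc is None:
            return None
        total += arc[0]
        node = arc[1]
    return total + node["final_out"] if node["final"] else None
```

For the three keys on the slide this sketch places 1 on the first l arc, 1 on the i arc and 664 on the f arc, so lucid sums to 1 + 1 and lucifer to 1 + 1 + 664.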
  • 14. NFSAs and Regular expressions. Automaton fragments for a, e1e2, e+, e*, e? (diagrams). Determinization: states explosion, not always possible. Backtracking: recursion explosion.
  • 15. a?ⁿaⁿ
  • 17. a?ⁿaⁿ, n=3 → a?a?a?aaa. Source: Russ Cox, Regular Expression Matching Can Be Simple And Fast (re2).
  • 18. [Plot: time in ms (axis 0 to 35000) against n (axis 0 to 30).] Time of matching aⁿ for pattern a?ⁿaⁿ, depending on n. Java 1.6, modern hardware.
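
The blow-up can be reproduced with a toy matcher. The Python sketch below is not Java's regex engine; it is a minimal illustration restricted to patterns made of single characters that are either mandatory or optional, with hypothetical helper names `make_pattern`, `backtrack` and `simulate`. The backtracking matcher tries "consume" before "skip" for every optional character and counts its recursive calls; the second function tracks the set of reachable pattern positions instead, which is the state-set simulation described in the cited article.

```python
def make_pattern(n):
    """a?^n a^n as a list of (character, is_optional) pairs."""
    return [("a", True)] * n + [("a", False)] * n


def backtrack(pattern, text, i, j, calls):
    """Recursive matcher; calls[0] counts how many times it is entered."""
    calls[0] += 1
    if i == len(pattern):
        return j == len(text)
    ch, optional = pattern[i]
    # Greedy choice first: consume a character if possible.
    if j < len(text) and text[j] == ch and backtrack(pattern, text, i + 1, j + 1, calls):
        return True
    # Otherwise skip the element, which is allowed only if it is optional.
    return optional and backtrack(pattern, text, i + 1, j, calls)


def _closure(pattern, states):
    """Add every position reachable by skipping optional elements."""
    out = set()
    for s in states:
        while True:
            out.add(s)
            if s < len(pattern) and pattern[s][1]:
                s += 1
            else:
                break
    return out


def simulate(pattern, text):
    """Advance a set of pattern positions one input character at a time."""
    current = _closure(pattern, {0})
    steps = 0
    for c in text:
        following = set()
        for s in current:
            steps += 1
            if s < len(pattern) and pattern[s][0] == c:
                following.add(s + 1)
        current = _closure(pattern, following)
    return len(pattern) in current, steps
```

On the input aⁿ the backtracking variant enters more than 2ⁿ calls before it finds the match, while the state-set variant does at most n(2n+1) position checks.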
  • 19. Linear-time, minimal, deterministic FSA construction. Linear algorithm from sorted input by Daciuk, Mihov, et al. Active path: states that can still change. States dictionary: nodes that will never change.
  • 20. 1) common AP prefix 2) freeze the rest of AP 3) add suffix → new AP. lucene
  • 21. 1) common AP prefix 2) freeze the rest of AP 3) add suffix → new AP. l u c e n e. lucid
  • 22. 1) common AP prefix 2) freeze the rest of AP 3) add suffix → new AP. l u c e n e i d
  • 23. 1) common AP prefix 2) freeze the rest of AP 3) add suffix → new AP. l u c e n e i d. lucifer
  • 24. 1) common AP prefix 2) freeze the rest of AP 3) add suffix → new AP. l u c e n e i d f e r
  • 25. 1) common AP prefix 2) freeze the rest of AP 3) add suffix → new AP. l u c e n e i d r f e
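
The three steps can be sketched in Python as follows. This is an illustrative reading of the slides, not the Lucene implementation: the class `State`, the functions `build_minimal`, `accepts` and `count_states`, and the use of a dictionary keyed by state signatures as the "states dictionary" are all assumptions made for the example. The input must be sorted, which is what guarantees that a state leaving the active path can never change again.

```python
class State:
    def __init__(self):
        self.final = False
        self.arcs = {}  # label -> State; sorted input keeps labels in sorted order

    def signature(self):
        # Computed only when all targets are already frozen (canonical) states.
        return (self.final,
                tuple((label, id(target)) for label, target in self.arcs.items()))


def _freeze(path, word, keep, registry):
    """Swap states path[keep+1:] for equivalent registered ones, deepest first."""
    for depth in range(len(word), keep, -1):
        state = path[depth]
        canonical = registry.setdefault(state.signature(), state)
        path[depth - 1].arcs[word[depth - 1]] = canonical


def build_minimal(sorted_words):
    registry = {}    # states dictionary: frozen states that will never change
    root = State()
    path = [root]    # active path: states of the most recently added word
    previous = ""
    for word in sorted_words:
        if word < previous:
            raise ValueError("input must be sorted")
        # 1) common prefix of the new word and the active path
        common = 0
        while (common < min(len(word), len(previous))
               and word[common] == previous[common]):
            common += 1
        # 2) freeze the rest of the active path
        _freeze(path, previous, common, registry)
        del path[common + 1:]
        # 3) add the remaining suffix, which becomes the new active path
        for label in word[common:]:
            state = State()
            path[-1].arcs[label] = state
            path.append(state)
        path[-1].final = True
        previous = word
    _freeze(path, previous, 0, registry)
    return root


def accepts(root, word):
    state = root
    for label in word:
        state = state.arcs.get(label)
        if state is None:
            return False
    return state.final


def count_states(root):
    seen, stack = set(), [root]
    while stack:
        state = stack.pop()
        if id(state) not in seen:
            seen.add(id(state))
            stack.extend(state.arcs.values())
    return len(seen)
```

Calling `build_minimal(["lucene", "lucid", "lucifer"])` walks through the same sequence as the slides: lucene forms the first active path, lucid shares the prefix luc and freezes the tail e-n-e, and lucifer shares luci and freezes d.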
  • 27. Automata in Lucene|Solr. org.apache.lucene.util.automaton.*: partial port of brics, FuzzyQuery, AutomatonTermsEnum. org.apache.lucene.util.automaton.fst.FST: FSA and FSTs from sorted data, suggester, indexes.
  • 28. org.apache.lucene.util.automaton.fst.*: FSA representation. Arc-based, not state-based: Moore vs. Mealy, compact vs. intuitive. Input: abc, bd, bde. (Automaton diagrams for this input.)
  • 29. org.apache.lucene.util.automaton.fst.*: FSA representation. Arc-based, not state-based: Moore vs. Mealy, compact vs. intuitive. Next-state chaining requires unusual tricks during construction. Diagram: arcs a b c (states s3 s2 s1) and b d e (states s4 s5). Arc layout for states s1 s2 s5 s4 s3: cFL bL eFL dL a bL; second layout: cFL bL eFL dL bLN a.
  • 30. org.apache.lucene.util.automaton.fst.*: FSA representation. Arc-based, not state-based: Moore vs. Mealy, compact vs. intuitive. Next-state chaining requires unusual tricks during construction. Everything in a byte[]: traversals-ready, memory-efficient. Diagram: arcs a b c (states s3 s2 s1) and b d e (states s4 s5). Arc layout for states s1 s2 s5 s4 s3: cFL bL eFL dL a bL; second layout: cFL bL eFL dL bLN a.
  • 31. org.apache.lucene.util.automaton.fst.*: FSA representation. Arc-based, not state-based: Moore vs. Mealy, compact vs. intuitive. Next-state chaining requires unusual tricks during construction. Everything in a byte[]: traversals-ready, memory-efficient. Dual transition storage format: lookup by bsearch or linear scan. Diagram: arcs a b c (states s3 s2 s1) and b d e (states s4 s5). Arc layout for states s1 s2 s5 s4 s3: cFL bL eFL dL a bL; second layout: cFL bL eFL dL bLN a.
  • 32. Input size and compressed size (MB). Wikipedia t.index: 481 MB input, 38 092 045 terms; Lucene 258, morf. 164, gzip 149. Polish infl.: 162 MB input, 3 672 200 terms; Lucene 3.1, morf. 1.7, gzip 15.4.
  • 34. Solr's Suggesters. Design choices: sort order (alpha, score), prefix vs. spelling, boost exact matches? Weights: term→weight, lookup(term, onlyMorePopular). org.apache.solr.spelling.suggest.Lookup: JaspellLookup, TSTLookup, FSTLookup.
  • 35. Take 1. flour|3 four|4 fourier|3 furious|2
  • 36. Take 1. flour|3 four|4 fourier|3 furious|2, query →fou*. [Automaton diagram.] Find prefix. Depth-in traversal for completions. PQ on score|alpha.
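
A minimal Python sketch of this first approach, under stated assumptions: a plain nested-dictionary trie stands in for the automaton, the `|` key marks the end of a term and holds its weight (so terms must not contain `|`), and the names `build_index` and `suggest` are invented for the example. All completions under the prefix are collected and then ranked by score, with alphabetical order breaking ties.

```python
import heapq


def build_index(entries):
    """Nested-dict trie; the '|' key stores the weight of a complete term."""
    root = {}
    for term, weight in entries:
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node["|"] = weight
    return root


def suggest(root, prefix, n):
    # Find the state reached by the prefix.
    node = root
    for ch in prefix:
        node = node.get(ch)
        if node is None:
            return []
    # Traverse everything below it and collect (term, weight) pairs.
    found, stack = [], [(node, prefix)]
    while stack:
        node, term = stack.pop()
        for label, child in node.items():
            if label == "|":
                found.append((term, child))
            else:
                stack.append((child, term + label))
    # Priority queue on score, then alphabetical order.
    return heapq.nsmallest(n, found, key=lambda tw: (-tw[1], tw[0]))
```

The weakness of this variant is visible in the code: every completion below the prefix is visited before the best n can be returned, so short prefixes are the expensive case.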
  • 37. Take 2. 2furious 3flour 3fourier 4four
  • 38. Take 2. 2furious 3flour 3fourier 4four, query →fou*. [Automaton diagram.] From score roots, until N collected: find prefix; depth-in traversal for completions, stop if N collected. Find/boost exact match.
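
The second approach can be approximated in Python as below. Two things are assumptions here: the leading digit of each entry is read as a discretized weight bucket, and a sorted list per bucket with binary search replaces the automaton traversal, purely to keep the sketch short. The names `build_buckets` and `suggest` are hypothetical. Buckets are visited from the highest score down, and the scan stops as soon as N suggestions are collected; an exact match is promoted to the front.

```python
from bisect import bisect_left


def build_buckets(entries):
    """Group terms by weight; highest weight first, terms sorted inside a bucket."""
    buckets = {}
    for term, weight in entries:
        buckets.setdefault(weight, []).append(term)
    return [(w, sorted(terms)) for w, terms in sorted(buckets.items(), reverse=True)]


def _contains(terms, term):
    i = bisect_left(terms, term)
    return i < len(terms) and terms[i] == term


def suggest(buckets, prefix, n):
    results = []
    # Boost an exact match regardless of its bucket.
    if any(_contains(terms, prefix) for _, terms in buckets):
        results.append(prefix)
    # Walk buckets from the highest score down, stopping once n are collected.
    for _, terms in buckets:
        i = bisect_left(terms, prefix)
        while len(results) < n and i < len(terms) and terms[i].startswith(prefix):
            if terms[i] != prefix:
                results.append(terms[i])
            i += 1
        if len(results) >= n:
            break
    return results[:n]
```

Compared with the first take, the work is bounded by the number of buckets and by N rather than by the number of completions under the prefix.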
  • 39. Take 3 (infixes). 2furious 5urious|furious 5rious|furious 5ious|furious 5ous|furious 5us|furious 5s|furious 3flour …
  • 40. [Automaton diagram.]
  • 41. Constant time lookups! Regardless of the terms dictionary size. Regardless of prefix length.
  • 42. Constant time lookups! Regardless of the terms dictionary size. Regardless of prefix length. Exact matches only. Static snapshot (not incremental). Discretized weights.
  • 43. Top50KWiki.utf8, 676 KB, 50 000 terms. RAM [B]: Jaspell 7 869 415, TST 7 914 524, FST 300 175. Queries per second, tpq (Jaspell / TST / FST): PREFIX [100-200]: 458 / 966 / 742; PREFIX [6-9]: 330 / 228 / 659; PREFIX [2-4]: 126 / 29 / 501.
  • 45. Summary and Conclusions. Automata: compact, powerful, efficient data structure. Lucene/Solr: benefits behind the scenes, but spreading: index, queries, suggesters. API in Lucene: …is shaped right now, still @experimental.