SlideShare a Scribd company logo
1 of 25
Finite-State Queries in Lucene
             Robert Muir
          rmuir@apache.org
Agenda
• Introduction to Lucene
• Improving inexact matching:
  – Background
  – Regular Expression, Wildcard, Fuzzy Queries
• Additional use cases:
  – Language support: expansion versus
    stemming
  – Improved spellchecking
• Other ongoing developments in Lucene
Introduction to Lucene
• Open Source Search Engine Library
  – Not just Java, ported to other languages too.
  – Commercial support via several companies
• Just the library
  – Embed for your own uses, e.g. Eclipse
  – For a search server, see Solr
  – For web search + crawler, see Nutch
• Website: http://lucene.apache.org
Inverted Index
• Like a TreeMap<Terms, Documents>
• Before 2.9, queries only operate on one
  “subtree”.
  – For example, Regex and Fuzzy exhaustively
    evaluate all terms, unless you give them a
    “constant prefix”.
• TermQuery is just a special case, looks at
  one leaf node.
Lucene 2.9: Fast Numeric Ranges
• Indexes at different levels of precision.
• Enumerates multiple subtrees.
  – But typically this is a small number: e.g. 15
• Query APIs improved to support this.

                   4                       5                           6



            42           44            52                  63                    64



      421    423   445   446   448   521       522   632   633   634       641   642   644
Automaton Queries
• Only explore subtrees that can lead to an
  accept state of some finite state machine.
• AutomatonQuery traverses the term
  dictionary and the state machine in parallel
Another way to think of it
• Index as a state machine that recognizes Terms
  and transduces matching Documents.
• AutomatonQuery represents a user’s search
  need as a FSM.
• The intersection of the two emits search results.
Query API improvements
• Automata might need to do many seeks
  around the term dictionary.
  – Depends on what is in term dictionary
  – Depends on state machine structure
• MultiTermQuery API further improved
  – Easier and more efficient to skip around.
  – Explicitly supports seeking.
Regex, Wildcard, Fuzzy
• Without constant prefix, exhaustive
  – Regex: (http|ftp)://foo.com
  – Wildcard: ?oo?ar
  – Fuzzy: foobar~
• Re-implemented as automata queries
  – Just parsers that produce a DFA
  – Improved performance and scalability
  – (http|ftp)://foo.com examines 2 terms.
Additional/Advanced Use Cases
Stemming
• Stemmers work at index and query time
  – walked, walking -> walk
  – Can increase retrieval effectiveness
• Some problems
  – Mistakes: international -> intern
  – Must determine language of documents
  – Multilingual cases can get messy
  – Tuning is difficult: must re-index
  – Unfriendly: wildcards on stemmed terms…
Expansion instead
• Don’t remove data at index time
  – Expand the query instead.
  – Single field now works well for all queries:
    exact match, wildcard, expanded, etc.
• Simplifies search configuration
  – Tuning relevance is easier, no re-indexing.
  – No need to worry about language ID for docs.
  – Multilingual case is much simpler.
Automata expansion
• Natural fit for morphology
• Use set intersection operators
  – Minus to subtract exact match case
  – Union to search multiple languages
• Efficient operation
  – Doesn’t explode for languages with complex
    morphology
Experimental results
• 125k docs English test collection
   • Results are for TD queries
• Inverted the “S-Stemmer”
   • 6 declarative rewrite rules to regex
• Competitive with traditional stemming.
                  No      Porter    S-Stem    Automaton
               Stemming                        S-Stem
      MAP       0.4575    0.5069    0.5029     0.4979
      MRR       0.8070    0.7862    0.7587     0.8220
     # Terms   336,675    280,061   305,710    336,675
TODO:
• Support expansion models, too in Lucene.
• Language-specific resources
  – lucene-hunspell could provide these
• Language-independent tokenization
  – Unicode rules go a long way.
• Scoring that doesn’t need stopwords
  – For now, use CommonGrams!
Spellchecking




• Lucene spellchecker builds a separate
  index to find correction candidates
• Perhaps our fuzzy enumeration is now fast
  enough for small edit distances (e.g. 1,2)
  to just use the index directly.
• Could simplify configurations, especially
  distributed ones.
Ongoing developments in Lucene
Community
• Merging Lucene and Solr development
  – Still two separate released “products”!!!
  – Share mailing list and code repository
  – Solr dev code in sync with Lucene dev code
• Benefits to both Lucene and Solr users
  – Lucene features exposed to Solr faster
  – Solr features available to Lucene users
Indexing
• Flexible Indexing
  – Customize the format of the index
  – Decreased RAM usage
  – Faster IndexReader open and faster seeking
• Future
  – Serialize custom attributes to the index
  – More RAM savings
  – Improved index compression
  – Faster Near-Real-Time performance
Text Analysis
• Improved Unicode Support
  – Unicode 4/Supplementary support
  – ICU integration (to support Unicode 5.2)
• Improved Language Support
  – More languages
  – Faster indexing performance
  – Easier packaging and ease-of-use
• Improvements to Analysis API
Relevance and Scoring
• Improved relevance benchmarking
  – Open Relevance Project
  – Quality Benchmarking Package
• Future: More flexible scoring
  – How to support BM25, DFR, more vector
    space?
  – Some packages/patches available, but
     • Additional index statistics needed to be simple
Other
• Improvements to Spatial support
  – See Chris Male’s talk here:
    http://vimeo.com/10204365
  – Progress in both Lucene and Solr
• Steps towards a more modular
  architecture
  – Smaller Lucene core
  – Separate modules with more functionality
Questions?
Backup slides
Finite State Queries In Lucene

More Related Content

What's hot

OpenSearch
OpenSearchOpenSearch
OpenSearch
hchen1
 

What's hot (20)

Getting Started Monitoring with Prometheus and Grafana
Getting Started Monitoring with Prometheus and GrafanaGetting Started Monitoring with Prometheus and Grafana
Getting Started Monitoring with Prometheus and Grafana
 
OpenSearch
OpenSearchOpenSearch
OpenSearch
 
All of the Performance Tuning Features in Oracle SQL Developer
All of the Performance Tuning Features in Oracle SQL DeveloperAll of the Performance Tuning Features in Oracle SQL Developer
All of the Performance Tuning Features in Oracle SQL Developer
 
Web Crawling with Apache Nutch
Web Crawling with Apache NutchWeb Crawling with Apache Nutch
Web Crawling with Apache Nutch
 
Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)
Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)
Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)
 
NATS Streaming - an alternative to Apache Kafka?
NATS Streaming - an alternative to Apache Kafka?NATS Streaming - an alternative to Apache Kafka?
NATS Streaming - an alternative to Apache Kafka?
 
Redis Reliability, Performance & Innovation
Redis Reliability, Performance & InnovationRedis Reliability, Performance & Innovation
Redis Reliability, Performance & Innovation
 
Azure Database Services for MySQL PostgreSQL and MariaDB
Azure Database Services for MySQL PostgreSQL and MariaDBAzure Database Services for MySQL PostgreSQL and MariaDB
Azure Database Services for MySQL PostgreSQL and MariaDB
 
Systems Monitoring with Prometheus (Devops Ireland April 2015)
Systems Monitoring with Prometheus (Devops Ireland April 2015)Systems Monitoring with Prometheus (Devops Ireland April 2015)
Systems Monitoring with Prometheus (Devops Ireland April 2015)
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?
 
Big Data, Data Lake, Fast Data - Dataserialiation-Formats
Big Data, Data Lake, Fast Data - Dataserialiation-FormatsBig Data, Data Lake, Fast Data - Dataserialiation-Formats
Big Data, Data Lake, Fast Data - Dataserialiation-Formats
 
Large Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraphLarge Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraph
 
Data Federation with Apache Spark
Data Federation with Apache SparkData Federation with Apache Spark
Data Federation with Apache Spark
 
OpenSearch.pdf
OpenSearch.pdfOpenSearch.pdf
OpenSearch.pdf
 
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
 
Introduction to Azure Functions
Introduction to Azure FunctionsIntroduction to Azure Functions
Introduction to Azure Functions
 
Introduction to Prometheus
Introduction to PrometheusIntroduction to Prometheus
Introduction to Prometheus
 
OSMC 2021 | Introduction into OpenSearch
OSMC 2021 | Introduction into OpenSearchOSMC 2021 | Introduction into OpenSearch
OSMC 2021 | Introduction into OpenSearch
 
MySQL8.0_performance_schema.pptx
MySQL8.0_performance_schema.pptxMySQL8.0_performance_schema.pptx
MySQL8.0_performance_schema.pptx
 

Viewers also liked

Analytics in olap with lucene & hadoop
Analytics in olap with lucene & hadoopAnalytics in olap with lucene & hadoop
Analytics in olap with lucene & hadoop
lucenerevolution
 
Architecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisArchitecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's Thesis
Josiane Gamgo
 
Text categorization with Lucene and Solr
Text categorization with Lucene and SolrText categorization with Lucene and Solr
Text categorization with Lucene and Solr
Tommaso Teofili
 

Viewers also liked (20)

Portable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej BialeckiPortable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej Bialecki
 
Lucene
LuceneLucene
Lucene
 
Lucandra
LucandraLucandra
Lucandra
 
Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene Dawid Weiss- Finite state automata in lucene
Dawid Weiss- Finite state automata in lucene
 
Search at Tumblr (nyc search meetup)
Search at Tumblr (nyc search meetup)Search at Tumblr (nyc search meetup)
Search at Tumblr (nyc search meetup)
 
Lucene And Solr Intro
Lucene And Solr IntroLucene And Solr Intro
Lucene And Solr Intro
 
Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1
 
Apache lucene
Apache luceneApache lucene
Apache lucene
 
Analytics in olap with lucene & hadoop
Analytics in olap with lucene & hadoopAnalytics in olap with lucene & hadoop
Analytics in olap with lucene & hadoop
 
Beyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBeyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and Solr
 
Lucene and MySQL
Lucene and MySQLLucene and MySQL
Lucene and MySQL
 
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer SimonDocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
 
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
 
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBMBuilding and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBM
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, FlipkartNear Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
 
Architecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisArchitecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's Thesis
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introduction
 
Text categorization with Lucene and Solr
Text categorization with Lucene and SolrText categorization with Lucene and Solr
Text categorization with Lucene and Solr
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and Usecases
 

Similar to Finite State Queries In Lucene

Lucene BootCamp
Lucene BootCampLucene BootCamp
Lucene BootCamp
GokulD
 
Lucene Bootcamp - 2
Lucene Bootcamp - 2Lucene Bootcamp - 2
Lucene Bootcamp - 2
GokulD
 
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr MeetupImproved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
rcmuir
 
Keynote Yonik Seeley & Steve Rowe lucene solr roadmap
Keynote   Yonik Seeley & Steve Rowe lucene solr roadmapKeynote   Yonik Seeley & Steve Rowe lucene solr roadmap
Keynote Yonik Seeley & Steve Rowe lucene solr roadmap
lucenerevolution
 
KEYNOTE: Lucene / Solr road map
KEYNOTE: Lucene / Solr road mapKEYNOTE: Lucene / Solr road map
KEYNOTE: Lucene / Solr road map
lucenerevolution
 
Real world RESTful service development problems and solutions
Real world RESTful service development problems and solutionsReal world RESTful service development problems and solutions
Real world RESTful service development problems and solutions
Masoud Kalali
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
Erik Hatcher
 

Similar to Finite State Queries In Lucene (20)

Lucene BootCamp
Lucene BootCampLucene BootCamp
Lucene BootCamp
 
Lucene Bootcamp - 2
Lucene Bootcamp - 2Lucene Bootcamp - 2
Lucene Bootcamp - 2
 
Illuminating Lucene.Net
Illuminating Lucene.NetIlluminating Lucene.Net
Illuminating Lucene.Net
 
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr MeetupImproved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
 
Performance and Abstractions
Performance and AbstractionsPerformance and Abstractions
Performance and Abstractions
 
Apache Solr - Enterprise search platform
Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platform
 
Lucene 101
Lucene 101Lucene 101
Lucene 101
 
Keynote Yonik Seeley & Steve Rowe lucene solr roadmap
Keynote   Yonik Seeley & Steve Rowe lucene solr roadmapKeynote   Yonik Seeley & Steve Rowe lucene solr roadmap
Keynote Yonik Seeley & Steve Rowe lucene solr roadmap
 
KEYNOTE: Lucene / Solr road map
KEYNOTE: Lucene / Solr road mapKEYNOTE: Lucene / Solr road map
KEYNOTE: Lucene / Solr road map
 
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrLet's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
 
Best practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloudBest practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloud
 
Open Source SQL Databases
Open Source SQL DatabasesOpen Source SQL Databases
Open Source SQL Databases
 
Real world RESTful service development problems and solutions
Real world RESTful service development problems and solutionsReal world RESTful service development problems and solutions
Real world RESTful service development problems and solutions
 
TAUS Moses Industry Roundtable 2014, Changes in Moses, Hieu Hoang, University...
TAUS Moses Industry Roundtable 2014, Changes in Moses, Hieu Hoang, University...TAUS Moses Industry Roundtable 2014, Changes in Moses, Hieu Hoang, University...
TAUS Moses Industry Roundtable 2014, Changes in Moses, Hieu Hoang, University...
 
Search Architecture at Evernote: Presented by Christian Kohlschütter, Evernote
Search Architecture at Evernote: Presented by Christian Kohlschütter, EvernoteSearch Architecture at Evernote: Presented by Christian Kohlschütter, Evernote
Search Architecture at Evernote: Presented by Christian Kohlschütter, Evernote
 
Solr 4
Solr 4Solr 4
Solr 4
 
Query Parsing - Tips and Tricks
Query Parsing - Tips and TricksQuery Parsing - Tips and Tricks
Query Parsing - Tips and Tricks
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Fun with flexible indexing
Fun with flexible indexingFun with flexible indexing
Fun with flexible indexing
 
Apache Lucene 4
Apache Lucene 4Apache Lucene 4
Apache Lucene 4
 

Finite State Queries In Lucene

  • 1. Finite-State Queries in Lucene Robert Muir rmuir@apache.org
  • 2. Agenda • Introduction to Lucene • Improving inexact matching: – Background – Regular Expression, Wildcard, Fuzzy Queries • Additional use cases: – Language support: expansion versus stemming – Improved spellchecking • Other ongoing developments in Lucene
  • 3. Introduction to Lucene • Open Source Search Engine Library – Not just Java, ported to other languages too. – Commercial support via several companies • Just the library – Embed for your own uses, e.g. Eclipse – For a search server, see Solr – For web search + crawler, see Nutch • Website: http://lucene.apache.org
  • 4. Inverted Index • Like a TreeMap<Terms, Documents> • Before 2.9, queries only operate on one “subtree”. – For example, Regex and Fuzzy exhaustively evaluate all terms, unless you give them a “constant prefix”. • TermQuery is just a special case, looks at one leaf node.
  • 5. Lucene 2.9: Fast Numeric Ranges • Indexes at different levels of precision. • Enumerates multiple subtrees. – But typically this is a small number: e.g. 15 • Query APIs improved to support this. 4 5 6 42 44 52 63 64 421 423 445 446 448 521 522 632 633 634 641 642 644
  • 6. Automaton Queries • Only explore subtrees that can lead to an accept state of some finite state machine. • AutomatonQuery traverses the term dictionary and the state machine in parallel
  • 7. Another way to think of it • Index as a state machine that recognizes Terms and transduces matching Documents. • AutomatonQuery represents a user’s search need as a FSM. • The intersection of the two emits search results.
  • 8. Query API improvements • Automata might need to do many seeks around the term dictionary. – Depends on what is in term dictionary – Depends on state machine structure • MultiTermQuery API further improved – Easier and more efficient to skip around. – Explicitly supports seeking.
  • 9. Regex, Wildcard, Fuzzy • Without constant prefix, exhaustive – Regex: (http|ftp)://foo.com – Wildcard: ?oo?ar – Fuzzy: foobar~ • Re-implemented as automata queries – Just parsers that produce a DFA – Improved performance and scalability – (http|ftp)://foo.com examines 2 terms.
  • 11. Stemming • Stemmers work at index and query time – walked, walking -> walk – Can increase retrieval effectiveness • Some problems – Mistakes: international -> intern – Must determine language of documents – Multilingual cases can get messy – Tuning is difficult: must re-index – Unfriendly: wildcards on stemmed terms…
  • 12. Expansion instead • Don’t remove data at index time – Expand the query instead. – Single field now works well for all queries: exact match, wildcard, expanded, etc. • Simplifies search configuration – Tuning relevance is easier, no re-indexing. – No need to worry about language ID for docs. – Multilingual case is much simpler.
  • 13. Automata expansion • Natural fit for morphology • Use set intersection operators – Minus to subtract exact match case – Union to search multiple languages • Efficient operation – Doesn’t explode for languages with complex morphology
  • 14. Experimental results • 125k docs English test collection • Results are for TD queries • Inverted the “S-Stemmer” • 6 declarative rewrite rules to regex • Competitive with traditional stemming. No Porter S-Stem Automaton Stemming S-Stem MAP 0.4575 0.5069 0.5029 0.4979 MRR 0.8070 0.7862 0.7587 0.8220 # Terms 336,675 280,061 305,710 336,675
  • 15. TODO: • Support expansion models, too in Lucene. • Language-specific resources – lucene-hunspell could provide these • Language-independent tokenization – Unicode rules go a long way. • Scoring that doesn’t need stopwords – For now, use CommonGrams!
  • 16. Spellchecking • Lucene spellchecker builds a separate index to find correction candidates • Perhaps our fuzzy enumeration is now fast enough for small edit distances (e.g. 1,2) to just use the index directly. • Could simplify configurations, especially distributed ones.
  • 18. Community • Merging Lucene and Solr development – Still two separate released “products”!!! – Share mailing list and code repository – Solr dev code in sync with Lucene dev code • Benefits to both Lucene and Solr users – Lucene features exposed to Solr faster – Solr features available to Lucene users
  • 19. Indexing • Flexible Indexing – Customize the format of the index – Decreased RAM usage – Faster IndexReader open and faster seeking • Future – Serialize custom attributes to the index – More RAM savings – Improved index compression – Faster Near-Real-Time performance
  • 20. Text Analysis • Improved Unicode Support – Unicode 4/Supplementary support – ICU integration (to support Unicode 5.2) • Improved Language Support – More languages – Faster indexing performance – Easier packaging and ease-of-use • Improvements to Analysis API
  • 21. Relevance and Scoring • Improved relevance benchmarking – Open Relevance Project – Quality Benchmarking Package • Future: More flexible scoring – How to support BM25, DFR, more vector space? – Some packages/patches available, but • Additional index statistics needed to be simple
  • 22. Other • Improvements to Spatial support – See Chris Male’s talk here: http://vimeo.com/10204365 – Progress in both Lucene and Solr • Steps towards a more modular architecture – Smaller Lucene core – Separate modules with more functionality