Apache Lucene (Core) 
November, 2013 
Engin Yöyen
What is Lucene? 
• Information retrieval library / Text search-engine 
• Built with Java
• High performance, scalable 
• Turn-key solution
Document Model
Indexing
Inverted Index 
id term docId 
1 take 3 
2 step 3 
3 hang 1 
4 right 1,2 
5 people 2,3 
6 consider 2 
7 wrong 2
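
A minimal sketch of how a table like the one above could be built, in plain Python rather than Lucene's own API. The three documents are hypothetical token lists, chosen only so that the result matches the table; `build_inverted_index` is an illustrative name, not a Lucene function.

```python
from collections import defaultdict

# Hypothetical, already-analyzed documents (docId -> terms), chosen only so
# that the resulting postings match the table above.
docs = {
    1: ["hang", "right"],
    2: ["right", "people", "consider", "wrong"],
    3: ["take", "step", "people"],
}

def build_inverted_index(docs):
    """Map each term to the sorted list of docIds that contain it."""
    postings = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

index = build_inverted_index(docs)
print(index["right"])   # [1, 2]
print(index["people"])  # [2, 3]
```

Looking up a term is then a single dictionary access, which is why search does not need to scan every document.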
Queries 
• Field based (author:tolstoy content:people) 
• Boolean (people AND fear) (+people +fear -fun) 
• Wildcard (fe?r, peop*)
• Fuzzy (mist~, mist~0.9) 
• Proximity ("sooner later"~1)
• Range ([1850 TO 1890])
• Boost factor (people^2 fear)
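
As a rough illustration of how a Boolean query such as (+people +fear -fun) can be answered from an inverted index, the sketch below intersects and subtracts sets of docIds. The postings are made up for the example, `boolean_search` is a hypothetical helper, and scoring is ignored entirely; Lucene's real implementation works differently.

```python
# Hypothetical postings (term -> set of docIds); not taken from the slides.
postings = {
    "people": {1, 2, 3},
    "fear": {2, 3, 4},
    "fun": {3},
}

def boolean_search(postings, must=(), must_not=()):
    """Return sorted docIds containing every `must` term and no `must_not` term."""
    if not must:
        # Simplification: a query with only prohibited terms matches nothing.
        return []
    result = set(postings.get(must[0], set()))
    for term in must[1:]:
        result &= postings.get(term, set())   # "+term": intersect
    for term in must_not:
        result -= postings.get(term, set())   # "-term": subtract
    return sorted(result)

hits = boolean_search(postings, must=["people", "fear"], must_not=["fun"])
print(hits)  # [2]
```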
Tokenizers & Token Filters
the quick brown fox jumped over the lazy dogs 
• WhitespaceTokenizer (“the” “quick” “brown” “fox” “jumped” “over” 
“the” “lazy” “dogs”) 
• StandardAnalyzer, i.e. StandardTokenizer plus lower case and stop word filters (“quick” “brown” “fox” “jumped” “over” “lazy” “dogs”)
• Stem filter (waiting -> wait) 
• Lower case filter 
• Synonym filter 
• and more…
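
The idea of a tokenizer followed by a chain of token filters can be sketched in a few lines of plain Python. This is not Lucene's analysis API: the function names, the tiny stop word list, and the toy "ing" stemmer are all assumptions made for illustration.

```python
STOP_WORDS = {"the", "a", "an", "and", "of"}  # tiny illustrative list

def whitespace_tokenize(text):
    """Split on runs of whitespace, keeping every token."""
    return text.split()

def lowercase_filter(tokens):
    return [t.lower() for t in tokens]

def stop_filter(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem_filter(tokens):
    """Strip a trailing "ing" from longer words; a toy stand-in for a real stemmer."""
    return [t[:-3] if t.endswith("ing") and len(t) > 5 else t for t in tokens]

def analyze(text):
    """Tokenize, then pass the tokens through each filter in order."""
    tokens = whitespace_tokenize(text)
    for token_filter in (lowercase_filter, stop_filter, naive_stem_filter):
        tokens = token_filter(tokens)
    return tokens

print(analyze("The quick brown fox jumped over the lazy dogs"))
# ['quick', 'brown', 'fox', 'jumped', 'over', 'lazy', 'dogs']
print(analyze("waiting"))  # ['wait']
```

The order of the filters matters: lower-casing runs before the stop filter so that "The" is recognized as a stop word.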
Performance 
Indexing throughput of over 150GB/hour on modern hardware
Questions?
