Lucene 101
June 14, 2014
Varun Thacker
@varunthacker
Search | Discover | Analyze
Agenda
• Apache Lucene - An Introduction
• Inverted Index
• Lucene Scoring - TF-IDF
• Schema Analysis
• DocValues
• Commit strategy - autoCommit, autoSoftCommit
• Merges
Apache Lucene - An Introduction
• High-performance search engine library written in Java
• Application agnostic: indexes and searches bytes (usually UTF-8 text)
• Zero dependencies
• Low-level API for building scalable search solutions
Apache Lucene - An Introduction
• Provides a wide range of language analysis tools
• Lots of modules (Analysis, Spellchecking, Highlighting...)
• Supports Near Real Time search
Inverted Index
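As a rough illustration of the idea behind an inverted index, the sketch below maps each term to the ids of the documents that contain it. It is a toy model, not Lucene's implementation: the function names and the lowercase-and-split analysis step are assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        # Hypothetical analysis step: lowercase, then split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search_all(index, query):
    """Return ids of documents that contain every query term (AND)."""
    postings = [set(index.get(term, [])) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


docs = [
    "Lucene is a search library",
    "Solr is built on Lucene",
    "Search engines use an inverted index",
]
index = build_inverted_index(docs)
```

A lookup such as search_all(index, "lucene search") only touches the postings of the two query terms, which is why term lookups stay fast as the number of documents grows.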
Schema Analysis
Lucene Scoring - TF-IDF
• TF - number of occurrences of the term in the document.
• IDF - a measure of how rare the term is across the documents in the index.
• Normalizations - applied both at index time and at query time
• Coordination factor - the fraction of the query's terms that match the document
• These statistics are per field
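The factors above can be combined in a small sketch. The formulas are modeled on Lucene's classic TF-IDF similarity as commonly documented (square-root term frequency, logarithmic IDF, length normalization, coordination). Query normalization and boosts are left out, and the function and parameter names are assumptions made for the example.

```python
import math


def tf(freq):
    # Term frequency factor: square root of the raw count.
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    # Rarer terms (low doc_freq) get a larger weight.
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def length_norm(num_terms):
    # Index-time normalization: shorter fields score higher.
    return 1.0 / math.sqrt(num_terms)


def coord(matched_terms, query_terms):
    # Fraction of the query's terms found in the document.
    return matched_terms / query_terms


def score(query, doc_terms, doc_freqs, num_docs):
    """Score one field of one document against a list of query terms."""
    matched = [t for t in query if t in doc_terms]
    norm = length_norm(len(doc_terms))
    total = 0.0
    for t in matched:
        freq = doc_terms.count(t)
        total += tf(freq) * idf(doc_freqs[t], num_docs) ** 2 * norm
    return coord(len(matched), len(query)) * total
```

Because every statistic is per field, a real query over several fields would compute these values separately for each field.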
Lucene Scoring - TF-IDF
• Lucene provides many other similarity models which can be plugged in:
  • Okapi BM25
  • Language Models
  • Divergence from Randomness
• For Solr users - &debugQuery=true will explain how these factors contributed to the final score for each document
DocValues
• DocValues are column-oriented fields
• They load faster because the values don't need to be uninverted
• Good for faceting / grouping / sorting on fields.
• For Solr users - <field name="field_name" type="string" indexed="true" stored="true" docValues="true" />
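To show what "uninverting" means, the toy sketch below contrasts the inverted view of a field (term to document ids) with a column-oriented view (document id to value). Rebuilding the column from the inverted view is the work that DocValues avoid by storing the column at index time. The field contents and function names are hypothetical.

```python
def uninvert(inverted, num_docs):
    """Rebuild a doc -> value column by walking every term's postings."""
    column = [None] * num_docs
    for term, doc_ids in inverted.items():
        for doc_id in doc_ids:
            column[doc_id] = term
    return column


def facet_counts(column):
    """Count documents per value by scanning the column once."""
    counts = {}
    for value in column:
        counts[value] = counts.get(value, 0) + 1
    return counts


def sort_docs(column):
    """Return doc ids ordered by field value (ties keep doc id order)."""
    return sorted(range(len(column)), key=lambda doc_id: column[doc_id])


# Inverted view of a single-valued "category" field: term -> doc ids.
inverted = {"books": [0, 3], "music": [1], "video": [2, 4]}

# Column-oriented view, as a DocValues-style field would hold it.
doc_values = ["books", "music", "video", "books", "video"]
```

Faceting and sorting both read the value for a given document, so they work directly on doc_values, while the inverted view first needs a pass through uninvert.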
What’s in a commit
• A commit operation makes index changes visible to new search requests.
• hard commit -
  • Calls fsync on the index files to ensure they have been flushed to stable storage and no data loss will result from a power failure.
  • Commit is a costly operation, and doing so frequently will slow down your
What’s in a commit
• soft commit -
  • Makes index changes visible and does not fsync index files or write a new index descriptor.
  • Fast operation
  • If the JVM crashes or there is a loss of power, changes that occurred after the last hard commit will be lost
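The difference between the two commit types can be modeled in a few lines. This is an in-memory toy, not Lucene or Solr code: the class and its three lists are assumptions that stand in for buffered changes, the searchable view, and data fsynced to stable storage.

```python
class ToyIndex:
    """In-memory model of commit visibility and durability."""

    def __init__(self):
        self.pending = []  # added but not yet searchable
        self.visible = []  # what new search requests can see
        self.durable = []  # stands in for data fsynced to stable storage

    def add(self, doc):
        self.pending.append(doc)

    def soft_commit(self):
        # Make changes searchable; nothing is synced to disk.
        self.visible.extend(self.pending)
        self.pending.clear()

    def hard_commit(self):
        # Make changes searchable and record them as durable.
        self.soft_commit()
        self.durable = list(self.visible)

    def crash(self):
        # After a crash or power loss only durable data survives.
        self.pending.clear()
        self.visible = list(self.durable)
```

Adding "a", hard committing, adding "b", soft committing, and then calling crash() leaves only "a" searchable, which matches the last bullet above.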
Merges
• Every commit creates a new segment
• Merge factor controls the number of segments in an index.
• Merging helps keep the number of file handles small
• It’s an expensive operation!
• More segments mean slightly slower searches
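The effect of the merge factor can be simulated with a simplified model. The sketch is loosely inspired by log-style merge policies and assumes every commit adds one segment of size 1 and that segments merge as soon as merge_factor of them have the same size. Real merge policies also look at byte sizes, deleted documents, and other settings.

```python
def flush_and_merge(num_flushes, merge_factor):
    """Return (segment sizes, number of merges) after num_flushes commits."""
    if merge_factor < 2:
        raise ValueError("merge_factor must be at least 2")
    segments = []
    merges = 0
    for _ in range(num_flushes):
        segments.append(1)  # each commit adds one small segment
        # Whenever merge_factor segments of equal size pile up, merge them.
        while (len(segments) >= merge_factor
               and len(set(segments[-merge_factor:])) == 1):
            merged = sum(segments[-merge_factor:])
            del segments[-merge_factor:]
            segments.append(merged)
            merges += 1
    return segments, merges
```

With a merge factor of 10, 25 commits end up as 7 segments after 2 merges. With a merge factor of 2, 8 commits end up as a single segment but need 7 merges. A lower merge factor keeps fewer segments (and file handles) around for search at the cost of more merge work.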
Thank you!
• Questions?
• Email - varun.thacker@lucidworks.com
We’re Hiring Solr Developers
Email - careers@lucidworks.com
