Lucene Boot Camp I: Presentation Transcript

  • Lucene Boot Camp I
    • Grant Ingersoll
    • Lucid Imagination
    • Nov. 3, 2008
    • New Orleans, LA
  • Intro
    • My Background
    • Goals for Tutorial
      • Understand Lucene core capabilities
      • Real examples, real code, real data
    • Ask Questions!!!!!
  • Schedule
    • Day I
      • Concepts
      • Indexing
      • Searching
      • Analysis
      • Lucene contrib: highlighter, spell checking, etc.
    • Day II
      • In-depth Indexing/Searching
        • Performance, Internals
      • Terms and Term Vectors
      • Class Project
      • Q & A
  • Resources
    • Slides at
      • http://www.lucenebootcamp.com/boot-camp-slides/
    • Lucene Java
      • http://lucene.apache.org/java
      • http://lucene.apache.org/java/2_4_0/
      • http://lucene.apache.org/java/2_4_0/api/index.html
    • Luke:
      • http://www.getopt.org/luke
  • What is Search?
    • Given a user’s information need (query), find documents relevant to the need
      • Very Subjective!
    • Information Retrieval
      • Interdisciplinary
      • Comp. Sci, Math/Statistics, Library Sci., Linguistics, AI…
  • Search Use Cases
    • Web
      • Google, Y!, etc.
    • Enterprise
      • Intranet, Content Repositories, email, etc.
    • eCommerce/DB/CMS
      • Online Stores, websites, etc.
    • Other
      • QA, Federated
    • Yours? Why do you need Search?
  • Your Content And You
    • Only you know your content!
      • Key Features
        • Title, body, price, margin, etc.
      • Important Terms
      • Synonyms/Jargon
      • Structures (tables, lists, etc.)
      • Importance
      • Priorities
  • Search Basics
    • Many different Models:
      • Boolean, Probabilistic, Inference, Neural Net, and:
    • Modified Vector Space Model (VSM)
      • Boolean + VSM
      • TF-IDF
      • The words in the document and the query each define a Vector in an n-dimensional space
      • sim(q, d_j) = cos θ (expanded in the formula block below)
    where d_j = <w_{1,j}, w_{2,j}, …, w_{n,j}>, q = <w_{1,q}, w_{2,q}, …, w_{n,q}>, and w_{i,j} is the weight assigned to term i in document j (diagram: θ is the angle between the query vector q and the document vector d_j)
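To make the weights and the similarity concrete, here is the classic tf-idf weighting with the cosine similarity written out as a normalized dot product. This is the standard textbook form; Lucene's actual Similarity adds length normalization and boost factors on top of it.

```latex
w_{i,j} = \mathrm{tf}_{i,j} \cdot \log\frac{N}{\mathrm{df}_i}
\qquad
\mathrm{sim}(q, d_j) = \cos\theta
  = \frac{\sum_{i=1}^{n} w_{i,q}\, w_{i,j}}
         {\sqrt{\sum_{i=1}^{n} w_{i,q}^{2}}\;\sqrt{\sum_{i=1}^{n} w_{i,j}^{2}}}
```

Here N is the total number of documents, tf_{i,j} is the frequency of term i in document j, and df_i is the number of documents containing term i.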
  • Inverted Index (diagram from “Taming Text”)
  • Lucene Background
    • Created by Doug Cutting in 1999
    • Donated to ASF in 2001
    • Morphed into a Top Level Project (TLP) with many sub projects
      • Java (flagship) a.k.a. “Lucene”
      • Solr, Nutch, Mahout, Tika, several Lucene ports
    • From here on out, Lucene refers to “Lucene Java”
  • Lucene is…
    • NOT a crawler
      • See Nutch
    • NOT an application
      • See PoweredBy on the Wiki
    • NOT a library for doing Google PageRank or other link analysis algorithms
      • See Nutch
    • A library for enabling text-based search
  • A Few Words about Solr
    • HTTP-based Search Server
    • XML Configuration
    • XML, JSON, Ruby, PHP, Java support
    • Many, many nice features that Lucene users need
      • Faceting, spell checking, highlighting
      • Caching, Replication, Distributed
    • http://lucene.apache.org/solr
  • Indexing
    • Process of preparing and adding text to Lucene, which stores it in an inverted index
    • Key Point: Lucene only indexes Strings
      • What does this mean?
        • Lucene doesn’t care about XML, Word, PDF, etc.
          • There are many good open source extractors available
        • It’s our job to convert whatever file format we have into something Lucene can use
  • Indexing Classes
    • Analyzer
      • Creates tokens using a Tokenizer and filters them through zero or more TokenFilters
    • IndexWriter
      • Responsible for converting text into internal Lucene format
    • Directory
      • Where the Index is stored
      • RAMDirectory, FSDirectory, others
  • Indexing Classes
    • Document
      • A collection of Fields
      • Can be boosted
    • Field
      • Free text, keywords, dates, etc.
      • Defines attributes for storing, indexing
      • Can be boosted
      • Field Constructors and parameters
        • Open up Fieldable and Field in an IDE
  • How to Index
    • Create IndexWriter
    • For each input
      • Create a Document
      • Add Fields to the Document
      • Add the Document to the IndexWriter
    • Close the IndexWriter
    • Optimize (optional); a minimal code sketch follows
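A minimal indexing sketch against the Lucene 2.4 API described above. The field names, example text, and on-disk path are illustrative assumptions, not part of the boot camp files:

```java
import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SimpleIndexer {
  public static void main(String[] args) throws Exception {
    // Store the index on disk; a RAMDirectory works the same way in memory.
    Directory dir = FSDirectory.getDirectory(new File("index"));
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true,
        IndexWriter.MaxFieldLength.UNLIMITED);

    // One Document per input; each Field declares its store/index attributes.
    Document doc = new Document();
    doc.add(new Field("title", "Example title", Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("body", "Example body text", Field.Store.NO, Field.Index.ANALYZED));
    doc.setBoost(3.0f); // optional document boost, as in Task 1.a

    writer.addDocument(doc);
    writer.optimize(); // optional: merge segments for faster searching
    writer.close();
  }
}
```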
  • Indexing in a Nutshell
    • For each Document
      • For each Field to be tokenized
        • Create the tokens using the specified Tokenizer
          • Tokens consist of a String, position, type and offset information
        • Pass the tokens through the chained TokenFilters where they can be changed or removed
        • Add the end result to the inverted index
    • Position information can be altered
      • Useful when removing words or to prevent phrases from matching
  • Task 1.a
    • From the Boot Camp Files, use the basic.ReutersIndexer skeleton to start
    • Index the small Reuters Collection using an IndexWriter, a Directory, and the StandardAnalyzer
      • Boost every 10 documents by 3
    • Questions to Answer:
      • What Fields should I define?
      • What attributes should each Field have?
      • Pick a field to boost and give a reason why you think it should be boosted
    • ~30 minutes
  • Use Luke
  • 5 minute Break
  • Searching
    • Parse user query
    • Lookup matching Documents
    • Score Documents
    • Return ranked list
  • Key Classes:
    • Searcher
      • Provides methods for searching
      • Take a moment to look at the Searcher class declaration
      • IndexSearcher, MultiSearcher, ParallelMultiSearcher
    • IndexReader
      • Loads a snapshot of the index into memory for searching
      • More tomorrow
    • TopDocs - The search results
    • QueryParser
      • http://lucene.apache.org/java/docs/queryparsersyntax.html
    • Query
      • Logical representation of the program’s information need
  • Query Parsing
    • Basic syntax:
      • title:hockey +(body:stanley AND body:cup)
    • OR/AND must be uppercase
    • Default operator is OR (can be changed)
    • Supports fairly advanced syntax, see the website
      • http://lucene.apache.org/java/docs/queryparsersyntax.html
    • Doesn’t always play nice, so beware
      • Many applications construct queries programmatically or restrict the syntax (see the sketch below)
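A sketch of the programmatic equivalent of the syntax example above, built without the QueryParser (the field names are taken from that example):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// title:hockey +(body:stanley AND body:cup), constructed programmatically
BooleanQuery inner = new BooleanQuery();
inner.add(new TermQuery(new Term("body", "stanley")), BooleanClause.Occur.MUST);
inner.add(new TermQuery(new Term("body", "cup")), BooleanClause.Occur.MUST);

BooleanQuery query = new BooleanQuery();
query.add(new TermQuery(new Term("title", "hockey")), BooleanClause.Occur.SHOULD);
query.add(inner, BooleanClause.Occur.MUST);
```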
  • How to Search
    • Create/Get an IndexSearcher
    • Create a Query
      • Use a QueryParser
      • Construct it programmatically
    • Display the results from the TopDocs
      • Retrieve Field values from Document
    • More tomorrow on the search lifecycle; a minimal search sketch follows
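A minimal search sketch to pair with the indexing sketch earlier; the index path, field names, and query string are assumptions:

```java
import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class SimpleSearcher {
  public static void main(String[] args) throws Exception {
    IndexSearcher searcher = new IndexSearcher(FSDirectory.getDirectory(new File("index")));

    // Use the same Analyzer at query time that was used at index time.
    QueryParser parser = new QueryParser("body", new StandardAnalyzer());
    Query query = parser.parse("stanley AND cup");

    TopDocs topDocs = searcher.search(query, 10); // top 10 hits, ranked by score
    for (ScoreDoc sd : topDocs.scoreDocs) {
      Document doc = searcher.doc(sd.doc);        // retrieve stored Field values
      System.out.println(sd.score + " " + doc.get("title"));
    }
    searcher.close();
  }
}
```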
  • Task 1.b
    • Using the ReutersIndexerTest.java skeleton in the boot camp files
      • Search your newly created index using queries you develop
    • Questions:
      • What is the default field for the QueryParser?
      • What Analyzer to use?
    • ~20 minutes
  • Task 1 Results
    • Scores across queries are NOT comparable
      • They may not even be comparable for the same query over time (if the index changes)
    • Performance
      • Caching
      • Warming
      • More Tomorrow
  • Lunch 1-2:30
  • Discussion/Questions
    • So far, we’ve seen the basics of search and indexing
    • Next, we will look at Analysis and the contrib modules
  • Analysis
    • Analysis is the process of creating Tokens to be indexed
    • Analysis is usually done to improve results overall, but it comes with a price
    • Lucene comes with many different Analyzers, Tokenizers, and TokenFilters, each with their own goals
    • StandardAnalyzer is included with the core JAR and does a good job for most English and Latin-based tasks
    • Often you want the same content analyzed in different ways
    • Consider a catch-all Field in addition to other Fields
  • Solr’s Analysis tool
    • If you use nothing else from Solr, the Admin analysis tool can really help you understand analysis
    • Download Solr and unpack it
    • cd apache-solr-1.3.0/example
    • java -jar start.jar
    • http://localhost:8983/solr/admin/analysis.jsp
  • Analyzers
    • StandardAnalyzer, WhitespaceAnalyzer, SimpleAnalyzer
    • Contrib/analysis
      • Suite of Analyzers for many common situations
        • Languages
        • n-grams
        • Payloads
    • Contrib/snowball
  • Tokenization
    • Split words into Tokens to be processed
    • Tokenization is fairly straightforward for most languages that use a space for word segmentation
      • More difficult for some East Asian languages
      • See the CJK Analyzer
  • Modifying Tokens
    • TokenFilters are used to alter the token stream to be indexed
    • Common tasks:
      • Remove stopwords
      • Lower case
      • Stem/Normalize (e.g., Wi-Fi -> Wi Fi)
      • Add Synonyms
    • StandardAnalyzer does things that you may not want
  • Payloads
    • Associate an arbitrary byte array with a term in the index
    • Uses
      • Part of Speech
      • Font weight
      • URL
    • Currently searchable using the BoostingTermQuery (a payload sketch follows)
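A sketch of attaching a payload during analysis, using the Lucene 2.4 reusable-token API. PayloadTagFilter is a hypothetical name, and a real filter would compute per-token bytes (a part-of-speech tag, a font weight, and so on) rather than a constant:

```java
import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.Payload;

// Attaches the same (illustrative) byte array to every token's payload.
public class PayloadTagFilter extends TokenFilter {
  private final byte[] tag;

  public PayloadTagFilter(TokenStream input, byte[] tag) {
    super(input);
    this.tag = tag;
  }

  public Token next(Token reusableToken) throws IOException {
    Token token = input.next(reusableToken);
    if (token != null) {
      token.setPayload(new Payload(tag));
    }
    return token;
  }
}
```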
  • n-grams
    • Combine units of content together into a single token
    • Character
      • 2-grams for the word “Lucene”:
        • Lu, uc, ce, en, ne
      • Can make search possible when data is noisy or hard to tokenize
    • Word (“shingles” in Lucene parlance)
      • Pseudo-phrases (a character n-gram sketch follows)
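A quick character n-gram sketch using the contrib NGramTokenizer; the printing loop assumes the Lucene 2.4 reusable-token API:

```java
import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.ngram.NGramTokenizer; // contrib/analyzers

// Emit character 2-grams for "Lucene": Lu, uc, ce, en, ne
NGramTokenizer ngrams = new NGramTokenizer(new StringReader("Lucene"), 2, 2);
Token reusable = new Token();
for (Token t = ngrams.next(reusable); t != null; t = ngrams.next(reusable)) {
  System.out.println(t.term());
}
```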
  • Custom Analyzers
    • Problem: none of the Analyzers cover my problem
    • Solution: write your own Analyzer
    • Better solution: write a configurable Analyzer so you only need one Analyzer that you can easily change for your projects
      • See Solr (and the sketch below)
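One possible shape for such an Analyzer, assuming the common chain of StandardTokenizer, LowerCaseFilter, and StopFilter; the class name and constructor are illustrative:

```java
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// A configurable Analyzer: the stop word list is passed in, so one class
// can be reused across projects with different configurations.
public class ConfigurableAnalyzer extends Analyzer {
  private final String[] stopWords;

  public ConfigurableAnalyzer(String[] stopWords) {
    this.stopWords = stopWords;
  }

  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream stream = new StandardTokenizer(reader);
    stream = new LowerCaseFilter(stream);
    stream = new StopFilter(stream, stopWords);
    return stream;
  }
}
```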
  • Analysis APIs
    • Have a look at the TokenStream and Token APIs
    • Tokens and TokenStreams may be reused
      • Helps reduce allocations and speeds up indexing
      • Not all Analysis can take advantage: caching
      • Analyzer.reusableTokenStream()
      • TokenStream.next(Token) (used in the sketch below)
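A sketch of consuming a TokenStream with a single reusable Token, per the 2.4 API above; the field name and text are placeholders:

```java
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;

// Reuse one Token instance across the whole stream to cut per-token allocations.
Analyzer analyzer = new WhitespaceAnalyzer();
TokenStream stream = analyzer.tokenStream("body", new StringReader("some example text"));
Token reusable = new Token();
for (Token t = stream.next(reusable); t != null; t = stream.next(reusable)) {
  System.out.println(t.term() + " @ " + t.startOffset() + "-" + t.endOffset());
}
```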
  • Special Cases
    • Dates and numbers need special treatment to be searchable (see the DateTools sketch after this list)
      • o.a.l.document.DateTools
      • org.apache.solr.util.NumberUtils
    • Altering Position Information
      • Increase Position Gap between sentences to prevent phrases from crossing sentence boundaries
      • Index synonyms at the same position so query can match regardless of synonym used
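A sketch of indexing a date with DateTools; the field name and resolution are assumptions. Pick the coarsest Resolution your queries need, to keep the term count down:

```java
import java.util.Date;
import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Dates become lexicographically sortable strings, e.g. "20081103" at DAY resolution.
String indexable = DateTools.dateToString(new Date(), DateTools.Resolution.DAY);
Document doc = new Document();
doc.add(new Field("pubdate", indexable, Field.Store.YES, Field.Index.NOT_ANALYZED));
```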
  • Task 2
    • Take 15-20 minutes and write an Analyzer/Tokenizer/TokenFilter and Unit Test
      • Examine results in Luke
      • Run some searches
    • Ideas:
      • Combine existing Tokenizers and TokenFilters
      • Normalize abbreviations
      • Add payloads
      • Filter out all words beginning with the letter A
      • Identify/Mark sentences
  • Discussion
    • What did you implement?
    • What issues do you face with your content?
    • To Stem or not to Stem?
    • Stopwords: good or bad?
    • Tradeoffs of different techniques
  • Lucene Contributions
    • Many people have generously contributed code to help solve common problems
    • These are in contrib directory of the source
    • Popular:
      • Analyzers
      • Highlighter
      • Queries and MoreLikeThis
      • Snowball Stemmers
      • Spellchecker
  • Highlighter
    • Highlight query keywords in context
      • Often useful for display purposes
    • Important Classes:
      • Highlighter - Main entry point, coordinates the work
      • Fragmenter - Splits up document for scoring
      • Formatter - Marks up the results
      • Scorer - Scores the fragments
        • SpanScorer - Can score phrases
    • Use term vectors for performance
    • Look at the example usage (a sketch follows)
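A sketch of the typical highlighter call; the query, field name, and stored body text are assumptions, and SimpleHTMLFormatter wraps matches in <B>…</B> by default:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;

// Mark up the best-scoring fragment of the stored body text for this query.
Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter(), new QueryScorer(query));
String fragment = highlighter.getBestFragment(new StandardAnalyzer(), "body", bodyText);
System.out.println(fragment);
```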
  • Spell Checking
    • Suggest spelling corrections based on spellings of words in the index
      • May suggest words that are themselves misspelled, since suggestions come from the index
    • Uses a distance measure to determine suggestions
      • Can also factor in document frequency
      • Distance Measure is pluggable
  • Spell Checking
    • Classes: SpellChecker, StringDistance
    • See ContribExamplesTest
    • Practical aspects:
      • It’s not as simple as just turning it on
      • Good results require testing and tuning
        • Pay attention to accuracy settings
        • Mind your Analysis (simple, no stemming)
        • Consider an alternate StringDistance (e.g., JaroWinklerDistance); a sketch follows
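A sketch of wiring the contrib SpellChecker up to an existing index; 'reader' is assumed to be an open IndexReader on that index, and the field name is illustrative:

```java
import org.apache.lucene.search.spell.JaroWinklerDistance;
import org.apache.lucene.search.spell.LuceneDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

// Build the spelling index from the terms of one field, then ask for suggestions.
Directory spellDir = new RAMDirectory();
SpellChecker spellChecker = new SpellChecker(spellDir);
spellChecker.setStringDistance(new JaroWinklerDistance()); // default is Levenshtein
spellChecker.indexDictionary(new LuceneDictionary(reader, "body"));
String[] suggestions = spellChecker.suggestSimilar("lucen", 5);
```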
  • More Like This
    • Given a Document, find other Documents that are similar
      • Variation on relevance feedback
      • “Find Similar”
    • Extracts the most important terms from a Document and creates a new query
      • Many options available for determining important terms
    • Classes: MoreLikeThis
      • See ContribExamplesTest (and the sketch below)
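A MoreLikeThis sketch; 'reader' and 'docId' are assumed to be an open IndexReader and the internal document number of the source document:

```java
import org.apache.lucene.search.Query;
import org.apache.lucene.search.similar.MoreLikeThis; // contrib/queries

// Mine the important terms of an indexed document and build a new query from them.
MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setFieldNames(new String[] { "body" }); // which fields to mine for terms
mlt.setMinTermFreq(2);                      // tune what counts as "important"
mlt.setMinDocFreq(2);
Query likeQuery = mlt.like(docId);
```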
  • Summary
    • Indexing
    • Searching
    • Analysis
    • Contrib
    • Questions?
  • Resources
    • http://lucene.apache.org/
    • http://en.wikipedia.org/wiki/Vector_space_model
    • Modern Information Retrieval by Baeza-Yates and Ribeiro-Neto
    • Lucene In Action by Hatcher and Gospodnetić
    • Wiki
    • Mailing Lists
      • java-user@lucene.apache.org
        • Discussions on how to use Lucene
      • java-dev@lucene.apache.org
        • Discussions on how to develop Lucene
    • Issue Tracking
      • https://issues.apache.org/jira/secure/Dashboard.jspa
    • We always welcome patches
      • Ask on the mailing list before reporting a bug
  • Resources
    • [email_address]
    • Lucid Imagination
      • Support
      • Training
      • Value Add
      • [email_address]