SEARCH ME
Using Lucene.Net In Your Apps
About Me
   Zachary Johnson Gramana
   Engineer at Potts Consulting Group
   Proud new father of Rex
Search is...
   A vague term that encompasses multiple
    problems.
   A better term is “information retrieval” (IR)
    system.
   Interdisciplinary, drawing from:
     computer science (parsing, data structures)
     psychology (query grammar, human/computer
      interaction)
     linguistics (textual analysis)
     information science (scoring/relevancy)
     maths (document retrieval strategy)
Problems Solved
    Information Overload
      Find the information that users want,
       not just the information they asked for.
    Transparently handle all kinds of data:
      structured (hierarchical)
      semi-structured (markup)
      unstructured data (plain text)
    Single portal to multiple data types and
     sources.
    Do it fast!
Basic IR System Capabilities
   Collection (importing, crawling)
       Anonymous web page crawling (google)
       User-uploaded photographs (flickr)
       Publisher upload of .mp3 files (iTunes)
   Indexing
       Analysis
       Modify index data structure
   Querying
       Input parsing
       Query generation & execution
       Collecting the results
       Filtering the results (optional)
What is Lucene.Net?
   Port of the Apache Software Foundation's Lucene
    libraries from Java to C#.
   It's a search library.
   Lucene was created by Doug Cutting.
   Named after his wife's middle name.
   First released in 2000 on SourceForge.
   Migrated to the Apache Software Foundation in 9/2001.
Used By
   StackOverflow
   JIRA
   IBM
   Akamai
   Apple
   Autodesk
   Orchard
   RavenDB
   CouchDB
What Lucene.Net Isn't
   Not a complete information retrieval system.
       Check out Google Search Appliance instead:
        http://www.google.com/enterprise/search/
   Not a web crawler.
       Check out Arachnode instead:
        http://arachnode.net
   Not a query service.
       Check out Solr instead:
        http://lucene.apache.org/solr
   Not hard.
       Check out Windows Search SDK instead:
        http://bit.ly/ImRtMk
Concept and Overview
What's in an Index?
   Stores a collection of Documents, each of
    which represents a source record.
   Documents contain:
     Metadata about the source record.
     (optionally) actual data from the source record.
     (optionally) derived analytical products.
   Documents store a collection of
    token/frequency pairs (optionally with positions),
    plus a document identifier.
Lucene's Index Structure
   Documents store a collection of fields.
   Fields are collections of terms, plus an identifier and
    optional term vectors.
   Terms are string key-value pairs of a field name and
    a string value.
   Lucene provides special classes to deal with tricky
    data, like the NumericField class.
   Term vectors are terms, along with their frequency
    counts and positions.
   Fields can be indexed, stored, or both.
       Storing allows a field's value to be retrieved after indexing.
       Indexing adds the field's terms to Lucene's inverted index.
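
The indexed/stored distinction above can be sketched with a toy model. This is a minimal illustration in Python, not the Lucene.Net API: the ToyIndex class, its method names, and the sample field values are all hypothetical, and whitespace splitting stands in for real analysis.

```python
from collections import defaultdict

class ToyIndex:
    """Toy model of indexed vs. stored fields (NOT the Lucene.Net API)."""

    def __init__(self):
        self.postings = defaultdict(set)  # (field, term) -> set of doc ids
        self.stored = {}                  # doc id -> {field: original value}

    def add_document(self, fields):
        """fields: list of (name, value, indexed, stored) tuples."""
        doc_id = len(self.stored)
        self.stored[doc_id] = {}
        for name, value, indexed, stored in fields:
            if indexed:
                # Indexed: terms become searchable through the postings.
                for token in value.lower().split():
                    self.postings[(name, token)].add(doc_id)
            if stored:
                # Stored: the original value can be read back later.
                self.stored[doc_id][name] = value
        return doc_id

    def search(self, field, term):
        return sorted(self.postings.get((field, term), set()))

index = ToyIndex()
doc = index.add_document([
    ("title", "Search Me", True, True),                      # indexed and stored
    ("body", "Using Lucene.Net in your apps", True, False),  # indexed only
    ("path", "/slides/search-me", False, True),              # stored only
])

print(index.search("body", "apps"))  # the document is found via its body...
print(index.stored[doc])             # ...but only title and path can be read back
```

In this sketch the body field is searchable but cannot be retrieved, while the path field can be retrieved but never matches a search.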
The Inverted Index
     (figure taken from Apple's excellent “Search Basics” article at http://bit.ly/JO59kH )
Lucene's Index Structure
   What is an 'inverted index'?
     “verted” (forward) index: a document points to a collection of
      terms
     inverted index: a term points to a collection of
      documents
   An index consists of one or more segments
     Each is a self-contained, independent partition of the
      entire index.
     Stores: field names, field values, term dictionary,
      term frequencies, term proximities, normalization
      factor, term vectors, and (optional) deleted record
      lookup table.
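
The forward/inverted distinction can be shown in a few lines. This is a minimal Python sketch with made-up documents; it says nothing about how Lucene lays out segments on disk, and the invert function is a hypothetical helper.

```python
from collections import defaultdict

# Forward index: each document id points to its terms, in order.
forward_index = {
    1: ["lucene", "is", "a", "search", "library"],
    2: ["solr", "is", "a", "search", "server"],
}

def invert(forward):
    """Build term -> [(doc_id, position), ...] from doc_id -> [terms]."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward):
        for position, term in enumerate(forward[doc_id]):
            inverted[term].append((doc_id, position))
    return dict(inverted)

inverted_index = invert(forward_index)

# Looking up a term is now a single dictionary access
# instead of a scan over every document.
print(inverted_index["search"])
```

Keeping the position next to each document id is what makes phrase and proximity matching possible later on.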
Analysis
     (figure taken from Apple's excellent “Search Basics” article at http://bit.ly/JO59kH )
Tokenization
     (figure taken from Thomas Koch's presentation “Search Basics and Lucene” at http://bit.ly/JTOrnH )
Tokenization
   Normalization: “Gramåna” > “gramana”
   Stemming: “preschooling” > “preschool”
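
A rough sketch of an analysis chain (tokenize, normalize, stem) follows. It is written in Python using only the standard library; the function names are hypothetical, and naive_stem is a deliberately crude suffix stripper standing in for a real stemmer such as Snowball. Because it only removes suffixes, prefixes like "pre" are left in place.

```python
import re
import unicodedata

def normalize(token):
    """Lowercase the token and strip diacritics (e.g. 'å' becomes 'a')."""
    decomposed = unicodedata.normalize("NFKD", token.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def naive_stem(token):
    """Strip one common English suffix; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text):
    """Tokenize on word characters, then normalize and stem each token."""
    tokens = re.findall(r"\w+", text)
    return [naive_stem(normalize(t)) for t in tokens]

print(analyze("Gram\u00e5na went preschooling"))
```

The same chain has to run at index time and at query time, otherwise the terms in the query will not match the terms in the index.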
Norms
    (figure taken from Thomas Koch's presentation “Search Basics and Lucene” at http://bit.ly/JTOrnH )
Time to Look at Some Code
Getting a Query
   Two options:
     Parse a search string using a QueryParser class.
     Programmatically build a query.
   QueryParser can build very complex queries
    very quickly, but requires the user to provide a
    query string.
   Programmatically building a query requires less
    overhead for simple queries.
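
The two routes lead to the same kind of object. The Python sketch below is a toy: the TermQuery and BooleanQuery dataclasses only borrow their names from Lucene, and the parse function handles a tiny, assumed grammar ("field:text" clauses joined by " AND "), far less than the real QueryParser accepts.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TermQuery:
    field: str
    text: str

@dataclass(frozen=True)
class BooleanQuery:
    must: tuple  # every clause must match

def parse(query_string, default_field="body"):
    """Option 1: turn a user-supplied string into a query object."""
    clauses = []
    for part in query_string.split(" AND "):
        field, sep, text = part.strip().partition(":")
        if not sep:  # no "field:" prefix, so fall back to the default field
            field, text = default_field, field
        clauses.append(TermQuery(field, text.lower()))
    return clauses[0] if len(clauses) == 1 else BooleanQuery(tuple(clauses))

# Option 1: parse a search string.
parsed = parse("title:Lucene AND search")

# Option 2: build the same query in code, with no string involved.
built = BooleanQuery((TermQuery("title", "lucene"), TermQuery("body", "search")))

print(parsed == built)
```

Parsing is convenient when the query comes from a search box; building in code avoids the parsing step when the application already knows the fields and values.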
General Query Types
     (figure taken from the Wikipedia entry “Information Retrieval” at http://bitly.com/T1Qbw)
Some Lucene Query Types
   TermQuery (general purpose)
   BooleanQuery
   MultiPhraseQuery
   SpanQuery
   WildcardQuery
   FilteredQuery
   MoreLikeThisQuery
   BoostingQuery
   FuzzyQuery
   ConstantScoreRangeQuery
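
To give a feel for what a few of these query types do, here is a toy evaluation against a hand-written postings table. It is a Python sketch under loud assumptions: the function names merely echo TermQuery, WildcardQuery, and BooleanQuery, the postings data is invented, scoring is ignored entirely, and "should" clauses are only consulted when there are no "must" clauses.

```python
from fnmatch import fnmatchcase

# Invented postings: term -> set of matching document ids.
postings = {
    "lucene": {1, 2},
    "search": {1, 3},
    "solr": {3},
    "server": {3, 4},
}

def term_query(term):
    """Documents containing exactly this term."""
    return set(postings.get(term, set()))

def wildcard_query(pattern):
    """Documents containing any term that matches a shell-style pattern."""
    hits = set()
    for term, docs in postings.items():
        if fnmatchcase(term, pattern):
            hits |= docs
    return hits

def boolean_query(must=(), should=(), must_not=()):
    """Combine result sets: AND over must, else OR over should, minus must_not."""
    if must:
        hits = set.intersection(*must)
    else:
        hits = set().union(*should)
    for excluded in must_not:
        hits -= excluded
    return hits

result = boolean_query(
    must=[term_query("search"), wildcard_query("s*")],
    must_not=[term_query("solr")],
)
print(sorted(result))
```

Each query type reduces to set operations over postings lists, which is why the inverted index makes them fast.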
Time to Look at More Code
Lucene.Net Contribs
   Spatial (geo-spatial search)
   Similarity
   SimpleFacetedSearch
   Highlighter
   SpellChecker
   WordNet (synonyms)
   Snowball (stemming library)
   RegEx
That's All!
Thanks for your time and attention.

twitter: @zgramana
blog: http://www.excitabyte.com/
Email: zgramanaATgee mail dot com
