Introduction to Full-Text Search

About me
Full-time (Mostly) Java Developer
Part-time general technical/sysadmin/geeky guy
Interested in: hard problems, search, performance, parallelism, scalability
Why should you care?

Because every application needs search

We live in an era of big, complex and connected applications.

That means a lot of data

But it's no use if you can't find anything!

But it's no use if you can't quickly find something relevant!
Quick

Relevant

Customized Experience

Deathy's Tip
You can't win by being generic, but you can be the best for your specific type of content.

So back to our full-text search...
Some core ideas
"index" (or "inverted index")
"document"

Deathy's Tip
Don't be too quick in deciding what a "document" is. Put some thought into it or you'll regret it (speaking from a lot of experience).

First we need some documents, more specifically some text samples

Documents
Doc1: "The cow says moo"
Doc2: "The dog says woof"
Doc3: "The cow-dog says moof"
"Stolen" from http://www.slideshare.net/tomdyson/being-google

Important: individual words are the basis for the index
Individual words
index = [
    "cow",
    "dog",
    "moo",
    "moof",
    "The",
    "says",
    "woof"
]

For each word we have a list of documents to which it belongs

Words, with appearances
index = {
    "cow": ["Doc1", "Doc3"],
    "dog": ["Doc2", "Doc3"],
    "moo": ["Doc1"],
    "moof": ["Doc3"],
    "The": ["Doc1", "Doc2", "Doc3"],
    "says": ["Doc1", "Doc2", "Doc3"],
    "woof": ["Doc2"]
}
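As a rough sketch (not from the slides), here is how that index could be built in plain Python, splitting on non-word characters so that "cow-dog" contributes both "cow" and "dog":

import re

docs = {
    "Doc1": "The cow says moo",
    "Doc2": "The dog says woof",
    "Doc3": "The cow-dog says moof",
}

index = {}
for doc_id, text in docs.items():
    # \w+ splits "cow-dog" into "cow" and "dog", matching the index above
    for word in re.findall(r"\w+", text):
        postings = index.setdefault(word, [])
        if doc_id not in postings:  # one entry per document
            postings.append(doc_id)

index["cow"]  # -> ["Doc1", "Doc3"]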
Q1: Find documents which contain "moo"
A1: index["moo"]

Q2: Find documents which contain "The" and "dog"
A2: set(index["The"]) & set(index["dog"])

Try to think of search as unions/intersections or other filters on sets.

Most searches use simple terms and "boolean" operators.
"boolean"
"word"  - word MAY/SHOULD appear in document
"+word" - word MUST appear in document
"-word" - word MUST NOT appear in document

Example
Query: "+type:book content:java content:python -content:ruby"
Find books, with "java" or "python" in content but which don't contain "ruby" in content.
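One possible way to evaluate those operators over the toy index from earlier, as set operations (the search() helper is made up for illustration; real engines treat SHOULD clauses more subtly, e.g. for scoring):

def search(index, must=(), should=(), must_not=()):
    all_docs = {d for postings in index.values() for d in postings}
    results = set(all_docs)
    for term in must:                      # "+word": intersect
        results &= set(index.get(term, []))
    if should:                             # "word": union of the optional terms
        optional = set()
        for term in should:
            optional |= set(index.get(term, []))
        results &= optional
    for term in must_not:                  # "-word": subtract
        results -= set(index.get(term, []))
    return results

search(index, must=["says"], should=["moo", "woof"], must_not=["cow"])
# -> {"Doc2"}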
Err...wait...what the hell does "content:java" mean?

Reviewing the "document" concept

An index consists of one or more documents

Each document consists of one or more "field"s. Each field has a name and content.

Field examples
content
title
author
publication date
etc.
So how are fields handled internally?
In most cases very simply: a word belongs to a specific field, so the field name can be stored in the term directly.

New index example
index = {
    "content:cow": ["Doc1", "Doc3"],
    "content:dog": ["Doc2", "Doc3"],
    "content:moo": ["Doc1"],
    "content:moof": ["Doc3"],
    "content:The": ["Doc1", "Doc2", "Doc3"],
    "content:says": ["Doc1", "Doc2", "Doc3"],
    "content:woof": ["Doc2"],
    "type:example_documents": ["Doc1", "Doc2", "Doc3"]
}
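Field-prefixed terms are ordinary keys in the same dictionary, so a fielded query reduces to the same set operations (reusing the hypothetical search() sketch from above):

index["content:cow"]  # -> ["Doc1", "Doc3"]

# "+type:example_documents content:moo content:woof -content:moof"
search(index,
       must=["type:example_documents"],
       should=["content:moo", "content:woof"],
       must_not=["content:moof"])
# -> {"Doc1", "Doc2"}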
But enough of that

We missed the most important thing!

We saved the most important thing for last!

Analysis

or for mortals: how you get from a long text to small tokens/words/terms

...borrowing from Lucene naming/API...

(One) Tokenizer

and zero or more Filters

First...
Some more interesting documents
Doc1: "The quick brown fox jumps over the lazy dog"
Doc2: "All Daleks: Exterminate! Exterminate! EXTERMINATE!! EXTERMINATE!!!"
Doc3: "And the final score is: no TARDIS, no screwdriver, two minutes to spare. Who da man?!"

Tokenizer: breaks up a single string into smaller tokens.

You define what splitting rules are best for you.

Whitespace Tokenizer
Just break into tokens wherever there is some space. So we get something like:

Doc1: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
Doc2: ["All", "Daleks:", "Exterminate!", "Exterminate!", "EXTERMINATE!!", "EXTERMINATE!!!"]
Doc3: ["And", "the", "final", "score", "is:", "no", "TARDIS,", "no", "screwdriver,", "two", "minutes", "to", "spare.", "Who", "da", "man?!"]
But wait, that doesn't look right...

So we apply Filters

Filter
transforms one single token into another single token, multiple tokens or no token at all
you can apply several of them, in a specific order

Filter 1: lower-case (since we don't want the search to be case-sensitive)

Result
Doc1: ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
Doc2: ["all", "daleks:", "exterminate!", "exterminate!", "exterminate!!", "exterminate!!!"]
Doc3: ["and", "the", "final", "score", "is:", "no", "tardis,", "no", "screwdriver,", "two", "minutes", "to", "spare.", "who", "da", "man?!"]

Filter 2: remove punctuation

Result
Doc1: ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
Doc2: ["all", "daleks", "exterminate", "exterminate", "exterminate", "exterminate"]
Doc3: ["and", "the", "final", "score", "is", "no", "tardis", "no", "screwdriver", "two", "minutes", "to", "spare", "who", "da", "man"]
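A sketch of those two filters chained after the tokenizer (the function names are made up; strip() only trims punctuation at the token edges, which is enough for these examples):

import string

def lowercase_filter(tokens):
    return [t.lower() for t in tokens]

def punctuation_filter(tokens):
    # trim punctuation from both ends; drop tokens that were punctuation only
    stripped = (t.strip(string.punctuation) for t in tokens)
    return [t for t in stripped if t]

def analyze(text):
    tokens = whitespace_tokenize(text)
    for f in (lowercase_filter, punctuation_filter):  # order matters
        tokens = f(tokens)
    return tokens

analyze("All Daleks: Exterminate! EXTERMINATE!!")
# -> ['all', 'daleks', 'exterminate', 'exterminate']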
Add more filter seasoning until it tastes just right.

Lots of things you can do with filters
case normalization
removing unwanted/unneeded characters
transliteration/normalization of special characters
stopwords
synonyms

Possibilities are endless, enjoy experimenting with them!
Just one warning...

Always use the same analysis rules when indexing and when parsing search text entered by the user!
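In terms of the sketches above, that just means the exact same analyze() chain runs on documents at index time and on the user's query string at search time:

def add_document(index, doc_id, text):
    for token in analyze(text):
        postings = index.setdefault(token, [])
        if doc_id not in postings:
            postings.append(doc_id)

def query(index, user_input):
    results = None
    for token in analyze(user_input):  # "EXTERMINATE!!" finds "exterminate"
        docs = set(index.get(token, []))
        results = docs if results is None else results & docs
    return results or set()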
I bet you want to start working with this

Implementations
Lucene (Java mainly; ports for .NET, Python, C++)
Solr, if using it from other languages
Xapian
Sphinx
OpenFTS
MySQL Full-Text Search (kind of...)
Related Books

The theory
Introduction to Information Retrieval
http://nlp.stanford.edu/IR-book/information-retrieval-book.html
Warning: contains a lot of math.

The practice (for Lucene at least)
Lucene in Action, second edition
http://www.manning.com/hatcher3/
Warning: contains a lot of Java.
Questions?

Contact me
(with interesting problems involving lots of data)
@deathy
cristian.vat@gmail.com
http://blog.deathy.info/ (yeah...I know...)

Fin.

So where's the Halloween Party?
Happy Halloween!
Speaker notes
  • I won&apos;t delve into specifics or actual implementations. I&apos;ll try to present main concepts which come from Information Retrieval theory and also essential components you should be aware of when dealing with any full-text search system. If interested, there could be a future presentation on actual implementations (Lucene in my case).
  • Java Web Developer-ish. Last 4 years worked mostly on electronic publishing applications: processing/searching/displaying various content sets of various sizes. Passion for big data and lots of it. ( Last weekend I was parallelizing indexing on a 800K document set so it uses as many cores as possible. On Friday I was indexing a data set of 5.8M documents... )
  • about fulltext search, or search in general
  • take your pick: lots of pictures, lots of friends, lots of blog posts
  • actually, scratch that..
  • much better..
  • fulltext search is usually VERY fast. and by adding your own custom one, you can make it faster for where your specific application needs it most.
  • Depending on your content and users you can have very specific relevance criteria. You can surprise your users with the quality of results.
  • various needs for various content- bitch about imobiliare.ro not having search in text or very dynamic filters. Example: cannot search for apartments to rent with internet access...- bitch about geekmeet.ro wordpress search not being able to filter based on category (Timisoara in this case)
  • &quot;index&quot; = where you add items which you want to find and where you search for them.&quot;document&quot; = the basic unit of indexing/searching. Usually one row from the search results list. Could be a book, a chapter, a page, a URL, etc.
  • Observe the sorting. More on this later...
  • not quite boolean, but simple enough to understand..
  • actual implementations vary and it usually shouldn&apos;t matter. Just remember that there are fields and documents and each indexed term is indexed for a specific field.
  • I&apos;m going Lucene here, but any good index/search API will let you customize this process. This is as many have found a good way to structure your process.
  • punctuation and various mixes of upper/lower-case in tokens.
  • Bitch about tokenizer/filter options (or lack thereof in Sphinx/MySQL)…