Introduction to Full-Text Search

Speaker notes
  • I won't delve into specifics or actual implementations. I'll present the main concepts from Information Retrieval theory, plus the essential components you should be aware of when dealing with any full-text search system. If there's interest, a future presentation could cover an actual implementation (Lucene, in my case).
  • Java Web Developer-ish. For the last 4 years I've worked mostly on electronic publishing applications: processing/searching/displaying various content sets of various sizes. Passion for big data and lots of it. (Last weekend I was parallelizing indexing of an 800K-document set so it uses as many cores as possible. On Friday I was indexing a data set of 5.8M documents...)
  • about full-text search, or search in general
  • take your pick: lots of pictures, lots of friends, lots of blog posts
  • actually, scratch that..
  • much better..
  • full-text search is usually VERY fast, and by customizing it you can make it even faster where your specific application needs it most.
  • Depending on your content and users you can have very specific relevance criteria. You can surprise your users with the quality of results.
  • various needs for various content: bitch about imobiliare.ro not having search within the text or very dynamic filters (example: you cannot search for apartments to rent with internet access); bitch about the geekmeet.ro WordPress search not being able to filter by category (Timisoara in this case)
  • "index" = where you add items which you want to find and where you search for them."document" = the basic unit of indexing/searching. Usually one row from the search results list. Could be a book, a chapter, a page, a URL, etc.
  • Observe the sorting. More on this later...
  • not quite boolean, but simple enough to understand..
  • actual implementations vary and it usually shouldn't matter. Just remember that there are fields and documents and each indexed term is indexed for a specific field.
  • I'm going Lucene here, but any good index/search API will let you customize this process. This is as many have found a good way to structure your process.
  • punctuation and various mixes of upper/lower-case in tokens.
  • Bitch about tokenizer/filter options (or lack thereof in Sphinx/MySQL)…

Transcript

  • 1. Introduction to Full-text search
  • 2. About me
    Full-time (Mostly) Java Developer
    Part-time general technical/sysadmin/geeky guy
    Interested in: hard problems, search, performance, parallelism, scalability
  • 3. Why should you care?
  • 4. Because every application needs search
  • 5. We live in an era of big, complex and connected applications.
  • 6. That means a lot of data
  • 7. But it's no use if you can't find anything!
  • 8. But it's no use if you can't quickly find something relevant!
  • 9. Quick
  • 10. Relevant
  • 11. Customized Experience
  • 12. Deathy's Tip
    You can't win by being generic, but you can be the best for your specific type of content.
  • 13. So back to our full-text search...
  • 14. Some core ideas
    "index" (or "inverted index")
    "document"
  • 15. Deathy’s Tip
    Don't be too quick to decide what a "document" is. Put some thought into it or you'll regret it (speaking from a lot of experience).
  • 16. First we need some documents, more specifically some text samples
  • 17. Documents
    Doc1: "The cow says moo"
    Doc2: "The dog says woof"
    Doc3: "The cow-dog says moof“
    "Stolen" from http://www.slideshare.net/tomdyson/being-google
  • 18. Important: individual words are the basis for the index
  • 19. Individual words
    index = [
    "cow",
    "dog",
    "moo",
    "moof",
    "The",
    "says",
    "woof"
    ]
  • 20. For each word we have a list of documents to which it belongs
  • 21. Words, with appearances
    index = {
    "cow": ["Doc1", "Doc3"],
    "dog": ["Doc2", "Doc3"],
    "moo": ["Doc1"],
    "moof": ["Doc3"],
    "The": ["Doc1", "Doc2", "Doc3"],
    "says": ["Doc1", "Doc2", "Doc3"],
    "woof": ["Doc2"]
    }
  • 22. Q1: Find documents which contain "moo"
    A1: index["moo"]
  • 23. Q2: Find documents which contain "The" and "dog"
    A2: set(index["The"]) & set(index["dog"])
  • 24. Try to think of search as unions/intersections or other filters on sets.
  • 25. Most searches use simple terms and "boolean" operators.
  • 26. “boolean”
    "word" - word MAY/SHOULD appear in document
    "+word" - word MUST appear in document
    "-word" - word MUST NOT appear in document
  • 27. Example
    Query: "+type:book content:java content:python -content:ruby"
    Find books with "java" or "python" in the content, but which don't contain "ruby" in the content.
  • 28. Err...wait...what the hell does "content:java" mean?
  • 29. Reviewing the "document" concept
  • 30. An index consists of one or more documents
  • 31. Each document consists of one or more "fields". Each field has a name and content.
  • 32. Field examples
    content
    title
    author
    publication date
    etc.
  • 33. So how are fields handled internally?
    In most cases very simply: a word belongs to a specific field, so the field name can be stored in the term directly.
  • 34. New index example
    index = {
    "content:cow": ["Doc1", "Doc3"],
    "content:dog": ["Doc2", "Doc3"],
    "content:moo": ["Doc1"],
    "content:moof": ["Doc3"],
    "content:The": ["Doc1", "Doc2", "Doc3"],
    "content:says": ["Doc1", "Doc2", "Doc3"],
    "content:woof": ["Doc2"],
    "type:example_documents": ["Doc1", "Doc2", "Doc3"]
    }
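
A sketch of indexing with field-prefixed terms as on slide 34, then evaluating the query from slide 27 (the two example books are made up here to exercise the query):

    def add_document(index, doc_id, fields):
        # Prefix each word with its field name, e.g. "content:java".
        for field, text in fields.items():
            for word in text.split():
                index.setdefault(field + ":" + word, set()).add(doc_id)

    index = {}
    add_document(index, "Book1", {"type": "book", "content": "java in depth"})
    add_document(index, "Book2", {"type": "book", "content": "python and ruby"})

    # +type:book content:java content:python -content:ruby
    books = index.get("type:book", set())
    optional = index.get("content:java", set()) | index.get("content:python", set())
    forbidden = index.get("content:ruby", set())
    print((books & optional) - forbidden)   # {'Book1'}
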
  • 35. But enough of that
  • 36. We missed the most important thing!
  • 37. We saved the most important thing for last!
  • 38. Analysis
  • 39. or for mortals: how you get from a long text to small tokens/words/terms
  • 40. …borrowing from Lucene naming/API...
  • 41. (One) Tokenizer
  • 42. and zero or more Filters
  • 43. First...
  • 44. Some more interesting documents
    Doc1: "The quick brown fox jumps over the lazy dog"
    Doc2: "All Daleks: Exterminate! Exterminate! EXTERMINATE!! EXTERMINATE!!!"
    Doc3: "And the final score is: no TARDIS, no screwdriver, two minutes to spare. Who da man?!"
  • 45. Tokenizer: Breaks up a single string into smaller tokens.
  • 46. You define what splitting rules are best for you.
  • 47. Whitespace Tokenizer
    Just break into tokens wherever there is some space. So we get something like:
  • 48. Doc1: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
    Doc2: ["All", "Daleks:", "Exterminate!", "Exterminate!", "EXTERMINATE!!", "EXTERMINATE!!!"]
    Doc3: ["And", "the", "final", "score", "is:", "no", "TARDIS,", "no", "screwdriver,", "two", "minutes", "to", "spare.", "Who", "da", "man?!"]
  • 49. But wait, that doesn't look right...
  • 50. So we apply Filters
  • 51. Filter
    transforms a single token into another single token, multiple tokens, or no token at all
    several filters can be applied, in a specific order
  • 52. Filter 1: lower-case (since we don't want the search to be case-sensitive)
  • 53. Result
    Doc1: ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
    Doc2: ["all", "daleks:", "exterminate!", "exterminate!", "exterminate!!", "exterminate!!!"]
    Doc3: ["and", "the", "final", "score", "is:", "no", "tardis,", "no", "screwdriver,", "two", "minutes", "to", "spare.", "who", "da", "man?!"]
  • 54. Filter 2: remove punctuation
  • 55. Result
    Doc1: ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
    Doc2: ["all", "daleks", "exterminate", "exterminate", "exterminate", "exterminate"]
    Doc3: ["and", "the", "final", "score", "is", "no", "tardis", "no", "screwdriver", "two", "minutes", "to", "spare", "who", "da", "man"]
  • 56. Add more filter seasoning until it tastes just right.
  • 57. Lots of things you can do with filters
    case normalization
    removing unwanted/unneeded characters
    transliteration/normalization of special characters
    stopwords
    synonyms
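
As one example from this list, a stopword filter is just a few lines in the same style (the stopword set here is a tiny illustrative subset, not a real list):

    STOPWORDS = {"the", "a", "an", "and", "is", "to", "no"}

    def stopword_filter(tokens):
        # Drop very common words that carry little search value.
        return [t for t in tokens if t not in STOPWORDS]

    print(stopword_filter(["and", "the", "final", "score", "is", "no", "tardis"]))
    # ['final', 'score', 'tardis']
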
  • 58. Possibilities are endless, enjoy experimenting with them!
  • 59. Just one warning…
  • 60. Always use the same analysis rules when indexing and when parsing search text entered by the user!
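
A sketch of what that means in code: one analyze() function (reusing the one from the filter sketch above) shared by the indexing path and the query path, so both sides see identical terms:

    def index_document(index, doc_id, text):
        for token in analyze(text):            # same analysis chain...
            index.setdefault(token, set()).add(doc_id)

    def query(index, text):
        tokens = analyze(text)                 # ...reused on the query text
        sets = [index.get(t, set()) for t in tokens]
        return set.intersection(*sets) if sets else set()

    index = {}
    index_document(index, "Doc2", "All Daleks: Exterminate!")
    print(query(index, "EXTERMINATE, Daleks!"))   # {'Doc2'}
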
  • 61. I bet you want to start working with this
  • 62. Implementations
    Lucene (Java mainly; also .NET, Python, C++ ports)
    Solr, if using it from other languages
    Xapian
    Sphinx
    OpenFTS
    MySQL Full-Text Search (kind of…)
  • 63. Related Books
  • 64. The theory
    Introduction to Information Retrieval
    http://nlp.stanford.edu/IR-book/information-retrieval-book.html
    Warning: contains a lot of math.
  • 65. The practice (for Lucene at least):
    Lucene in Action, second edition:
    http://www.manning.com/hatcher3/
    Warning: contains a lot of Java.
  • 66. Questions?
  • 67. Contact me
    (with interesting problems involving lots of data)
    @deathy
    cristian.vat@gmail.com
    http://blog.deathy.info/ (yeah…I know…)
  • 68. Fin.
  • 69. So where’s the Halloween Party?
    Happy Halloween!