I won't delve into specifics or actual implementations. I'll try to present the main concepts, which come from Information Retrieval theory, and the essential components you should be aware of when dealing with any full-text search system. If there is interest, there could be a future presentation on actual implementations (Lucene in my case).
Java Web Developer-ish. For the last 4 years I have worked mostly on electronic publishing applications: processing/searching/displaying various content sets of various sizes. Passion for big data and lots of it. (Last weekend I was parallelizing indexing on an 800K document set so it uses as many cores as possible. On Friday I was indexing a data set of 5.8M documents...)
about fulltext search, or search in general
take your pick: lots of pictures, lots of friends, lots of blog posts
actually, scratch that..
Full-text search is usually VERY fast, and by adding your own custom one, you can make it faster where your specific application needs it most.
Depending on your content and users you can have very specific relevance criteria. You can surprise your users with the quality of results.
various needs for various content
- bitch about imobiliare.ro not having search in text or very dynamic filters. Example: cannot search for apartments to rent with internet access...
- bitch about geekmeet.ro wordpress search not being able to filter based on category (Timisoara in this case)
"index" = where you add items which you want to find and where you search for them.
"document" = the basic unit of indexing/searching. Usually one row from the search results list. Could be a book, a chapter, a page, a URL, etc.
Observe the sorting. More on this later...
not quite boolean, but simple enough to understand..
actual implementations vary and it usually shouldn't matter. Just remember that there are fields and documents and each indexed term is indexed for a specific field.
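To make the "fields and documents" idea concrete, here is a minimal sketch of an index keyed by (field, term). It is not how Lucene or any other engine stores things; the documents, the field names and the `build_index` / `search_all` helpers are all made up for illustration, and tokenizing is a plain lowercase-and-split.

```python
from collections import defaultdict

# Hypothetical documents: each one is a dict of field name -> text.
docs = {
    1: {"title": "quick fox", "body": "the quick brown fox jumps over the lazy dog"},
    2: {"title": "lazy dog", "body": "the dog sleeps all day"},
}

def build_index(docs):
    """Map (field, term) -> set of ids of documents with that term in that field."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            for term in text.lower().split():
                index[(field, term)].add(doc_id)
    return index

def search_all(index, field, terms):
    """Return ids of documents whose given field contains every term (a simple AND)."""
    postings = [index.get((field, term), set()) for term in terms]
    return set.intersection(*postings) if postings else set()

index = build_index(docs)
print(sorted(index[("title", "lazy")]))                   # [2]
print(sorted(index[("body", "lazy")]))                    # [1]
print(sorted(search_all(index, "body", ["the", "dog"])))  # [1, 2]
```

The same term ("lazy") points at different documents depending on the field, which is the one thing worth remembering here.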
I'm going Lucene here, but any good index/search API will let you customize this process. Many have found this a good way to structure the process.
punctuation and various mixes of upper/lower-case in tokens.
Bitch about tokenizer/filter options (or lack thereof in Sphinx/MySQL)…
Introduction to Full-Text Search
About me
Full-time (Mostly) Java Developer
Part-time general technical/sysadmin/geeky guy
Interested in: hard problems, search, performance, parallelism, scalability
Some more interesting documents
Doc1: "The quick brown fox jumps over the lazy dog"
Doc2: "All Daleks: Exterminate! Exterminate! EXTERMINATE!! EXTERMINATE!!!"
Doc3: "And the final score is: no TARDIS, no screwdriver, two minutes to spare. Who da man?!"
Tokenizer: Breaks up a single string into smaller tokens.
You define what splitting rules are best for you.
Whitespace Tokenizer
Just break into tokens wherever there is some space. So we get something like:
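A rough sketch of what a whitespace tokenizer does to Doc2, using nothing but Python's `str.split()`. The `whitespace_tokenize` name is made up for this example and is not any library's API.

```python
def whitespace_tokenize(text):
    """Split on runs of whitespace; punctuation and case are left untouched."""
    return text.split()

doc2 = "All Daleks: Exterminate! Exterminate! EXTERMINATE!! EXTERMINATE!!!"
tokens = whitespace_tokenize(doc2)
print(tokens)
# ['All', 'Daleks:', 'Exterminate!', 'Exterminate!', 'EXTERMINATE!!', 'EXTERMINATE!!!']
```

Note that the four "exterminate" tokens come out as three different strings, so a search for "exterminate" would match none of them as-is.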
Add more filter seasoning until it tastes just right.
Lots of things you can do with filters
case normalization
removing unwanted/unneeded characters
transliteration/normalization of special characters
stopwords
synonyms
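A few of these filters can be sketched as plain functions chained after a whitespace tokenizer. This only illustrates the idea and is not Lucene's filter API; the function names and the tiny stopword list are assumptions made for the example.

```python
import string

# Hypothetical stopword list, just big enough for the sample documents.
STOPWORDS = {"the", "and", "is", "to", "no", "over"}

def lowercase(tokens):
    """Case normalization."""
    return [t.lower() for t in tokens]

def strip_punctuation(tokens):
    """Remove leading/trailing ASCII punctuation; drop tokens that end up empty."""
    stripped = (t.strip(string.punctuation) for t in tokens)
    return [t for t in stripped if t]

def remove_stopwords(tokens):
    """Drop very common words (expects already-lowercased tokens)."""
    return [t for t in tokens if t not in STOPWORDS]

def analyze(text, filters):
    """Whitespace-tokenize, then apply each filter in order."""
    tokens = text.split()
    for apply_filter in filters:
        tokens = apply_filter(tokens)
    return tokens

chain = [lowercase, strip_punctuation, remove_stopwords]
print(analyze("All Daleks: Exterminate! Exterminate! EXTERMINATE!! EXTERMINATE!!!", chain))
# ['all', 'daleks', 'exterminate', 'exterminate', 'exterminate', 'exterminate']
print(analyze("The quick brown fox jumps over the lazy dog", chain))
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```

Order matters: the stopword filter here only works because lowercasing runs before it.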
Possibilities are endless, enjoy experimenting with them!