<ul><li>Unindexed information is like garbage</li><li>It has no structure</li><li>It comes in all kinds of shapes, sizes, and formats</li></ul>
<ul><li>This is what indexing does:</li><li>It puts data into a structured format that is easy to reach through search.</li></ul>
So what needs to be indexed and searched?
Various file formats: text files, HTML, PDF, MS Word, PPT
coming from various data sources: emails, CMS, file system, database, web pages
Indexing flow: data (documents) is fed to the Analyzer, the Analyzer passes tokenized text to the Index Writer, and the Index Writer produces the index files. The user sends a search query and receives search results.
Analyzer: receives the text that should be indexed and processes it by
<ul><li>removing stop words such as "a" or "the"</li><li>converting all text to lowercase letters for case-insensitive searching</li><li>stemming (a stemming algorithm reduces the words "fishing", "fished", "fish", and "fisher" to the root word "fish")</li></ul>
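The three analyzer steps above can be sketched in a few lines of Python. This is only an illustration, not how Lucene's analyzers are implemented: the stop-word list, the `SUFFIXES` tuple, and the `stem` and `analyze` helpers are all hypothetical, and the suffix stripping is a deliberately naive stand-in for a real stemming algorithm.

```python
import re

# Hypothetical stop-word list; real analyzers ship much longer ones.
STOP_WORDS = {"a", "an", "the", "of", "is", "are"}

# Naive suffix stripping for illustration only; not a real stemming algorithm.
SUFFIXES = ("ing", "ed", "er", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Lowercase the text, split it into tokens, drop stop words, stem the rest."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [stem(token) for token in tokens if token not in STOP_WORDS]


print(analyze("The fisher fished a fish"))
```

With this toy stemmer, "fishing", "fished", "fish", and "fisher" all collapse to the same token, which is the behavior the slide describes.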
Document 1: Coffee isn't my cup of tea.
Document 2: Chocolate, men, coffee - some things are better rich.
INDEX
coffee - 1, 2
cup - 1
tea - 1
chocolate - 2
men - 2
things - 2
better - 2
rich - 2
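An index of this kind (an inverted index) maps each term to the documents that contain it. The sketch below builds one for the two example documents. The `tokenize` and `build_index` helpers and the stop-word set are assumptions chosen so that only the terms shown on the slide are kept; no stemming is applied.

```python
import re
from collections import defaultdict

# Assumed stop words, chosen so only the slide's terms remain.
STOP_WORDS = {"isn't", "my", "of", "some", "are"}


def tokenize(text):
    """Lowercase the text and keep word tokens that are not stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [word for word in words if word not in STOP_WORDS]


def build_index(docs):
    """Map each term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


docs = {
    1: "Coffee isn't my cup of tea.",
    2: "Chocolate, men, coffee - some things are better rich.",
}
index = build_index(docs)

for term in sorted(index):
    print(term, "-", index[term])
```

Looking up a query term is then a dictionary access, which is why a search does not need to scan every document.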
<ul><li>Full-text search library</li><li>Open source</li><li>Documents in XML format</li><li>Can operate on its own or via Solr</li></ul>
Ways of storing the fields of a document:
<ul><li>Indexed: the field is searchable.</li><li>Stored: the field's content can be displayed in the search results, even if you choose not to make the field searchable. Example: the summary associated with a page.</li><li>Tokenized: the field is run through an Analyzer, which converts the content into a sequence of tokens.</li></ul>
<ul><li>Open source</li><li>Handles index and query requests to Lucene over HTTP using XML (also JSON)</li><li>Manages document add, update, and delete requests to Lucene</li><li>Straightforward schema and config files</li><li>Comprehensive HTML admin interfaces</li><li>Highly configurable</li></ul>
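Default Parameters
http://localhost:8983/solr/select?q=video&start=0&rows=2&fl=name,price
<table>
<tr><th>param</th><th>default</th><th>description</th></tr>
<tr><td>q</td><td></td><td>The query</td></tr>
<tr><td>start</td><td>0</td><td>Offset into the list of matches</td></tr>
<tr><td>rows</td><td>10</td><td>Number of documents to return</td></tr>
<tr><td>fl</td><td>*</td><td>Stored fields to return</td></tr>
<tr><td>qt</td><td>standard</td><td>Query type; maps to query handler</td></tr>
<tr><td>df</td><td>(schema)</td><td>Default field to search</td></tr>
</table>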
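As a rough illustration, the example URL above can be assembled from these parameters with Python's standard library. The `build_select_url` helper is hypothetical, its defaults simply mirror the table, and the base URL is the one from the slide; sending the request would additionally require a running Solr instance.

```python
from urllib.parse import urlencode

# Base URL taken from the slide's example; adjust for your own installation.
BASE_URL = "http://localhost:8983/solr/select"


def build_select_url(query, start=0, rows=10, fields="*"):
    """Build a select URL; the keyword defaults mirror the parameter table."""
    params = {"q": query, "start": start, "rows": rows, "fl": fields}
    # Keep "," and "*" unescaped so the fl value stays readable.
    return BASE_URL + "?" + urlencode(params, safe=",*")


url = build_select_url("video", rows=2, fields="name,price")
print(url)
```

Leaving out `rows` or `fields` falls back to the defaults from the table, so `build_select_url("video")` asks for the first 10 matches with all stored fields.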
[Figure: Solr architecture. Components shown: HTTP Request Servlet, Update Servlet, Admin Interface, Standard Request Handler, Disjunction Max Request Handler, Custom Request Handler, Update Handler, XML Response Writer, XML Update Interface, Solr Core (Config, Schema, Analysis, Caching, Concurrency, Replication), Lucene. Callouts: "Search requests hit here" and "New documents are added here".]