The Search Engine Index http://scienceforseo.blogspot.com IR tutorial series: Part 1
What is an index? The word “index” can mean many things in computing, but in the case of search engines, it can be defined as: A database where information (after being collected, parsed and processed) is stored to allow for quick retrieval. Cache-based engines store the index along with the corpus (collection of documents). When something is added to the corpus, the index is updated.
“Index” We call it that because it's exactly what we called it when it was one of these: [photo from http://www.homeschoolinthewoods.com]. And that took its name from the index finger.
Why use an index? If we didn't have an index, it would take too much time to search through the whole corpus to find documents that matched our query. Creating an index makes the retrieval process faster and more accurate. The search engine doesn't need to scan each document at query time to know what it's about: the index takes up some extra storage, but it makes the whole process much faster.
Some things we need to think about <ul><li>We need to know whether new data is being added or if old data is being updated – we don't want duplicate data in our index, because it uses up space and is of no use. </li></ul><ul><li>We need to assign an ID to each document coming in. </li></ul><ul><li>How quickly can it find things? </li></ul><ul><li>How is all of this going to be stored? </li></ul><ul><li>How will we collect all of the data? </li></ul>
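A minimal sketch of the first two points (telling new, updated and duplicate documents apart, and assigning IDs). The `DocumentStore` class, its method names and the use of a SHA-256 content fingerprint are all assumptions made for illustration, not how any particular search engine does it:

```python
import hashlib


class DocumentStore:
    """Hypothetical helper: assigns document IDs and spots duplicates/updates."""

    def __init__(self):
        self.next_id = 1
        self.id_by_url = {}          # url -> document ID
        self.fingerprint_by_id = {}  # document ID -> content fingerprint

    def add(self, url, text):
        """Return (doc_id, status); status is 'new', 'updated' or 'duplicate'."""
        fingerprint = hashlib.sha256(text.encode("utf-8")).hexdigest()

        if url in self.id_by_url:
            doc_id = self.id_by_url[url]
            if self.fingerprint_by_id[doc_id] == fingerprint:
                return doc_id, "duplicate"  # same URL, same content
            self.fingerprint_by_id[doc_id] = fingerprint
            return doc_id, "updated"        # same URL, content changed

        for known_id, known_fp in self.fingerprint_by_id.items():
            if known_fp == fingerprint:
                return known_id, "duplicate"  # same content under another URL

        doc_id = self.next_id
        self.next_id += 1
        self.id_by_url[url] = doc_id
        self.fingerprint_by_id[doc_id] = fingerprint
        return doc_id, "new"


store = DocumentStore()
print(store.add("page-a", "The dog ran in the field"))   # (1, 'new')
print(store.add("page-a", "The dog ran in the field"))   # (1, 'duplicate')
print(store.add("page-a", "The dog ran in the woods"))   # (1, 'updated')
```

An exact fingerprint only catches byte-for-byte copies; near-duplicates would need something fuzzier, which is outside this sketch.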
Indexing methods <ul><li>Different methods exist to cope with indexing, but search engines tend to use the following: </li></ul><ul><li>An inverted index </li></ul><ul><li>A term document matrix </li></ul><ul><li>The most used type is the “inverted index”. This is where words are mapped to the documents they appear in. It can be thought of as a “sparse matrix”, because each word lists only the documents it occurs in, not every document in the corpus. </li></ul><ul><li>A “fully inverted index” maps words not only to the documents they appear in, but also to the location of each occurrence of the word within those documents. </li></ul>
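To make the difference between these structures concrete, here is a small sketch on a two-document toy corpus. The corpus, the variable names and the whitespace-only word splitting are assumptions for illustration:

```python
docs = {
    1: "the dog ran in the field",
    2: "the cat sat in the sun",
}

# Term document matrix: one row per term, one column per document,
# each cell holding how many times the term occurs in that document.
doc_ids = sorted(docs)
terms = sorted({word for text in docs.values() for word in text.split()})
matrix = {
    term: [docs[d].split().count(term) for d in doc_ids]
    for term in terms
}

# Fully inverted index: term -> {document ID -> position of each occurrence}.
positional_index = {}
for doc_id, text in docs.items():
    for position, word in enumerate(text.split()):
        positional_index.setdefault(word, {}).setdefault(doc_id, []).append(position)

print(matrix["the"])            # [2, 2]
print(matrix["dog"])            # [1, 0]
print(positional_index["the"])  # {1: [0, 4], 2: [0, 4]}
```

Most cells of the matrix are zero even in this tiny corpus, which is why the index only stores the entries that exist.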
The inverted index It is an index which has terms as keys. These map to the documents they appear in. The index is sorted by its keys and works well with Boolean operators (AND, OR, AND NOT). We find the documents by matching the terms, rather than finding the terms by reading the documents: this is why we say it is inverted. [Diagram by http://developer.apple.com/]
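A minimal sketch of an inverted index with Boolean queries, assuming a toy corpus and treating each term's entry as a plain Python set of document IDs (the function and variable names are made up for this example):

```python
def build_inverted_index(docs):
    """Map each term to the set of document IDs it appears in."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return dict(sorted(index.items()))  # keep the keys in sorted order


docs = {
    1: "the dog ran in the field",
    2: "the cat sat in the field",
    3: "a dog barked",
}
index = build_inverted_index(docs)


def postings(term):
    """Documents containing the term (empty set if the term is unknown)."""
    return index.get(term, set())


print(postings("dog") & postings("field"))  # AND     -> {1}
print(postings("dog") | postings("cat"))    # OR      -> {1, 2, 3}
print(postings("field") - postings("dog"))  # AND NOT -> {2}
```

Set intersection, union and difference line up directly with AND, OR and AND NOT, which is one reason this structure suits Boolean retrieval.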
Limitations A basic inverted index can only tell us if a word occurs in a particular document. It can't tell us how often it occurs or where it occurs in the document, and it can't rank those documents either. That information is very important because it helps the search engine determine how relevant a document is to a query. So... we look at latent semantic indexing (LSI).
LSI “Semantic” = meaning. “Latent” = present but hidden. It is the analysis of the hidden meaning of words and how often they occur in a document. It can infer meaning from words which isn't obvious: Computer – PC – Laptop => connected. It can put together documents that are not obviously related. It can do this because it creates a “latent semantic space”.
How does LSI work? It uses lots of vectors and creates a “term document matrix” from all the documents it has. Then 3 matrices are created using SVD (“singular value decomposition”). Of these 3 matrices, the 2nd is a diagonal matrix containing the singular values of the original matrix. Sets of documents are represented as d-dimensional vectors. Using the cosine of the angle between these vectors, there is now an easy-to-calculate similarity measure between any two sets of terms and/or documents.
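Written out, the decomposition described above looks roughly like this. The symbols are assumptions chosen for this sketch: A is the term document matrix, r its rank, k the number of dimensions we decide to keep, and x and y any two term or document vectors in the reduced space:

```latex
% SVD of the term document matrix: three matrices, the middle one diagonal
A = U \Sigma V^{T}, \qquad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r), \quad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0

% Keeping only the k largest singular values gives the latent semantic space
A_k = U_k \Sigma_k V_k^{T} \quad (k \le r)

% Similarity between two vectors x and y in that space
\cos\theta = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}
```

Dropping the smallest singular values is what merges terms that tend to occur together into shared dimensions.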
A quick sketch of LSI Box of documents -> Lots of vectors -> Term document matrix -> Matrix 1, Matrix 2, Matrix 3 -> Sets of terms and documents = d-dimensional vectors. There are however some big limitations to this method...
The resulting dimensions can be very difficult to interpret, which leads to mistakes. It's unclear what the resulting similarities between terms really mean. The input is a bag-of-words, so we don't have any text structure information. A compound term (“bull-headed”) is treated as 2 terms. Ambiguous terms create noise in the vector space. There's no principled way to define the optimal dimensionality of the vector space. SVD is computationally expensive, which is a problem in dynamic collections where documents keep being added or changed.
PLSI “Probabilistic latent semantic indexing” is a better choice because: It has a more robust statistical foundation and provides a proper generative data model. It uses the EM algorithm (expectation maximization), in a tempered form that helps avoid over-fitting (a model too specific to noise in the training data); this makes it far more flexible. It can deal with domain-specific synonymy and polysemous words.
What did all that mean? “Generative data model”: a model used for randomly generating observable data, given some hidden parameters (HMMs are generative data models, for example). “EM algorithm”: it finds the maximum likelihood estimate of parameters in a probabilistic model (where the model depends on unobserved latent variables); good for machine learning and data clustering. Synonymy: the synonym relation between words. Synonyms are 2 different words that mean the same thing. Polysemous: describes a word that has multiple meanings or interpretations.
How does it work? <ul><li>It analyses co-occurrences of words and documents </li></ul><ul><li>It tells us what terms give us a topic (term by topic matrix; a topic is also called an "aspect"), and which topics are in which documents (topic by document matrix) </li></ul><ul><li>It uses aspects to create queries </li></ul><ul><li>LSI implicitly assumes a Gaussian model (normal distribution), which can generate negative values, and you can't have a negative number of words in a document. PLSI uses a multinomial model (probability distribution), which works better. </li></ul>
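For reference, the aspect model behind these bullets is usually written as below, following the standard formulation in the PLSI literature. The notation is an assumption of this sketch: d is a document, w a word, z an aspect (topic), and n(d, w) the number of times w occurs in d:

```latex
% Each word/document co-occurrence is explained by a hidden aspect z
P(d, w) = \sum_{z} P(z)\, P(d \mid z)\, P(w \mid z)

% E-step: how likely is aspect z, given that w was seen in d?
P(z \mid d, w) =
  \frac{P(z)\, P(d \mid z)\, P(w \mid z)}
       {\sum_{z'} P(z')\, P(d \mid z')\, P(w \mid z')}

% M-step: re-estimate the parameters from the observed counts
P(w \mid z) \propto \sum_{d} n(d, w)\, P(z \mid d, w)
P(d \mid z) \propto \sum_{w} n(d, w)\, P(z \mid d, w)
P(z) \propto \sum_{d} \sum_{w} n(d, w)\, P(z \mid d, w)
```

All the quantities are probabilities, so none of them can go negative, which is the practical advantage over the Gaussian assumption mentioned in the last bullet.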
How is it different to LSI? The order of the words is lost (but results are still good due to word co-occurrence). Documents can be represented by numeric vectors in a space of words. It retrieves topics. Each query uses the cosine similarity metric to find the similarity between vectors.
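A small sketch of the vector and cosine part, assuming plain word counts as the vector space (not a latent topic space) and whitespace-only word splitting. The function names are made up for this example:

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count word occurrences; word order is deliberately thrown away."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query = bag_of_words("dog field")
doc_a = bag_of_words("the dog ran in the field")
doc_b = bag_of_words("field the in ran dog the")  # same words, shuffled
doc_c = bag_of_words("the cat sat in the sun")

print(doc_a == doc_b)  # True: the order of the words is lost
print(cosine_similarity(query, doc_a) > cosine_similarity(query, doc_c))  # True
```

The query is treated as just another (very short) document, so ranking comes down to sorting documents by their cosine with the query vector.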
More indexing difficulties It's easy for us to pick a document and classify it (well, most of the time), but search engines have other difficulties to overcome before even getting to the classification stage.
Tokenization Machines don't understand sentences in text. They see everything in bytes. Consider: “The dog ran in the field”. We see 6 words. The machine sees 24 characters (chars). The words found in a document are called “tokens”. Information is extracted from documents to be placed in the index. The tokens may be email addresses, words, URLs, and so on. The part of speech, line number, sentence number, size and so on can be stored in the index.
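A rough sketch of a tokenizer along these lines. The regular expressions are simplified assumptions (real email and URL syntax is much messier, and a URL followed directly by punctuation would swallow it), and the function name and the shape of the returned tuple are made up for this example:

```python
import re

# Order matters: try the more specific patterns (URL, email) before plain words.
TOKEN_PATTERN = re.compile(
    r"(?P<url>https?://\S+)"
    r"|(?P<email>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)"
    r"|(?P<word>\w+(?:[-']\w+)*)"
)


def tokenize(text):
    """Yield (token, kind, line_number, char_offset) for each token found."""
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in TOKEN_PATTERN.finditer(line):
            yield match.group(), match.lastgroup, line_number, match.start()


sentence = "The dog ran in the field"
tokens = list(tokenize(sentence))
print(len(sentence))  # 24 characters
print(len(tokens))    # 6 tokens
print(tokens[1])      # ('dog', 'word', 1, 4)
```

The token kind, line number and offset are the sort of extra information that can be stored in the index next to each token.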
Formats Documents come in all flavours on the web. There are documents in HTML, PDF, Excel, PowerPoint, and so many others. Before documents are analysed, they are stripped down and the formatting is removed. They are "normalised". It's important for the search engine not to mistake "markup" information for content, or the index gets polluted.
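For the HTML case, the stripping step might look something like this sketch, which uses the `html.parser` module from Python's standard library. The `TextExtractor` class and `html_to_text` function are hypothetical names, and skipping only `script` and `style` is a simplifying assumption:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping markup and script/style contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Dogs</h1><p>The dog ran in the <b>field</b></p></body></html>"
)
print(html_to_text(page))  # Dogs The dog ran in the field
```

Without the skip list, the CSS rule would end up in the index as if it were content, which is exactly the kind of pollution described above.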
To conclude... The indexing process of a search engine is really very important because if this is wrong, everything is wrong. This is why “Spamdexing” is such an issue. There are a lot of very specialised areas of computing that focus on making it easier for machines to create an index. Don't let this short presentation fool you: it is a very big research issue. Natural language processing is used for rich text analysis, which helps identify what's going on so that the other computational elements can do their job.
Resources
The inverted index in detail: http://tinyurl.com/65hbfd
The seminal PLSI paper: http://tinyurl.com/54wd76
The seminal LSI paper: http://tinyurl.com/5e8v36
The semantic indexing project: http://knowledgesearch.org/
Boulder Uni on LSA: http://lsa.colorado.edu/
Apache Lucene: http://lucene.apache.org/java/docs/
Google test data ($150): http://tinyurl.com/62t4la