The Search Engine Index
IR tutorial series: Part 1
http://scienceforseo.blogspot.com
What is an index?
The word “index” can mean many things in computing, but in the case of search engines it can be defined as: a database where information is stored, after being collected, parsed and processed, to allow for quick retrieval. Cache-based engines store the index along with the corpus (the collection of documents). When something is added to the corpus, the index is updated.
“Index”
We call it that because it's exactly what we called it when it was one of these: [photo]
And that took its name from the index finger.
Photo from: http://www.homeschoolinthewoods.com
Why use an index?
If we didn't have an index, it would take too much time to search through the whole corpus to find documents that match our query. Creating an index makes the retrieval process faster and can improve accuracy. The search engine doesn't need to scan each document at query time to know what it's about: this saves work and makes the whole process faster.
Some things we need to think about
Indexing methods
The inverted index
It is an index which has terms as keys. Each term maps to the documents it appears in. The index is sorted by its keys and works well with Boolean operators (AND, OR, AND NOT). We find the documents by looking up the terms, rather than finding the terms by reading the documents: this is why we say it is inverted.
Diagram by http://developer.apple.com/
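As a minimal sketch of the idea, assuming a tiny made-up corpus and plain whitespace splitting (the function name and documents below are illustrative, not from the slides), terms can be mapped to sets of document ids, and the Boolean operators then become set operations:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids it appears in."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Hypothetical three-document corpus, for illustration only.
docs = {
    1: "the dog ran in the field",
    2: "the cat sat in the sun",
    3: "a dog and a cat",
}
index = build_inverted_index(docs)

# Boolean operators become set operations on the document sets.
dog_and_cat = index["dog"] & index["cat"]   # AND
dog_or_cat = index["dog"] | index["cat"]    # OR
dog_not_cat = index["dog"] - index["cat"]   # AND NOT

print(sorted(index))                        # the keys, in sorted order
print(dog_and_cat, dog_or_cat, dog_not_cat)
```

A real engine would keep the keys in a sorted on-disk structure instead of a Python dictionary; the dictionary just keeps the sketch short.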
Limitations
In its basic form, an inverted index can only tell us whether a word occurs in a particular document. It can't tell us how often the word occurs or where it is located in the document, and it can't rank the documents either. That information is very important because it helps the search engine determine how relevant a document is to a query.
So... we look at latent semantic indexing (LSI).
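To make the missing information concrete, here is a small sketch of what an index entry would need to hold in order to record frequency and location. The structure and names are assumptions for illustration: each term maps to a dictionary from document id to the list of word positions, so the length of that list is the term frequency.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [word positions]}; list length = term frequency."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index

# Hypothetical two-document corpus, for illustration only.
docs = {
    1: "the dog ran in the field",
    2: "the cat sat in the sun with the dog",
}
positional = build_positional_index(docs)

for doc_id, positions in positional["the"].items():
    print(doc_id, "frequency:", len(positions), "positions:", positions)
```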
LSI
“Semantic” = meaning
“Latent” = present but hidden
It is the analysis of the hidden meaning of words, based on how often they occur in documents. It can infer connections between words which aren't obvious: Computer – PC – Laptop => connected. It can put together documents that are not obviously related. It can do this because it creates a “latent semantic space”.
How does LSI work?
It uses lots of vectors and creates a “term-document matrix” from all the documents it has. Then 3 matrices are created from it using SVD (“singular value decomposition”). Of these 3 matrices, the 2nd is a diagonal matrix that contains the singular values of the original matrix. Sets of terms and documents are represented as d-dimensional vectors. The cosine of the angle between these vectors gives an easy-to-calculate similarity measure between any two sets of terms and/or documents.
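The steps above can be written down compactly. The symbols here are assumptions chosen for illustration (the slides name none): A is the term-document matrix, U and V are the first and third matrices, Σ is the diagonal matrix of singular values, and d is the number of dimensions kept.

```latex
\begin{align}
A &= U \Sigma V^{\mathsf{T}},
  & \Sigma &= \operatorname{diag}(\sigma_1, \dots, \sigma_r),
  \quad \sigma_1 \ge \dots \ge \sigma_r > 0 \\
A_d &= U_d \Sigma_d V_d^{\mathsf{T}}
  & &\text{(keep only the $d$ largest singular values)} \\
\cos\theta(x, y) &= \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}
  & &\text{(similarity of two $d$-dimensional vectors $x$, $y$)}
\end{align}
```

Under this notation, the rows of U_d Σ_d give the d-dimensional term vectors and the rows of V_d Σ_d give the document vectors.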
A quick sketch of LSI
Diagram: box of documents, lots of vectors, term-document matrix, then Matrix 1, Matrix 2 and Matrix 3.
Sets of terms and documents = d-dimensional vectors
There are however some big limitations to this method...
The resulting dimensions can be very difficult to interpret, which leads to mistakes.
It's unclear what the resulting similarities between terms really mean.
The input is a bag-of-words, so we don't have any text structure information.
A compound term (“bull-headed”) is treated as 2 terms.
Ambiguous terms create noise in the vector space.
There's no way to define the optimal dimensionality of the vector space.
SVD is computationally expensive, which is a problem for dynamic collections where the matrix keeps changing.
PLSI
“Probabilistic latent semantic indexing” is a better choice because:
It has a more robust statistical foundation and provides a proper generative data model.
It is fitted with the EM algorithm (expectation maximization) in a way that avoids over-fitting (a model too specific to noise): this makes it far more flexible.
It can deal with domain-specific synonymy and polysemous words.
What did all that mean?
“Generative data model” – a model used for randomly generating observable data from hidden parameters (HMMs are generative data models, for example).
“EM algorithm” – it finds the maximum likelihood estimate of parameters in a probabilistic model where the model depends on unobserved latent variables. It is good for machine learning and data clustering.
Synonymy – the synonym relation between words. Synonyms are 2 different words that mean the same thing.
Polysemous – describes a word that has multiple meanings or interpretations.
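For reference, the standard PLSI model and its EM updates can be written as follows. The notation is an assumption chosen for illustration: d is a document, w a word, z a latent topic, and n(d, w) the number of times w occurs in d.

```latex
\begin{align}
P(d, w) &= \sum_{z} P(z)\, P(d \mid z)\, P(w \mid z)
  && \text{(generative model)} \\
P(z \mid d, w) &= \frac{P(z)\, P(d \mid z)\, P(w \mid z)}
                       {\sum_{z'} P(z')\, P(d \mid z')\, P(w \mid z')}
  && \text{(E-step)} \\
P(w \mid z) &\propto \sum_{d} n(d, w)\, P(z \mid d, w)
  && \text{(M-step)} \\
P(d \mid z) &\propto \sum_{w} n(d, w)\, P(z \mid d, w) \\
P(z) &\propto \sum_{d} \sum_{w} n(d, w)\, P(z \mid d, w)
\end{align}
```

The E-step estimates the unobserved topic for each document-word pair; the M-step re-estimates the parameters from those estimates. The two steps are repeated until the likelihood stops improving.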
How does it work?
How is it different to LSI?
The order of the words is lost (but results are still good due to word co-occurrence).
Documents can be represented by numeric vectors in a space of words.
It retrieves topics.
Each query uses the cosine similarity metric to find the similarity between vectors.
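The cosine similarity metric mentioned above is simple to compute. A minimal sketch, assuming documents and queries have already been turned into word-count vectors over a shared vocabulary (the vocabulary and counts below are made up):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical vocabulary: ["computer", "laptop", "dog", "field"]
query = [1, 1, 0, 0]
doc_a = [2, 1, 0, 0]
doc_b = [0, 0, 1, 1]

sim_a = cosine_similarity(query, doc_a)
sim_b = cosine_similarity(query, doc_b)
print(sim_a, sim_b)
```

A value near 1 means the vectors point in almost the same direction; 0 means they share no terms.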
More indexing difficulties
It's easy for us to pick a document and classify it (well, most of the time), but search engines have other difficulties to overcome before even getting to the classification stage.
Tokenization
Machines don't understand sentences in text. They see everything as bytes. Consider: The dog ran in the field
We see 6 words. The machine sees 24 characters (chars).
The words found in a document are called “tokens”. Information is extracted from documents to be placed in the index. The tokens may be email addresses, words, URLs,... The part of speech, line number, sentence number, size and so on can be stored in the index.
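A rough sketch of a tokenizer along these lines, using Python's standard `re` module. The patterns are deliberately simplistic assumptions, not a complete definition of what counts as a URL or an email address, and the second sample text is made up:

```python
import re

# One alternative per token kind; earlier alternatives win.
TOKEN_PATTERN = re.compile(
    r"(?P<url>https?://\S+)"
    r"|(?P<email>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)"
    r"|(?P<word>\w+)"
)

def tokenize(text):
    """Return (token, kind, character offset) for each token found."""
    return [
        (match.group(), match.lastgroup, match.start())
        for match in TOKEN_PATTERN.finditer(text)
    ]

sentence = "The dog ran in the field"
tokens = tokenize(sentence)
print(len(sentence), "characters,", len(tokens), "tokens")

# Hypothetical text mixing several token kinds.
mixed = tokenize("Mail bob@example.com or see http://example.com/faq")
print([kind for _, kind, _ in mixed])
```

The offset stored with each token is the kind of extra information (location, and from it line or sentence number) that can be kept in the index.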
Section recognition
Before tokenization happens, all the major parts of a document are identified. Some documents are newsletters, others have a side navigation, some are reports... and the text can be displayed in columns. Machines will read this sequentially, though, and index the words sequentially as well. The difficulty is finding which view of the document is informative. Some engines will index an abstract representation of the document instead. Most engines don't, though. This is also why using JavaScript, for example, is avoided.
Formats
Documents come in all flavours on the web. There are documents in HTML, PDF, Excel, PowerPoint, and so many others. Before documents are analysed, they are stripped down and the formatting is extracted. They are "normalised". It's important for the search engine not to mistake "markup" information for content, or the index gets polluted.
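For HTML, the stripping step could look something like the sketch below, which uses Python's standard `html.parser` module. The class name and sample page are made up for illustration, and formats such as PDF would need their own extractors:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping tags and script/style bodies."""
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

# Hypothetical page, for illustration only.
page = (
    "<html><head><style>p {color: red}</style></head>"
    '<body><h1>Dogs</h1><p class="intro">The dog ran in the field</p></body></html>'
)
text = extract_text(page)
print(text)
```

Only the text would go on to tokenization; tag names, attributes such as `class="intro"`, and the style rules never reach the index.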
To conclude...
The indexing process of a search engine is very important because if this is wrong, everything is wrong. This is why “Spamdexing” is such an issue. There are a lot of very specialised areas of computing that focus their work on making it easier for machines to create an index. Don't let this short presentation fool you: it is a very big research issue. Natural language processing is used for rich text analysis, which helps identify what's going on so that the other computational elements can do their job.
Resources
The inverted index in detail: http://tinyurl.com/65hbfd
The seminal PLSI paper: http://tinyurl.com/54wd76
The seminal LSI paper: http://tinyurl.com/5e8v36
The semantic indexing project: http://knowledgesearch.org/
Boulder Uni on LSA: http://lsa.colorado.edu/
Apache Lucene: http://lucene.apache.org/java/docs/
Google test data ($150): http://tinyurl.com/62t4la
