SlideShare a Scribd company logo
1 of 22
The Search Engine Index http://scienceforseo.blogspot.com IR tutorial series: Part 1
What is an index? The word “index” can mean many things in computing, but in the case of search engines, it can be defined as: A database where information (after being collected, parsed and processed) is stored to allow for quick retrieval. Cache-based engines store the index along with the corpus (collection of documents).  When something is added to the corpus, the index is updated.
“Index” We call it that because it's exactly what we called it when it was one of these: And that took its name from the index finger Photo from: http://www.homeschoolinthewoods.com
Why use an index? If we didn't have an index, it would take too much time to search through the whole corpus to find documents that matched our query.  Creating an index means that the retrieval process is faster and the accuracy is better. The search engine doesn't need to scan each document to know what it's about – this saves on storage and makes the whole process faster.
Some things we need to think about ,[object Object],[object Object],[object Object],[object Object],[object Object]
Indexing methods ,[object Object],[object Object],[object Object],[object Object],[object Object]
The inverted index It is an index which has terms marked as keys.  These map to the document they appear in.  The index is sorted by its keys and works well with Boolean operators (AND,OR, AND NOT) We find the documents by matching the terms – this is why we say it is inverted. Diagram by http://developer.apple.com/
Limitations It can only tell us if a word occurs in a particular document. It can't tell us how often it occurs or its location in the document, it also can't rank those documents either. That information is very important because it helps the search engine determine how relevant to a query a document is. so... we look at  latent semantic indexing (LSI)
LSI “ Semantic” = meaning “ Latent” = present but hidden It is the analysis of the hidden meaning of words and how often they occur in a document. It can infer meaning from words which isn't obvious: Computer – PC – Laptop => connected It can put together documents that are not obviously created. It can do this because it creates a “latent semantic space”
How does LSI work? It uses lots of vectors and creates a “term document matrix” from all the documents it has. Then 3 matrices are created using SVD (“singular value decomposition”) Of these 3 vectors, the 2 nd  contains the singular values of the original matrix in a diagonal matrix Sets of documents are represented as d-dimensional vectors Using the cosine of the angle between these vectors, there is  now an easy-to-calculate similarity measure between any two sets of terms and/or documents.
A quick sketch of LSI Sets of terms and documents = d-dimensional vectors  There are however some big limitations to this method.... Term document  matrix Box of documents Lots of vectors Matrix 1 Matrix 2 Matrix 3
The resulting dimensions can be very difficult to interpret so there are mistakes.  It's unclear what the resulting similarities between terms really mean.  The input is a bag-of-words so we don't have any text structure information. A compound term (“bull-headed”) is treated as 2 terms. Ambiguous terms create noise in the vector space There's no way to define the optimal dimensionality of the vector space There's a time complexity for SVD in dynamic collections
PLSI “ Probabilistic latent semantic indexing” is a better choice because: It has a more robust statistical foundation and provides a proper generative data model It uses the EM algorithm (Expectation maximization to avoid over-fitting (nodes too specific to noise)) - this makes it far more flexible It can deal with domain specific synonymy and  polysemous  words
What did all that mean? “ Generative data model” -  It's used for randomly generating observed data from unknown parameters (HMMs are generative data models for example) “ EM algorithm” - it finds the maximum likelihood estimate of parameters in a probabilistic model (where the model depends on unobserved latent variables) – good for machine learning and data clustering. Synonymy – It's the synonym relation between words.  A synonym is when 2 different words mean the same thing. Polysemous – a word that has multiple meanings or interpretations
How does it work? ,[object Object],[object Object],[object Object],[object Object]
How is it different to LSI? The order of the words is lost (but results are still good due to word co-occurrence) Documents can be represented by numeric vectors in a space of words It retrieves topics Each query uses the cosine similarity metric to find the similarity between vectors.
More indexing difficulties It's easy for us to pick a document and classify it, well most of the time, but search engines have other difficulties to over come before even getting to the classification stage.
Tokenization Machines don't understand sentences in text. They see everything in bytes. Consider: The dog ran in the field We see 6 words. Machine sees 24 characters (chars) The words found in a document are called “tokens”.  Information is extracted from documents to be placed in the index.  The tokens may be email addresses, words, URLs,... The Part-Of-Speech, line number, sentence number, size and so on can be stored in the index.
Section recognition Before tokenization happens, all the major parts of a document are identified.  Some documents are newsletters other have a side navigation, some are reports...and the text can be displayed in columns.  Machines will read this sequentially though and index the word sequentially as well. The difficulty is finding which view of the document is informative. Some engines will index an abstract representation of the document instead.  Most engines don't though. This is also why using JavaScript for example is avoided.
Formats Documents come in all flavours on the web.  There are documents in HTML, PDF, EXCEL, Powerpoint, and so many others. Before documents are analysed, they are stripped down and the formatting extracted.  They are "normalised". It's important for the search engine to not misread "markup" information for content or the index gets polluted.
To conclude... The indexing process of a search engine is really very important because if this is wrong, everything is wrong.  This is why “Spamdexing” is such an issue. There are a lot of very specialised areas of computing who focus their work on making it easier for machines to create an index.  Don't let this short presentation fool you, it is a very very big research issue.  Natural language processing is used for rich text analysis, which helps identify what's going on so that the other computational elements can do their job.
Resources The inverted index in detail  http://tinyurl.com/65hbfd   The seminal PLSI paper  http://tinyurl.com/54wd76 The seminal LSI paper  http://tinyurl.com/5e8v36 The semantic indexing project  http://knowledgesearch.org/ Boulder Uni on LSA  http://lsa.colorado.edu/ Apache Lucene  http://lucene.apache.org/java/docs/ Google test data ($150)  http://tinyurl.com/62t4la

More Related Content

What's hot

Introducing Amazon Aurora with PostgreSQL Compatibility - AWS Online Tech Talks
Introducing Amazon Aurora with PostgreSQL Compatibility - AWS Online Tech TalksIntroducing Amazon Aurora with PostgreSQL Compatibility - AWS Online Tech Talks
Introducing Amazon Aurora with PostgreSQL Compatibility - AWS Online Tech TalksAmazon Web Services
 
Information Retrieval Models
Information Retrieval ModelsInformation Retrieval Models
Information Retrieval ModelsNisha Arankandath
 
AWS Redshift Analyzeの必要性とvacuumの落とし穴
AWS Redshift Analyzeの必要性とvacuumの落とし穴AWS Redshift Analyzeの必要性とvacuumの落とし穴
AWS Redshift Analyzeの必要性とvacuumの落とし穴Moto Fukao
 
코리안리 - 데이터 분석 플랫폼 구축 여정, 그 시작과 과제 - 발표자: 김석기 그룹장, 데이터비즈니스센터, 메가존클라우드 ::: AWS ...
코리안리 - 데이터 분석 플랫폼 구축 여정, 그 시작과 과제 - 발표자: 김석기 그룹장, 데이터비즈니스센터, 메가존클라우드 ::: AWS ...코리안리 - 데이터 분석 플랫폼 구축 여정, 그 시작과 과제 - 발표자: 김석기 그룹장, 데이터비즈니스센터, 메가존클라우드 ::: AWS ...
코리안리 - 데이터 분석 플랫폼 구축 여정, 그 시작과 과제 - 발표자: 김석기 그룹장, 데이터비즈니스센터, 메가존클라우드 ::: AWS ...Amazon Web Services Korea
 
20210127 AWS Black Belt Online Seminar Amazon Redshift 運用管理
20210127 AWS Black Belt Online Seminar Amazon Redshift 運用管理20210127 AWS Black Belt Online Seminar Amazon Redshift 運用管理
20210127 AWS Black Belt Online Seminar Amazon Redshift 運用管理Amazon Web Services Japan
 
클라우드 환경으로 데이터베이스 이전하기 - 강민석, AWS SR. Database SA
클라우드 환경으로 데이터베이스 이전하기 - 강민석, AWS SR. Database SA클라우드 환경으로 데이터베이스 이전하기 - 강민석, AWS SR. Database SA
클라우드 환경으로 데이터베이스 이전하기 - 강민석, AWS SR. Database SAAmazon Web Services Korea
 
Amazon DynamoDB Under the Hood: How We Built a Hyper-Scale Database (DAT321) ...
Amazon DynamoDB Under the Hood: How We Built a Hyper-Scale Database (DAT321) ...Amazon DynamoDB Under the Hood: How We Built a Hyper-Scale Database (DAT321) ...
Amazon DynamoDB Under the Hood: How We Built a Hyper-Scale Database (DAT321) ...Amazon Web Services
 
CS6007 information retrieval - 5 units notes
CS6007   information retrieval - 5 units notesCS6007   information retrieval - 5 units notes
CS6007 information retrieval - 5 units notesAnandh Arumugakan
 
Amazon S3 고급 활용 기법 - AWS Summit Seoul 2017
Amazon S3 고급 활용 기법  - AWS Summit Seoul 2017Amazon S3 고급 활용 기법  - AWS Summit Seoul 2017
Amazon S3 고급 활용 기법 - AWS Summit Seoul 2017Amazon Web Services Korea
 
Phrase based Indexing and Information Retrieval
Phrase based Indexing and Information RetrievalPhrase based Indexing and Information Retrieval
Phrase based Indexing and Information RetrievalBala Abirami
 
Guidelines for indexes and related information retrieval devices
Guidelines for indexes and related information retrieval devicesGuidelines for indexes and related information retrieval devices
Guidelines for indexes and related information retrieval devicesLiziel Isidoro
 
Knowledge Representation, Semantic Web
Knowledge Representation, Semantic WebKnowledge Representation, Semantic Web
Knowledge Representation, Semantic WebSerendipity Seraph
 
CloudWatch 성능 모니터링과 신속한 대응을 위한 노하우 - 박선용 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
CloudWatch 성능 모니터링과 신속한 대응을 위한 노하우 - 박선용 솔루션즈 아키텍트:: AWS Cloud Track 3 GamingCloudWatch 성능 모니터링과 신속한 대응을 위한 노하우 - 박선용 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
CloudWatch 성능 모니터링과 신속한 대응을 위한 노하우 - 박선용 솔루션즈 아키텍트:: AWS Cloud Track 3 GamingAmazon Web Services Korea
 
オンプレミスRDBMSをAWSへ移行する手法
オンプレミスRDBMSをAWSへ移行する手法オンプレミスRDBMSをAWSへ移行する手法
オンプレミスRDBMSをAWSへ移行する手法Amazon Web Services Japan
 
효과적인 NoSQL (Elasticahe / DynamoDB) 디자인 및 활용 방안 (최유정 & 최홍식, AWS 솔루션즈 아키텍트) :: ...
효과적인 NoSQL (Elasticahe / DynamoDB) 디자인 및 활용 방안 (최유정 & 최홍식, AWS 솔루션즈 아키텍트) :: ...효과적인 NoSQL (Elasticahe / DynamoDB) 디자인 및 활용 방안 (최유정 & 최홍식, AWS 솔루션즈 아키텍트) :: ...
효과적인 NoSQL (Elasticahe / DynamoDB) 디자인 및 활용 방안 (최유정 & 최홍식, AWS 솔루션즈 아키텍트) :: ...Amazon Web Services Korea
 
AWS Black Belt Online Seminar 2016 Amazon VPC
AWS Black Belt Online Seminar 2016 Amazon VPCAWS Black Belt Online Seminar 2016 Amazon VPC
AWS Black Belt Online Seminar 2016 Amazon VPCAmazon Web Services Japan
 
Amazon OpenSearch - Use Cases, Security/Observability, Serverless and Enhance...
Amazon OpenSearch - Use Cases, Security/Observability, Serverless and Enhance...Amazon OpenSearch - Use Cases, Security/Observability, Serverless and Enhance...
Amazon OpenSearch - Use Cases, Security/Observability, Serverless and Enhance...Amazon Web Services Korea
 
機密データとSaaSは共存しうるのか!?セキュリティー重視のユーザー層を取り込む為のネットワーク通信のアプローチ
機密データとSaaSは共存しうるのか!?セキュリティー重視のユーザー層を取り込む為のネットワーク通信のアプローチ機密データとSaaSは共存しうるのか!?セキュリティー重視のユーザー層を取り込む為のネットワーク通信のアプローチ
機密データとSaaSは共存しうるのか!?セキュリティー重視のユーザー層を取り込む為のネットワーク通信のアプローチAmazon Web Services Japan
 

What's hot (20)

Introducing Amazon Aurora with PostgreSQL Compatibility - AWS Online Tech Talks
Introducing Amazon Aurora with PostgreSQL Compatibility - AWS Online Tech TalksIntroducing Amazon Aurora with PostgreSQL Compatibility - AWS Online Tech Talks
Introducing Amazon Aurora with PostgreSQL Compatibility - AWS Online Tech Talks
 
Information Retrieval Models
Information Retrieval ModelsInformation Retrieval Models
Information Retrieval Models
 
AWS Redshift Analyzeの必要性とvacuumの落とし穴
AWS Redshift Analyzeの必要性とvacuumの落とし穴AWS Redshift Analyzeの必要性とvacuumの落とし穴
AWS Redshift Analyzeの必要性とvacuumの落とし穴
 
코리안리 - 데이터 분석 플랫폼 구축 여정, 그 시작과 과제 - 발표자: 김석기 그룹장, 데이터비즈니스센터, 메가존클라우드 ::: AWS ...
코리안리 - 데이터 분석 플랫폼 구축 여정, 그 시작과 과제 - 발표자: 김석기 그룹장, 데이터비즈니스센터, 메가존클라우드 ::: AWS ...코리안리 - 데이터 분석 플랫폼 구축 여정, 그 시작과 과제 - 발표자: 김석기 그룹장, 데이터비즈니스센터, 메가존클라우드 ::: AWS ...
코리안리 - 데이터 분석 플랫폼 구축 여정, 그 시작과 과제 - 발표자: 김석기 그룹장, 데이터비즈니스센터, 메가존클라우드 ::: AWS ...
 
AWS Black Belt - AWS Glue
AWS Black Belt - AWS GlueAWS Black Belt - AWS Glue
AWS Black Belt - AWS Glue
 
20210127 AWS Black Belt Online Seminar Amazon Redshift 運用管理
20210127 AWS Black Belt Online Seminar Amazon Redshift 運用管理20210127 AWS Black Belt Online Seminar Amazon Redshift 運用管理
20210127 AWS Black Belt Online Seminar Amazon Redshift 運用管理
 
클라우드 환경으로 데이터베이스 이전하기 - 강민석, AWS SR. Database SA
클라우드 환경으로 데이터베이스 이전하기 - 강민석, AWS SR. Database SA클라우드 환경으로 데이터베이스 이전하기 - 강민석, AWS SR. Database SA
클라우드 환경으로 데이터베이스 이전하기 - 강민석, AWS SR. Database SA
 
What is Document Indexing? A tutorial for intelligent data capture.
What is Document Indexing? A tutorial for intelligent data capture.What is Document Indexing? A tutorial for intelligent data capture.
What is Document Indexing? A tutorial for intelligent data capture.
 
Amazon DynamoDB Under the Hood: How We Built a Hyper-Scale Database (DAT321) ...
Amazon DynamoDB Under the Hood: How We Built a Hyper-Scale Database (DAT321) ...Amazon DynamoDB Under the Hood: How We Built a Hyper-Scale Database (DAT321) ...
Amazon DynamoDB Under the Hood: How We Built a Hyper-Scale Database (DAT321) ...
 
CS6007 information retrieval - 5 units notes
CS6007   information retrieval - 5 units notesCS6007   information retrieval - 5 units notes
CS6007 information retrieval - 5 units notes
 
Amazon S3 고급 활용 기법 - AWS Summit Seoul 2017
Amazon S3 고급 활용 기법  - AWS Summit Seoul 2017Amazon S3 고급 활용 기법  - AWS Summit Seoul 2017
Amazon S3 고급 활용 기법 - AWS Summit Seoul 2017
 
Phrase based Indexing and Information Retrieval
Phrase based Indexing and Information RetrievalPhrase based Indexing and Information Retrieval
Phrase based Indexing and Information Retrieval
 
Guidelines for indexes and related information retrieval devices
Guidelines for indexes and related information retrieval devicesGuidelines for indexes and related information retrieval devices
Guidelines for indexes and related information retrieval devices
 
Knowledge Representation, Semantic Web
Knowledge Representation, Semantic WebKnowledge Representation, Semantic Web
Knowledge Representation, Semantic Web
 
CloudWatch 성능 모니터링과 신속한 대응을 위한 노하우 - 박선용 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
CloudWatch 성능 모니터링과 신속한 대응을 위한 노하우 - 박선용 솔루션즈 아키텍트:: AWS Cloud Track 3 GamingCloudWatch 성능 모니터링과 신속한 대응을 위한 노하우 - 박선용 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
CloudWatch 성능 모니터링과 신속한 대응을 위한 노하우 - 박선용 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
 
オンプレミスRDBMSをAWSへ移行する手法
オンプレミスRDBMSをAWSへ移行する手法オンプレミスRDBMSをAWSへ移行する手法
オンプレミスRDBMSをAWSへ移行する手法
 
효과적인 NoSQL (Elasticahe / DynamoDB) 디자인 및 활용 방안 (최유정 & 최홍식, AWS 솔루션즈 아키텍트) :: ...
효과적인 NoSQL (Elasticahe / DynamoDB) 디자인 및 활용 방안 (최유정 & 최홍식, AWS 솔루션즈 아키텍트) :: ...효과적인 NoSQL (Elasticahe / DynamoDB) 디자인 및 활용 방안 (최유정 & 최홍식, AWS 솔루션즈 아키텍트) :: ...
효과적인 NoSQL (Elasticahe / DynamoDB) 디자인 및 활용 방안 (최유정 & 최홍식, AWS 솔루션즈 아키텍트) :: ...
 
AWS Black Belt Online Seminar 2016 Amazon VPC
AWS Black Belt Online Seminar 2016 Amazon VPCAWS Black Belt Online Seminar 2016 Amazon VPC
AWS Black Belt Online Seminar 2016 Amazon VPC
 
Amazon OpenSearch - Use Cases, Security/Observability, Serverless and Enhance...
Amazon OpenSearch - Use Cases, Security/Observability, Serverless and Enhance...Amazon OpenSearch - Use Cases, Security/Observability, Serverless and Enhance...
Amazon OpenSearch - Use Cases, Security/Observability, Serverless and Enhance...
 
機密データとSaaSは共存しうるのか!?セキュリティー重視のユーザー層を取り込む為のネットワーク通信のアプローチ
機密データとSaaSは共存しうるのか!?セキュリティー重視のユーザー層を取り込む為のネットワーク通信のアプローチ機密データとSaaSは共存しうるのか!?セキュリティー重視のユーザー層を取り込む為のネットワーク通信のアプローチ
機密データとSaaSは共存しうるのか!?セキュリティー重視のユーザー層を取り込む為のネットワーク通信のアプローチ
 

Viewers also liked

Search Engine Spam Index - Types of Link Spam & Content Spam
Search Engine Spam Index - Types of Link Spam & Content SpamSearch Engine Spam Index - Types of Link Spam & Content Spam
Search Engine Spam Index - Types of Link Spam & Content Spamjagadish thaker
 
Optical Mark Recognition
Optical Mark RecognitionOptical Mark Recognition
Optical Mark RecognitionHimanshu Popli
 
Cybercrime And Computer Misuse Cases
Cybercrime And Computer Misuse CasesCybercrime And Computer Misuse Cases
Cybercrime And Computer Misuse CasesAshesh R
 
Identity Theft Presentation
Identity Theft PresentationIdentity Theft Presentation
Identity Theft PresentationRandall Chesnutt
 
Mac281 Open Source software
Mac281 Open Source softwareMac281 Open Source software
Mac281 Open Source softwareRob Jewitt
 
Search engines and its types
Search engines and its typesSearch engines and its types
Search engines and its typesNagarjuna Kalluru
 
Port mann bridge modification
Port mann bridge modificationPort mann bridge modification
Port mann bridge modificationjacobkwack
 
Presentation search strategy
Presentation   search strategyPresentation   search strategy
Presentation search strategyjmunks
 
Richard kwock jsm 2012 poster
Richard kwock jsm 2012 posterRichard kwock jsm 2012 poster
Richard kwock jsm 2012 posterAjay Ohri
 
From KWIC to Enterprise Search - M G Lindquist
From KWIC to Enterprise Search - M G LindquistFrom KWIC to Enterprise Search - M G Lindquist
From KWIC to Enterprise Search - M G Lindquistmglindquist
 
Keyword Searching: Advanced Techniques
Keyword Searching: Advanced TechniquesKeyword Searching: Advanced Techniques
Keyword Searching: Advanced TechniquesKris Jacobson
 
Advanced keyword research
Advanced keyword researchAdvanced keyword research
Advanced keyword researchJono Alderson
 
Institutional Repositories
Institutional RepositoriesInstitutional Repositories
Institutional RepositoriesSarika Sawant
 

Viewers also liked (20)

Search Engine Spam Index - Types of Link Spam & Content Spam
Search Engine Spam Index - Types of Link Spam & Content SpamSearch Engine Spam Index - Types of Link Spam & Content Spam
Search Engine Spam Index - Types of Link Spam & Content Spam
 
Optical Mark Recognition
Optical Mark RecognitionOptical Mark Recognition
Optical Mark Recognition
 
Cybercrime And Computer Misuse Cases
Cybercrime And Computer Misuse CasesCybercrime And Computer Misuse Cases
Cybercrime And Computer Misuse Cases
 
Identity Theft Presentation
Identity Theft PresentationIdentity Theft Presentation
Identity Theft Presentation
 
Mac281 Open Source software
Mac281 Open Source softwareMac281 Open Source software
Mac281 Open Source software
 
Cyber Terrorism
Cyber TerrorismCyber Terrorism
Cyber Terrorism
 
Parts of cpu
Parts of cpuParts of cpu
Parts of cpu
 
Search engines and its types
Search engines and its typesSearch engines and its types
Search engines and its types
 
Types of Search Engines
Types of Search EnginesTypes of Search Engines
Types of Search Engines
 
Port mann bridge modification
Port mann bridge modificationPort mann bridge modification
Port mann bridge modification
 
Presentation search strategy
Presentation   search strategyPresentation   search strategy
Presentation search strategy
 
Richard kwock jsm 2012 poster
Richard kwock jsm 2012 posterRichard kwock jsm 2012 poster
Richard kwock jsm 2012 poster
 
POPSI
POPSIPOPSI
POPSI
 
From KWIC to Enterprise Search - M G Lindquist
From KWIC to Enterprise Search - M G LindquistFrom KWIC to Enterprise Search - M G Lindquist
From KWIC to Enterprise Search - M G Lindquist
 
Keyword Searching: Advanced Techniques
Keyword Searching: Advanced TechniquesKeyword Searching: Advanced Techniques
Keyword Searching: Advanced Techniques
 
3rd Thesaurus
3rd Thesaurus3rd Thesaurus
3rd Thesaurus
 
Lawrence kwockresume1
Lawrence kwockresume1Lawrence kwockresume1
Lawrence kwockresume1
 
Advanced keyword research
Advanced keyword researchAdvanced keyword research
Advanced keyword research
 
Searching techniques
Searching techniquesSearching techniques
Searching techniques
 
Institutional Repositories
Institutional RepositoriesInstitutional Repositories
Institutional Repositories
 

Similar to The search engine index

Elasticsearch and Spark
Elasticsearch and SparkElasticsearch and Spark
Elasticsearch and SparkAudible, Inc.
 
Demystifying analytics in e discovery white paper 06-30-14
Demystifying analytics in e discovery   white paper 06-30-14Demystifying analytics in e discovery   white paper 06-30-14
Demystifying analytics in e discovery white paper 06-30-14Steven Toole
 
IRJET - BOT Virtual Guide
IRJET -  	  BOT Virtual GuideIRJET -  	  BOT Virtual Guide
IRJET - BOT Virtual GuideIRJET Journal
 
Tovek Presentation by Livio Costantini
Tovek Presentation by Livio CostantiniTovek Presentation by Livio Costantini
Tovek Presentation by Livio Costantinimaxfalc
 
Topic detecton by clustering and text mining
Topic detecton by clustering and text miningTopic detecton by clustering and text mining
Topic detecton by clustering and text miningIRJET Journal
 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsDerek Kane
 
Conceptual foundations of text mining and preprocessing steps nfaoui el_habib
Conceptual foundations of text mining and preprocessing steps nfaoui el_habibConceptual foundations of text mining and preprocessing steps nfaoui el_habib
Conceptual foundations of text mining and preprocessing steps nfaoui el_habibEl Habib NFAOUI
 
G04124041046
G04124041046G04124041046
G04124041046IOSR-JEN
 
professional fuzzy type-ahead rummage around in xml type-ahead search techni...
professional fuzzy type-ahead rummage around in xml  type-ahead search techni...professional fuzzy type-ahead rummage around in xml  type-ahead search techni...
professional fuzzy type-ahead rummage around in xml type-ahead search techni...Kumar Goud
 
USING GOOGLE’S KEYWORD RELATION IN MULTIDOMAIN DOCUMENT CLASSIFICATION
USING GOOGLE’S KEYWORD RELATION IN MULTIDOMAIN DOCUMENT CLASSIFICATIONUSING GOOGLE’S KEYWORD RELATION IN MULTIDOMAIN DOCUMENT CLASSIFICATION
USING GOOGLE’S KEYWORD RELATION IN MULTIDOMAIN DOCUMENT CLASSIFICATIONIJDKP
 
Searching and Analyzing Qualitative Data on Personal Computer
Searching and Analyzing Qualitative Data on Personal ComputerSearching and Analyzing Qualitative Data on Personal Computer
Searching and Analyzing Qualitative Data on Personal ComputerIOSR Journals
 
The need for sophistication in modern search engine implementations
The need for sophistication in modern search engine implementationsThe need for sophistication in modern search engine implementations
The need for sophistication in modern search engine implementationsBen DeMott
 
XXIX Charleston 2009 Silverchair Kerner
XXIX Charleston 2009 Silverchair KernerXXIX Charleston 2009 Silverchair Kerner
XXIX Charleston 2009 Silverchair KernerDarrell W. Gunter
 
leewayhertz.com-What role do embeddings play in a ChatGPT-like model.pdf
leewayhertz.com-What role do embeddings play in a ChatGPT-like model.pdfleewayhertz.com-What role do embeddings play in a ChatGPT-like model.pdf
leewayhertz.com-What role do embeddings play in a ChatGPT-like model.pdfrobertsamuel23
 
Content Analyst - Conceptualizing LSI Based Text Analytics White Paper
Content Analyst - Conceptualizing LSI Based Text Analytics White PaperContent Analyst - Conceptualizing LSI Based Text Analytics White Paper
Content Analyst - Conceptualizing LSI Based Text Analytics White PaperJohn Felahi
 
Webinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrWebinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrLucidworks
 
Extracting and Reducing the Semantic Information Content of Web Documents to ...
Extracting and Reducing the Semantic Information Content of Web Documents to ...Extracting and Reducing the Semantic Information Content of Web Documents to ...
Extracting and Reducing the Semantic Information Content of Web Documents to ...ijsrd.com
 
Keyphrase Extraction using Neighborhood Knowledge
Keyphrase Extraction using Neighborhood KnowledgeKeyphrase Extraction using Neighborhood Knowledge
Keyphrase Extraction using Neighborhood KnowledgeIJMTST Journal
 

Similar to The search engine index (20)

Elasticsearch and Spark
Elasticsearch and SparkElasticsearch and Spark
Elasticsearch and Spark
 
Demystifying analytics in e discovery white paper 06-30-14
Demystifying analytics in e discovery   white paper 06-30-14Demystifying analytics in e discovery   white paper 06-30-14
Demystifying analytics in e discovery white paper 06-30-14
 
IRJET - BOT Virtual Guide
IRJET -  	  BOT Virtual GuideIRJET -  	  BOT Virtual Guide
IRJET - BOT Virtual Guide
 
Tovek Presentation by Livio Costantini
Tovek Presentation by Livio CostantiniTovek Presentation by Livio Costantini
Tovek Presentation by Livio Costantini
 
Topic detecton by clustering and text mining
Topic detecton by clustering and text miningTopic detecton by clustering and text mining
Topic detecton by clustering and text mining
 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text Analytics
 
Conceptual foundations of text mining and preprocessing steps nfaoui el_habib
Conceptual foundations of text mining and preprocessing steps nfaoui el_habibConceptual foundations of text mining and preprocessing steps nfaoui el_habib
Conceptual foundations of text mining and preprocessing steps nfaoui el_habib
 
G04124041046
G04124041046G04124041046
G04124041046
 
professional fuzzy type-ahead rummage around in xml type-ahead search techni...
professional fuzzy type-ahead rummage around in xml  type-ahead search techni...professional fuzzy type-ahead rummage around in xml  type-ahead search techni...
professional fuzzy type-ahead rummage around in xml type-ahead search techni...
 
USING GOOGLE’S KEYWORD RELATION IN MULTIDOMAIN DOCUMENT CLASSIFICATION
USING GOOGLE’S KEYWORD RELATION IN MULTIDOMAIN DOCUMENT CLASSIFICATIONUSING GOOGLE’S KEYWORD RELATION IN MULTIDOMAIN DOCUMENT CLASSIFICATION
USING GOOGLE’S KEYWORD RELATION IN MULTIDOMAIN DOCUMENT CLASSIFICATION
 
Searching and Analyzing Qualitative Data on Personal Computer
Searching and Analyzing Qualitative Data on Personal ComputerSearching and Analyzing Qualitative Data on Personal Computer
Searching and Analyzing Qualitative Data on Personal Computer
 
The need for sophistication in modern search engine implementations
The need for sophistication in modern search engine implementationsThe need for sophistication in modern search engine implementations
The need for sophistication in modern search engine implementations
 
XXIX Charleston 2009 Silverchair Kerner
XXIX Charleston 2009 Silverchair KernerXXIX Charleston 2009 Silverchair Kerner
XXIX Charleston 2009 Silverchair Kerner
 
leewayhertz.com-What role do embeddings play in a ChatGPT-like model.pdf
leewayhertz.com-What role do embeddings play in a ChatGPT-like model.pdfleewayhertz.com-What role do embeddings play in a ChatGPT-like model.pdf
leewayhertz.com-What role do embeddings play in a ChatGPT-like model.pdf
 
Oops Concepts
Oops ConceptsOops Concepts
Oops Concepts
 
Content Analyst - Conceptualizing LSI Based Text Analytics White Paper
Content Analyst - Conceptualizing LSI Based Text Analytics White PaperContent Analyst - Conceptualizing LSI Based Text Analytics White Paper
Content Analyst - Conceptualizing LSI Based Text Analytics White Paper
 
Webinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrWebinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with Solr
 
Beautiful Research Data (Structured Data and Open Refine)
Beautiful Research Data (Structured Data and Open Refine)Beautiful Research Data (Structured Data and Open Refine)
Beautiful Research Data (Structured Data and Open Refine)
 
Extracting and Reducing the Semantic Information Content of Web Documents to ...
Extracting and Reducing the Semantic Information Content of Web Documents to ...Extracting and Reducing the Semantic Information Content of Web Documents to ...
Extracting and Reducing the Semantic Information Content of Web Documents to ...
 
Keyphrase Extraction using Neighborhood Knowledge
Keyphrase Extraction using Neighborhood KnowledgeKeyphrase Extraction using Neighborhood Knowledge
Keyphrase Extraction using Neighborhood Knowledge
 

More from CJ Jenkins

I am an experience designer
I am an experience designer I am an experience designer
I am an experience designer CJ Jenkins
 
How Sentiment Analysis works
How Sentiment Analysis worksHow Sentiment Analysis works
How Sentiment Analysis worksCJ Jenkins
 
Using construction grammar in conversational systems
Using construction grammar in conversational systemsUsing construction grammar in conversational systems
Using construction grammar in conversational systemsCJ Jenkins
 
Knowledgebase vs Database
Knowledgebase vs DatabaseKnowledgebase vs Database
Knowledgebase vs DatabaseCJ Jenkins
 
Building a semantic website
Building a semantic websiteBuilding a semantic website
Building a semantic websiteCJ Jenkins
 
Search Engine Spiders
Search Engine SpidersSearch Engine Spiders
Search Engine SpidersCJ Jenkins
 
Twitter for business
Twitter for businessTwitter for business
Twitter for businessCJ Jenkins
 

More from CJ Jenkins (7)

I am an experience designer
I am an experience designer I am an experience designer
I am an experience designer
 
How Sentiment Analysis works
How Sentiment Analysis worksHow Sentiment Analysis works
How Sentiment Analysis works
 
Using construction grammar in conversational systems
Using construction grammar in conversational systemsUsing construction grammar in conversational systems
Using construction grammar in conversational systems
 
Knowledgebase vs Database
Knowledgebase vs DatabaseKnowledgebase vs Database
Knowledgebase vs Database
 
Building a semantic website
Building a semantic websiteBuilding a semantic website
Building a semantic website
 
Search Engine Spiders
Search Engine SpidersSearch Engine Spiders
Search Engine Spiders
 
Twitter for business
Twitter for businessTwitter for business
Twitter for business
 

Recently uploaded

Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Hyundai Motor Group
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 

Recently uploaded (20)

Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 

The search engine index

  • 1. The Search Engine Index http://scienceforseo.blogspot.com IR tutorial series: Part 1
  • 2. What is an index? The word “index” can mean many things in computing, but in the case of search engines, it can be defined as: A database where information (after being collected, parsed and processed) is stored to allow for quick retrieval. Cache-based engines store the index along with the corpus (collection of documents). When something is added to the corpus, the index is updated.
  • 3. “Index” We call it that because it's exactly what we called it when it was one of these: And that took its name from the index finger Photo from: http://www.homeschoolinthewoods.com
  • 4. Why use an index? If we didn't have an index, it would take too much time to search through the whole corpus to find documents that matched our query. Creating an index means that the retrieval process is faster and the accuracy is better. The search engine doesn't need to scan each document to know what it's about – this saves on storage and makes the whole process faster.
  • 5.
  • 6.
  • 7. The inverted index It is an index which has terms marked as keys. These map to the document they appear in. The index is sorted by its keys and works well with Boolean operators (AND,OR, AND NOT) We find the documents by matching the terms – this is why we say it is inverted. Diagram by http://developer.apple.com/
  • 8. Limitations It can only tell us if a word occurs in a particular document. It can't tell us how often it occurs or its location in the document, it also can't rank those documents either. That information is very important because it helps the search engine determine how relevant to a query a document is. so... we look at latent semantic indexing (LSI)
  • 9. LSI “ Semantic” = meaning “ Latent” = present but hidden It is the analysis of the hidden meaning of words and how often they occur in a document. It can infer meaning from words which isn't obvious: Computer – PC – Laptop => connected It can put together documents that are not obviously created. It can do this because it creates a “latent semantic space”
  • 10. How does LSI work? It uses lots of vectors and creates a “term document matrix” from all the documents it has. Then 3 matrices are created using SVD (“singular value decomposition”) Of these 3 vectors, the 2 nd contains the singular values of the original matrix in a diagonal matrix Sets of documents are represented as d-dimensional vectors Using the cosine of the angle between these vectors, there is now an easy-to-calculate similarity measure between any two sets of terms and/or documents.
  • 11. A quick sketch of LSI Sets of terms and documents = d-dimensional vectors There are however some big limitations to this method.... Term document matrix Box of documents Lots of vectors Matrix 1 Matrix 2 Matrix 3
  • 12. The resulting dimensions can be very difficult to interpret so there are mistakes. It's unclear what the resulting similarities between terms really mean. The input is a bag-of-words so we don't have any text structure information. A compound term (“bull-headed”) is treated as 2 terms. Ambiguous terms create noise in the vector space There's no way to define the optimal dimensionality of the vector space There's a time complexity for SVD in dynamic collections
  • 13. PLSI “ Probabilistic latent semantic indexing” is a better choice because: It has a more robust statistical foundation and provides a proper generative data model It uses the EM algorithm (Expectation maximization to avoid over-fitting (nodes too specific to noise)) - this makes it far more flexible It can deal with domain specific synonymy and polysemous words
  • 14. What did all that mean? “ Generative data model” - It's used for randomly generating observed data from unknown parameters (HMMs are generative data models for example) “ EM algorithm” - it finds the maximum likelihood estimate of parameters in a probabilistic model (where the model depends on unobserved latent variables) – good for machine learning and data clustering. Synonymy – It's the synonym relation between words. A synonym is when 2 different words mean the same thing. Polysemous – a word that has multiple meanings or interpretations
  • 15.
  • 16. How is it different to LSI? The order of the words is lost (but results are still good due to word co-occurrence) Documents can be represented by numeric vectors in a space of words It retrieves topics Each query uses the cosine similarity metric to find the similarity between vectors.
  • 17. More indexing difficulties It's easy for us to pick a document and classify it, well most of the time, but search engines have other difficulties to over come before even getting to the classification stage.
  • 18. Tokenization Machines don't understand sentences in text. They see everything in bytes. Consider: The dog ran in the field We see 6 words. Machine sees 24 characters (chars) The words found in a document are called “tokens”. Information is extracted from documents to be placed in the index. The tokens may be email addresses, words, URLs,... The Part-Of-Speech, line number, sentence number, size and so on can be stored in the index.
  • 19. Section recognition Before tokenization happens, all the major parts of a document are identified. Some documents are newsletters other have a side navigation, some are reports...and the text can be displayed in columns. Machines will read this sequentially though and index the word sequentially as well. The difficulty is finding which view of the document is informative. Some engines will index an abstract representation of the document instead. Most engines don't though. This is also why using JavaScript for example is avoided.
  • 20. Formats Documents come in all flavours on the web. There are documents in HTML, PDF, EXCEL, Powerpoint, and so many others. Before documents are analysed, they are stripped down and the formatting extracted. They are "normalised". It's important for the search engine to not misread "markup" information for content or the index gets polluted.
  • 21. To conclude... The indexing process of a search engine is really very important because if this is wrong, everything is wrong. This is why “Spamdexing” is such an issue. There are a lot of very specialised areas of computing who focus their work on making it easier for machines to create an index. Don't let this short presentation fool you, it is a very very big research issue. Natural language processing is used for rich text analysis, which helps identify what's going on so that the other computational elements can do their job.
  • 22. Resources The inverted index in detail http://tinyurl.com/65hbfd The seminal PLSI paper http://tinyurl.com/54wd76 The seminal LSI paper http://tinyurl.com/5e8v36 The semantic indexing project http://knowledgesearch.org/ Boulder Uni on LSA http://lsa.colorado.edu/ Apache Lucene http://lucene.apache.org/java/docs/ Google test data ($150) http://tinyurl.com/62t4la