 Tokenization is the process of breaking a stream of text up into words, phrases, symbols,
or other meaningful elements called tokens.
 A tokenizer relies on simple heuristics, for example:
 A continuous run of alphabetic characters is part of one token
 Tokens are separated by whitespace or punctuation
 Punctuation and whitespace may or may not be included in the resulting list of tokens
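The heuristics above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pattern is ASCII-only, runs of digits are treated as tokens too, and the names `TOKEN_PATTERN`, `tokenize` and `keep_punctuation` are hypothetical, not taken from the slides.

```python
import re

# Assumed pattern: a run of letters, a run of digits, or one other symbol.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9\s]")

def tokenize(text, keep_punctuation=False):
    """Split text into tokens, using whitespace and punctuation as separators."""
    tokens = TOKEN_PATTERN.findall(text)
    if keep_punctuation:
        return tokens
    # Drop punctuation tokens and keep only letters and digits.
    return [t for t in tokens if t.isalnum()]

print(tokenize("Flights to London, please!"))
print(tokenize("Flights to London, please!", keep_punctuation=True))
```

The `keep_punctuation` flag mirrors the last bullet: whether punctuation ends up in the token list is a design choice, not a fixed rule.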
 Stop words are words filtered out before or after the processing of natural language data
 There is no single definitive list of stop words
 Features
 Extremely common words
 Contribute little to selecting the documents that match a query
 Most common example
 Function words such as the, is, at, which, on
 In some systems, other very common words, including lexical words
 Strategy –
 Sort the terms by collection frequency
 Take the most frequent terms as the stop list
 Advantage – using a stop list greatly reduces the number of postings a system has to store
 Exception – phrase searches such as “Flight to London” can lose their meaning when stop words are removed
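The frequency-based strategy can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and lowercase matching; the function names and the toy documents are hypothetical. In practice a list built this way is usually reviewed by hand before use.

```python
from collections import Counter

def build_stop_list(documents, size):
    """Return the `size` terms with the highest collection frequency."""
    collection_frequency = Counter()
    for doc in documents:
        # Collection frequency counts every occurrence across all documents.
        collection_frequency.update(doc.lower().split())
    return {term for term, _ in collection_frequency.most_common(size)}

def remove_stop_words(tokens, stop_list):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t.lower() not in stop_list]

docs = [
    "the cat sat on the mat",
    "the dog is on the log",
    "a flight to london is on the board",
]
stop_list = build_stop_list(docs, size=2)
print(stop_list)
print(remove_stop_words("the cat is on the mat".split(), stop_list))
```

Fewer distinct terms in the index means fewer postings to store, which is the advantage noted above.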
The goal of stemming and lemmatisation is to reduce inflectional forms and derivationally related forms of a word to a common
base form. For example:
am, are, is => be
car, cars, car’s, cars’ => car
Applying this kind of mapping to a text would give something like:
The boy’s cars are different colours => the boy car be differ colour
 Stemming usually refers to a crude heuristic process that chops off the ends of words. It often
includes the removal of derivational affixes
 Lemmatisation refers to doing things properly with the use of a vocabulary and
morphological analysis of words
 Aims to remove only the inflectional endings
 Returns the base or dictionary form of a word, called the lemma
 Example for the token saw:
stemming => s
lemmatisation => see or saw, depending on whether the token is used as a verb or a noun
 Porter's algorithm consists of 5 phases of word reductions, applied sequentially
 Within each phase there are various conventions to select rules
 Measure of a word – loosely checks the number of syllables to see whether a word is long
enough that it is reasonable to regard the matching portion of a rule as a suffix rather
than as part of the stem of the word
(m>1) EMENT ->
Would map replacement to replac but not cement to c
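The measure condition can be made concrete with a short sketch. It assumes lowercase input and uses Porter's usual definition of the measure m as the number of vowel-consonant sequences in the form [C](VC)^m[V], with y treated as a vowel when it follows a consonant. The function names are hypothetical, and only this single rule is implemented, not the full stemmer.

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    """True if word[i] is a consonant under Porter's definition."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # 'y' is a vowel only when it follows a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count vowel-to-consonant transitions: the m in [C](VC)^m[V]."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and prev_vowel:
            m += 1
        prev_vowel = not consonant
    return m

def strip_ement(word):
    """Apply the single rule (m>1) EMENT -> (empty)."""
    suffix = "ement"
    if word.endswith(suffix):
        stem = word[: -len(suffix)]
        if measure(stem) > 1:
            return stem
    return word

print(strip_ement("replacement"))  # replac
print(strip_ement("cement"))       # cement
```

For replacement the candidate stem replac has m = 2, so the suffix is removed. For cement the candidate stem c has m = 0, so the word is left alone.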
 The Porter stemmer stems all of the following words – operate, operating, operates, operation,
operative, operatives, operational – to oper
 We will lose considerable precision on queries such as
 Operational and research
 Operating and system
 Operative and dentistry
 Lookup Algorithm
 Looks for the inflected form in a lookup table
 Simple and fast, and exceptions are easy to handle
 New or unfamiliar words are not handled
 The production technique
 The lookup table is generally produced semi-automatically
 Ex. run => running, runs, runned, runly
 The last two forms are valid constructions but unlikely words
 Suffix-stripping algorithm
 A set of rules provides a path for the algorithm, for example
 if the word ends in 'ed', remove the 'ed'
 if the word ends in 'ing', remove the 'ing'
 if the word ends in 'ly', remove the 'ly'
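Both approaches can be sketched side by side. This is a toy illustration, not a production stemmer: `generate_forms` assumes a short root whose final consonant doubles (as in run), and all function names are hypothetical.

```python
def generate_forms(root):
    """Naively inflect a root; some of the generated forms are unlikely words."""
    doubled = root + root[-1]  # run -> runn (assumed consonant doubling)
    return [doubled + "ing", root + "s", doubled + "ed", root + "ly"]

def build_lookup_table(roots):
    """Map every generated inflected form back to its root."""
    return {form: root for root in roots for form in generate_forms(root)}

def lookup_stem(word, table):
    """Return the root from the table; unknown words come back unchanged."""
    return table.get(word, word)

SUFFIX_RULES = ("ed", "ing", "ly")

def strip_suffix(word):
    """Remove the first matching suffix, as long as something is left over."""
    for suffix in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word

table = build_lookup_table(["run"])
table["ran"] = "run"  # an exception is handled by adding one entry

print(lookup_stem("running", table))  # run
print(lookup_stem("jogging", table))  # jogging (not in the table)
print(strip_suffix("jogging"))        # jogg
print(strip_suffix("quickly"))        # quick
```

The lookup table returns clean roots but fails on unseen words, while the suffix rules cover any word but can leave rough stems such as jogg.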
 Named entity recognition is a subtask of information extraction
 Seeks to locate and classify elements in text into pre-defined categories such as the names of persons,
organizations, locations, quantities, monetary values
 Takes an unannotated block of text, like
Jim bought 300 shares of Acme Corp. in 2006
and produces an annotated block of text that highlights the names of entities
[Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time
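The input and output format above can be reproduced with a toy annotator. This is only a sketch of the annotation format: it assumes a hand-written gazetteer and a simple year pattern, both hypothetical, whereas real systems typically rely on statistical or neural models.

```python
import re

# Hypothetical gazetteer of known names and their categories.
GAZETTEER = {
    "Jim": "Person",
    "Acme Corp.": "Organization",
}
# Assumed pattern: a four-digit year starting with 19 or 20.
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")

def annotate(text):
    """Wrap known names and years in [entity]Category markup."""
    for name, label in GAZETTEER.items():
        text = text.replace(name, f"[{name}]{label}")
    return YEAR_PATTERN.sub(lambda m: f"[{m.group(0)}]Time", text)

print(annotate("Jim bought 300 shares of Acme Corp. in 2006"))
```

Plain string replacement would also match a name inside a longer word, which is one reason this approach does not scale beyond a demonstration.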
 A part of speech is a linguistic category of words
 Noun – any abstract or concrete entity: a person, place, thing, or idea
 Pronoun – a substitute for a noun or noun phrase
 Adjective – a qualifier of a noun
 Verb – an action, occurrence, or state of being
 Adverb – any qualifier of an adjective, verb, or other adverb
 Preposition – any establisher of relation or syntactic context
 Conjunction – any syntactic connector
 Interjection – an emotional greeting or exclamation
 Parsing is the process of analysing a string of symbols according to the rules of a grammar
 Analysis of a sentence by a computer into its constituents
 Results in a parse tree showing the syntactic relation of the constituents to each other
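A parse tree can be represented with a very small data structure. The sketch below assumes a hand-built constituency tree stored as nested tuples of the form (label, child, ...). The tree, the labels and the function names are illustrative assumptions, and no actual parser is implemented.

```python
# Hypothetical hand-built tree for "Jim bought shares".
tree = (
    "S",
    ("NP", ("N", "Jim")),
    ("VP", ("V", "bought"), ("NP", ("N", "shares"))),
)

def leaves(node):
    """Return the words at the leaves, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words

def bracketed(node):
    """Render the tree in bracketed notation."""
    if isinstance(node, str):
        return node
    label, *children = node
    return "(" + label + " " + " ".join(bracketed(c) for c in children) + ")"

print(leaves(tree))
print(bracketed(tree))
```

Each inner tuple is a constituent, so the nesting itself records how the constituents relate to each other.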

Information retrieval
