Part-of-speech Tagging for Web Search Queries using a Large-scale Web Corpus
◯Atsushi Keyaki†, Jun Miyazaki†
†: Tokyo Institute of Technology, Japan
SAC2017 IAR
Objective
•  Accurate part-of-speech (POS) tagging of Web queries
  o  POS tags are beneficial for accurate IR
    •  Different search strategy per POS tag [1]
    •  Identifying unnecessary data with POS tags [2]
  o  Example
    •  Query: “discovery channel” (TV program: proper nouns)
    •  Doc: “Victim’s discovery is broadcasted by the channel” (“discovery” and “channel” are common nouns)
    •  A POS tag mismatch may cause a false positive
[1]  Crestani et al.: “Short Queries, Natural Language and Spoken Document Retrieval: Experiments at Glasgow University”, TREC-6, 1998.
[2]  Chowdhury and McCabe: “Improving Information Retrieval Systems using Part of Speech Tagging”, Univ. of Maryland, 1993.
Difficulty in query POS tagging
•  Characteristics of Web queries
  o  Length is short (composed of a few words)
  o  Capitalization is missing
  o  Word order is fairly free
•  It is difficult to correctly identify POS tags with existing morphological analysis tools, which were developed for natural language
•  Solution of related work [3][4]
  o  Utilizing the results of sentence-level morphological analysis
    •  Sentences are based on natural language grammar
    •  Results of sentence-level morphological analysis are accurate
  o  The POS tag frequently assigned in sentences is employed for the query
  o  Example
    •  Sentence: “We stayed at Ritz Carlton.” (We: pronoun, stayed: verb, at: particle, Ritz Carlton: proper noun)
    •  Query: “ritz carlton” (proper noun)
[3]  Bendersky et al.: “Structural Annotation of Search Queries Using Pseudo Relevance Feedback”, CIKM2010.
[4]  K. Ganchev et al.: “Using Search-Logs to Improve Query Tagging”, ACL2012.
Our approach
•  Related study
  o  Using sentence-level morphological analysis of
    •  Search results [3]
    •  Snippets from search logs [4]
  o  Considering just the freq. of assigned POS tags
  o  Relies on a small amount of highly relevant information
  o  User feedback/search logs are not always available
•  Our approach
  o  Taking account of global statistics from a large corpus
    •  Easily available, considering the long tail
  o  Considering co-occurrence of query terms
April 5, 2017, SAC2017 IAR
Preliminary investigation
•  Morphological analysis of Web queries
  o  Queries
    •  TREC Web track topics (200 queries from 2009-2012)
      o  Oracle POS tags were annotated by three assessors (high agreement, Kappa: 0.98)
      o  Assessors referred to the description (information need)
  o  Morphological analysis tool
    •  Stanford Log-linear Part-Of-Speech Tagger [5]
  o  Model
    •  Default model
    •  Caseless model
      o  Does not consider capitalization information during training
      o  Tries to solve the “Capitalization is missing” problem
[5]  Toutanova et al.: “Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network”, NAACL 2003.
Summary of error analysis
•  Default model
  o  Only half of the query terms were assigned correct POS tags
  o  Almost no proper nouns were identified
    •  72% of proper nouns were mistakenly tagged as common nouns
    •  Errors: “obama”, “india”, “ritz carlton”, “discovery channel”
•  Caseless model
  o  Around 75% of the query terms were assigned correct POS tags
  o  Many proper nouns were identified
    •  Some common nouns were mistakenly identified as proper nouns
    •  Errors caused by partially applied grammatical rules
      o  “lower heart rate”: verb mis-tagged as adjective (rule: adjectives come before common nouns)
      o  “gs pay rate”: common noun mis-tagged as verb (rule: verbs come after a subject)
Proposed POS tagging
•  Summary of the error analysis
  o  Proper nouns/common nouns cannot be distinguished
    •  Problem1: Capitalization is missing
  o  Grammatical rules are mistakenly applied
    •  Problem2: Word order is fairly free
•  Related study
  o  Relies on a small amount of highly relevant information
    •  Problem3: User feedback and user logs are not always available
•  Approach
  o  Sol-P1: Sentence-level morphological analysis
  o  Sol-P2: Proposing a POS tagging method not based on word order
  o  Sol-P3: Large-scale Web corpus (easily available)
  o  Building the term-POS database (TPDB)
    •  Morphological analysis is applied offline
Processing flow
[Figure: processing flow of the proposed method]
•  Offline
  o  Morphological analysis is applied to each sentence of the large-scale Web corpus
    •  S1: tA tB tC → tA/P1 tB/P2 tC/P3
    •  S2: tA tC tD → tA/P1 tC/P4 tD/P5
    •  S3: tC tE tA tF → tC/P3 tE/P1 tA/P2 tF/P1
    •  S4: tB tD → tB/P2 tD/P3
  o  The POS-tagged sentences are inserted into the TPDB
•  Online
  o  Query: “tA tC”
  o  TPDB entries containing the query terms (S1, S2, S3) are retrieved
  o  Term-POS pairs of the query terms (e.g. tA/P1 tC/P3, tA/P1 tC/P4) are scored by the scoring method
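The offline/online split in the processing flow can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' implementation. The in-memory dictionary standing in for the TPDB, the inverted index, and the function names `build_tpdb` and `retrieve` are all assumptions. The tagged sentences S1 to S4 are the placeholder sentences from the figure and stand in for real tagger output.

```python
from collections import defaultdict

# Placeholder tagged sentences from the figure (term, POS); in practice these
# would come from running a morphological analyzer over the Web corpus.
TAGGED_SENTENCES = {
    "S1": [("tA", "P1"), ("tB", "P2"), ("tC", "P3")],
    "S2": [("tA", "P1"), ("tC", "P4"), ("tD", "P5")],
    "S3": [("tC", "P3"), ("tE", "P1"), ("tA", "P2"), ("tF", "P1")],
    "S4": [("tB", "P2"), ("tD", "P3")],
}


def build_tpdb(tagged_sentences):
    """Offline step (hypothetical layout): keep the tagged sentences and an
    inverted index from each term to the IDs of sentences containing it."""
    index = defaultdict(set)
    for sid, pairs in tagged_sentences.items():
        for term, _pos in pairs:
            index[term].add(sid)
    return {"entries": dict(tagged_sentences), "index": dict(index)}


def retrieve(tpdb, query_terms):
    """Online step: return entries containing at least one query term,
    keeping only the term/POS pairs that belong to query terms."""
    sids = set()
    for term in query_terms:
        sids |= tpdb["index"].get(term, set())
    wanted = set(query_terms)
    return {
        sid: [(t, p) for t, p in tpdb["entries"][sid] if t in wanted]
        for sid in sorted(sids)
    }


tpdb = build_tpdb(TAGGED_SENTENCES)
hits = retrieve(tpdb, ["tA", "tC"])  # S4 contains neither term and is skipped
```

Keeping the index separate from the entries means the expensive tagging happens once offline, while a query only touches the sentences that mention its terms.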
Scoring for POS tagging
•  Design principle
  o  POS tags that frequently appear in the corpus are assigned to queries
  o  POS tags of a sentence are emphasized when the sentence contains more kinds of query terms
    •  Co-occurrence of query terms is a useful clue
•  Steps of scoring
  o  Retrieving entries which contain query terms from the TPDB
  o  Breaking the query down into pairs of query terms
    •  Query: “tA tB tC” → {tA tB}, {tA tC}, {tB tC}
  o  Counting entries per term-POS pair for each query term pair
    •  Query term pair: {tA tB}

      term-POS pair    freq.   normalized freq.
      tA/P1 tB/P2      5       0.33 (5/15)
      tA/P1 tB/P3      3       0.20 (3/15)
      tA/P2 tB/P4      7       0.47 (7/15)

    •  freq.: num. of entries containing both term-POS pairs (e.g. tA/P1 and tB/P2)
  o  Scoring with three proposed methods
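The counting and normalization steps above can be sketched as follows. This is a hedged illustration under stated assumptions: the function names `pair_counts` and `normalize` are hypothetical, a term that occurs twice in one sentence keeps its first tag (the slides do not say how this case is handled), and the sample entries are fabricated only to reproduce the 5/3/7 counts of the {tA tB} table.

```python
from collections import Counter
from itertools import combinations


def pair_counts(entries, query_terms):
    """For each pair of query terms, count entries per combination of
    term-POS pairs. `entries` is an iterable of sentences, each a list of
    (term, POS) tuples."""
    counts = {pair: Counter() for pair in combinations(query_terms, 2)}
    for sentence in entries:
        tags = {}
        for term, pos in sentence:
            tags.setdefault(term, pos)  # assumption: first occurrence wins
        for t1, t2 in counts:
            if t1 in tags and t2 in tags:
                counts[(t1, t2)][((t1, tags[t1]), (t2, tags[t2]))] += 1
    return counts


def normalize(counter):
    """Divide each frequency by the total for its query term pair."""
    total = sum(counter.values())
    return {key: freq / total for key, freq in counter.items()}


# Illustrative entries matching the {tA tB} table: 5, 3 and 7 entries.
entries = (
    [[("tA", "P1"), ("tB", "P2")]] * 5
    + [[("tA", "P1"), ("tB", "P3")]] * 3
    + [[("tA", "P2"), ("tB", "P4")]] * 7
)
counts = pair_counts(entries, ["tA", "tB", "tC"])
ab = counts[("tA", "tB")]
ab_norm = normalize(ab)  # 5/15, 3/15 and 7/15
```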
Three proposed methods
•  MaxFreq
  o  The most frequently appearing POS tag (highest freq.) is assigned
•  MostLikelihood
  o  The POS tag with the highest normalized freq. is assigned
  o  Motivation: MaxFreq may be affected by frequently appearing terms
•  AllCombi
  o  The POS tag with the highest sum of term-POS frequencies is assigned
  o  Motivation: MaxFreq and MostLikelihood only focus on the POS tag with the highest frequency/normalized frequency
  o  More diversified context, including the long tail, can be considered
•  Example for query “tA tB tC”

      pair    term-POS pair    freq.   normalized freq.
      tA:tB   tA/P1 tB/P2      5       0.33
      tA:tB   tA/P1 tB/P3      3       0.20
      tA:tB   tA/P2 tB/P4      7       0.47
      tA:tC   tA/P1 tC/P2      3       0.43
      tA:tC   tA/P3 tC/P3      4       0.57
      tB:tC   tB/P1 tC/P2      5       0.5
      tB:tC   tB/P2 tC/P2      5       0.5

•  MaxFreq example: tA/P2 is assigned to tA (highest freq.: 7, from tA/P2 tB/P4)
•  MostLikelihood example: tA/P3 is assigned to tA (highest normalized freq.: 0.57, from tA/P3 tC/P3)
•  AllCombi example: tA/P1 is assigned to tA (highest sum of freq.: 5 + 3 + 3 = 11)
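The three scoring methods can be compared side by side on the example tables for query “tA tB tC”. The sketch below is an illustration under assumptions, not the authors' code. The function names and the `TABLES` layout are hypothetical, normalization is done per query term pair as in the {tA tB} table, and ties are broken by taking the first candidate because the slides do not specify a rule.

```python
from collections import defaultdict

# Rows copied from the example tables: (term-POS pair 1, term-POS pair 2, freq.),
# grouped by query term pair.
TABLES = {
    ("tA", "tB"): [(("tA", "P1"), ("tB", "P2"), 5),
                   (("tA", "P1"), ("tB", "P3"), 3),
                   (("tA", "P2"), ("tB", "P4"), 7)],
    ("tA", "tC"): [(("tA", "P1"), ("tC", "P2"), 3),
                   (("tA", "P3"), ("tC", "P3"), 4)],
    ("tB", "tC"): [(("tB", "P1"), ("tC", "P2"), 5),
                   (("tB", "P2"), ("tC", "P2"), 5)],
}


def candidates(term):
    """Yield (POS, freq, normalized freq) for every row that mentions `term`."""
    for rows in TABLES.values():
        total = sum(freq for _, _, freq in rows)
        for left, right, freq in rows:
            for t, pos in (left, right):
                if t == term:
                    yield pos, freq, freq / total


def max_freq(term):
    """MaxFreq: POS tag of the row with the highest raw frequency."""
    return max(candidates(term), key=lambda c: c[1])[0]


def most_likelihood(term):
    """MostLikelihood: POS tag of the row with the highest normalized frequency."""
    return max(candidates(term), key=lambda c: c[2])[0]


def all_combi(term):
    """AllCombi: POS tag with the highest sum of frequencies over all rows."""
    sums = defaultdict(int)
    for pos, freq, _ in candidates(term):
        sums[pos] += freq
    return max(sums, key=sums.get), dict(sums)


best, sums = all_combi("tA")
```

For tA the three methods disagree (P2, P3 and P1 respectively), which matches the tags shown on the slides and shows why the choice of scoring method matters.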
Experiment
•  Datasets
  o  TREC Web track topics
    •  200 queries from 2009-2012
  o  MS-251 (skipped here because the trend is the same)
    •  Microsoft search log used in related studies [3][4]
•  Large-scale Web corpus
  o  ClueWeb09 Category B
    •  50 million Web documents
•  Evaluated methods
  o  Proposed methods: MaxFreq, MostLikelihood, AllCombi
  o  Existing methods: Stanford, Caseless, SingleFreq
    •  SingleFreq: the most frequently appearing POS tag is assigned
POS-tagged Web track topics
•  AllCombi: the highest precision among the proposed methods for all terms, common noun, proper noun
  o  Good at judging nouns
  o  Considering more diversified context is useful
    •  Global statistics from a large-scale Web corpus are useful
•  MaxFreq and MostLikelihood: the highest precision among the proposed methods for common noun, verb, adjective
•  Every proposed method significantly outperformed Caseless (sign test)

Precision        All query terms  Common noun  Proper noun  Verb   Adjective  Sign test with Caseless
MaxFreq          .814             .825         .833         .769   .647       p < 0.05
MostLikelihood   .814             .825         .833         .769   .647       p < 0.05
AllCombi         .821             .825         .860         .714   .629       p < 0.01
Caseless         .763             .789         .751         .733   .690
SingleFreq       .702             .775         .670         .533   .581
Stanford         .547             .550         1.0          .722   .451
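The significance column comes from a sign test against Caseless. The slides do not say which units were paired (queries or query terms) or whether the test was one- or two-sided, so the following is only a generic sketch of an exact two-sided sign test. The function name `sign_test_p` and the win/loss counts are hypothetical and are not taken from the experiment.

```python
from math import comb


def sign_test_p(wins, losses):
    """Exact two-sided sign test p-value under H0: P(win) = 0.5.
    Ties are assumed to have been dropped before counting."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of a result at least as extreme in one tail, then doubled.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: the proposed method is better on 15 units, worse on 4.
p = sign_test_p(15, 4)
significant = p < 0.05
```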
Effect of the proposed method
•  AllCombi correctly identified many query terms
•  Some errors caused by partially applied grammatical rules still remain
•  Negative effects of the proposed method
  o  “president” in the corpus is often tagged as a proper noun
    •  Term weights need to be normalized
[Table: per-query tagging results, Stanford vs. AllCombi]
Queries: obama, india, ritz carlton, lower heart rate, gs pay rate, president united states
Conclusion
•  POS tagging for Web queries
  o  Uses the results of sentence-level morphological analysis
  o  Uses a large-scale Web corpus
  o  Three scoring methods were proposed
•  Experiments
  o  Considering more diversified context is useful
  o  The best proposed method differs by POS tag
  o  The proposed methods outperformed existing tools and existing studies
•  Future work
  o  A combination of the proposed methods may improve accuracy
  o  Database schema design for fast POS tagging
Default model

POS tags          Precision  Recall
Common noun       .550       .985
Proper noun       1.0        .010
Verb              .722       .867
Adjective         .451       .958
All query terms   .547       .547

•  About half of the query terms were assigned correct POS tags
•  Almost no proper nouns were identified
  o  72% of proper nouns were mistakenly tagged as common nouns
  o  Errors: “obama”, “india”, “ritz carlton”, “discovery channel”
•  Errors caused by partially applied grammatical rules
  o  “lower heart rate”: verb mis-tagged as adjective (rule: adjectives come before common nouns)
  o  “gs pay rate”: common noun mis-tagged as verb (rule: verbs come after a subject)
Caseless model

POS tags          Precision  Recall
Common noun       .789       .769
Proper noun       .751       .640
Verb              .733       .733
Adjective         .690       .833
All query terms   .763       .763

•  Precision and recall improved overall
•  Many proper nouns were identified
  o  31% of proper nouns were mistakenly tagged as common nouns
  o  Precision for proper nouns decreased
•  Harm from partially applied grammatical rules still exists
  o  “discovery channel store”: common noun mis-tagged as proper noun
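The per-tag precision and recall in these tables can be computed from aligned lists of oracle and predicted tags, as sketched below. The function name `precision_recall` and the four-term toy data are assumptions for illustration only. The toy data mimics the Default model pattern, where a proper noun tagged as a common noun leaves proper-noun precision high but recall low.

```python
from collections import Counter


def precision_recall(gold, predicted):
    """Per-tag (precision, recall) plus overall accuracy for two aligned
    lists of POS tags, one tag per query term."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be aligned")
    correct, assigned, actual = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        assigned[p] += 1  # how often the tagger output this tag
        actual[g] += 1    # how often the oracle has this tag
        if g == p:
            correct[g] += 1
    scores = {}
    for tag in set(assigned) | set(actual):
        prec = correct[tag] / assigned[tag] if assigned[tag] else None
        rec = correct[tag] / actual[tag] if actual[tag] else None
        scores[tag] = (prec, rec)
    overall = sum(correct.values()) / len(gold) if gold else None
    return scores, overall


# Toy example (not experimental data): one proper noun is tagged as common noun.
gold = ["proper", "proper", "common", "verb"]
pred = ["common", "proper", "common", "verb"]
scores, overall = precision_recall(gold, pred)
```

When every query term receives exactly one tag, the number of assigned tags equals the number of oracle tags, so overall precision and recall coincide. This is consistent with the identical values in the “All query terms” rows.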
MS-251
•  The trend of the proposed methods is the same
  o  The ratio of POS tags affected the order
    •  AllCombi: good at judging nouns
    •  MaxFreq, MostLikelihood: good at judging verbs and adjectives
  o  The proposed methods are better than [4]

Method                   Precision
MaxFreq                  .890
MostLikelihood           .895
AllCombi                 .893
The best method in [4]   .858

More Related Content

What's hot

Information Extraction from the Web - Algorithms and Tools
Information Extraction from the Web - Algorithms and ToolsInformation Extraction from the Web - Algorithms and Tools
Information Extraction from the Web - Algorithms and Tools
Benjamin Habegger
 
KIT Graduiertenkolloquium 11.05.2016
KIT Graduiertenkolloquium 11.05.2016KIT Graduiertenkolloquium 11.05.2016
KIT Graduiertenkolloquium 11.05.2016
Dr.-Ing. Thomas Hartmann
 

What's hot (17)

Info 2402 irt-chapter_4
Info 2402 irt-chapter_4Info 2402 irt-chapter_4
Info 2402 irt-chapter_4
 
Information Extraction from the Web - Algorithms and Tools
Information Extraction from the Web - Algorithms and ToolsInformation Extraction from the Web - Algorithms and Tools
Information Extraction from the Web - Algorithms and Tools
 
Linguistic markup and transclusion processing in XML documents
Linguistic markup and transclusion processing in XML documentsLinguistic markup and transclusion processing in XML documents
Linguistic markup and transclusion processing in XML documents
 
Netflix Global Search - Lucene Revolution
Netflix Global Search - Lucene RevolutionNetflix Global Search - Lucene Revolution
Netflix Global Search - Lucene Revolution
 
Bio ontologies and semantic technologies
Bio ontologies and semantic technologiesBio ontologies and semantic technologies
Bio ontologies and semantic technologies
 
Deep Natural Language Processing for Search and Recommender Systems
Deep Natural Language Processing for Search and Recommender SystemsDeep Natural Language Processing for Search and Recommender Systems
Deep Natural Language Processing for Search and Recommender Systems
 
Bio ontologies and semantic technologies
Bio ontologies and semantic technologiesBio ontologies and semantic technologies
Bio ontologies and semantic technologies
 
Deep natural language processing in search systems
Deep natural language processing in search systemsDeep natural language processing in search systems
Deep natural language processing in search systems
 
2017 biological databases_part1_vupload
2017 biological databases_part1_vupload2017 biological databases_part1_vupload
2017 biological databases_part1_vupload
 
Neural Architectures for Named Entity Recognition
Neural Architectures for Named Entity RecognitionNeural Architectures for Named Entity Recognition
Neural Architectures for Named Entity Recognition
 
KIT Graduiertenkolloquium 11.05.2016
KIT Graduiertenkolloquium 11.05.2016KIT Graduiertenkolloquium 11.05.2016
KIT Graduiertenkolloquium 11.05.2016
 
PyGotham NY 2017: Natural Language Processing from Scratch
PyGotham NY 2017: Natural Language Processing from ScratchPyGotham NY 2017: Natural Language Processing from Scratch
PyGotham NY 2017: Natural Language Processing from Scratch
 
Phd tesis olga giraldo 10mayo
Phd tesis olga giraldo 10mayoPhd tesis olga giraldo 10mayo
Phd tesis olga giraldo 10mayo
 
Ting-Hao (Kenneth) Huang - 2015 - ACBiMA: Advanced Chinese Bi-Character Word ...
Ting-Hao (Kenneth) Huang - 2015 - ACBiMA: Advanced Chinese Bi-Character Word ...Ting-Hao (Kenneth) Huang - 2015 - ACBiMA: Advanced Chinese Bi-Character Word ...
Ting-Hao (Kenneth) Huang - 2015 - ACBiMA: Advanced Chinese Bi-Character Word ...
 
NAMED ENTITY RECOGNITION
NAMED ENTITY RECOGNITIONNAMED ENTITY RECOGNITION
NAMED ENTITY RECOGNITION
 
master_thesis_greciano_v2
master_thesis_greciano_v2master_thesis_greciano_v2
master_thesis_greciano_v2
 
PhD Comprehensive exam of Masud Rahman
PhD Comprehensive exam of Masud RahmanPhD Comprehensive exam of Masud Rahman
PhD Comprehensive exam of Masud Rahman
 

Similar to Part-of-speech Tagging for Web Search Queries Using a Large-scale Web Corpus

Introduction of semantic technology for SAS programmers
Introduction of semantic technology for SAS programmersIntroduction of semantic technology for SAS programmers
Introduction of semantic technology for SAS programmers
Kevin Lee
 
RFS Search Lang Spec
RFS Search Lang SpecRFS Search Lang Spec
RFS Search Lang Spec
Jing Kang
 

Similar to Part-of-speech Tagging for Web Search Queries Using a Large-scale Web Corpus (20)

Natural Language Processing using Java
Natural Language Processing using JavaNatural Language Processing using Java
Natural Language Processing using Java
 
Applications of Large Language Models in Materials Discovery and Design
Applications of Large Language Models in Materials Discovery and DesignApplications of Large Language Models in Materials Discovery and Design
Applications of Large Language Models in Materials Discovery and Design
 
Improved chemical text mining of patents using infinite dictionaries, transla...
Improved chemical text mining of patents using infinite dictionaries, transla...Improved chemical text mining of patents using infinite dictionaries, transla...
Improved chemical text mining of patents using infinite dictionaries, transla...
 
NLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology ConstraintsNLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology Constraints
 
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesHaystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
 
SIGIR 2011
SIGIR 2011SIGIR 2011
SIGIR 2011
 
Filling the gaps
Filling the gapsFilling the gaps
Filling the gaps
 
Introduction of semantic technology for SAS programmers
Introduction of semantic technology for SAS programmersIntroduction of semantic technology for SAS programmers
Introduction of semantic technology for SAS programmers
 
Spoken Content Retrieval
Spoken Content RetrievalSpoken Content Retrieval
Spoken Content Retrieval
 
Towards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology ApplicationsTowards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology Applications
 
UNDERSTAND SHORTTEXTS BY HARVESTING & ANALYZING SEMANTIKNOWLEDGE
UNDERSTAND SHORTTEXTS BY HARVESTING & ANALYZING SEMANTIKNOWLEDGEUNDERSTAND SHORTTEXTS BY HARVESTING & ANALYZING SEMANTIKNOWLEDGE
UNDERSTAND SHORTTEXTS BY HARVESTING & ANALYZING SEMANTIKNOWLEDGE
 
Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...
 
C:\Fakepath\Learning Through Conversation
C:\Fakepath\Learning Through ConversationC:\Fakepath\Learning Through Conversation
C:\Fakepath\Learning Through Conversation
 
EKAW 2016 - TechMiner: Extracting Technologies from Academic Publications
EKAW 2016 - TechMiner: Extracting Technologies from Academic PublicationsEKAW 2016 - TechMiner: Extracting Technologies from Academic Publications
EKAW 2016 - TechMiner: Extracting Technologies from Academic Publications
 
RFS Search Lang Spec
RFS Search Lang SpecRFS Search Lang Spec
RFS Search Lang Spec
 
Aspects of NLP Practice
Aspects of NLP PracticeAspects of NLP Practice
Aspects of NLP Practice
 
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
 
Semantic Technologies and Programmatic Access to Semantic Data
Semantic Technologies and Programmatic Access to Semantic Data Semantic Technologies and Programmatic Access to Semantic Data
Semantic Technologies and Programmatic Access to Semantic Data
 
The Nature of Information
The Nature of InformationThe Nature of Information
The Nature of Information
 
An Introduction to NLP4L
An Introduction to NLP4LAn Introduction to NLP4L
An Introduction to NLP4L
 

Recently uploaded

Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptxChiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
raffaeleoman
 
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...
Sheetaleventcompany
 
If this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New NigeriaIf this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New Nigeria
Kayode Fayemi
 

Recently uploaded (20)

Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptxChiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
 
SaaStr Workshop Wednesday w/ Lucas Price, Yardstick
SaaStr Workshop Wednesday w/ Lucas Price, YardstickSaaStr Workshop Wednesday w/ Lucas Price, Yardstick
SaaStr Workshop Wednesday w/ Lucas Price, Yardstick
 
Thirunelveli call girls Tamil escorts 7877702510
Thirunelveli call girls Tamil escorts 7877702510Thirunelveli call girls Tamil escorts 7877702510
Thirunelveli call girls Tamil escorts 7877702510
 
BDSM⚡Call Girls in Sector 97 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 97 Noida Escorts >༒8448380779 Escort ServiceBDSM⚡Call Girls in Sector 97 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 97 Noida Escorts >༒8448380779 Escort Service
 
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...
 
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara Services
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara ServicesVVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara Services
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara Services
 
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
 
Call Girl Number in Khar Mumbai📲 9892124323 💞 Full Night Enjoy
Call Girl Number in Khar Mumbai📲 9892124323 💞 Full Night EnjoyCall Girl Number in Khar Mumbai📲 9892124323 💞 Full Night Enjoy
Call Girl Number in Khar Mumbai📲 9892124323 💞 Full Night Enjoy
 
Mohammad_Alnahdi_Oral_Presentation_Assignment.pptx
Mohammad_Alnahdi_Oral_Presentation_Assignment.pptxMohammad_Alnahdi_Oral_Presentation_Assignment.pptx
Mohammad_Alnahdi_Oral_Presentation_Assignment.pptx
 
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort ServiceBDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service
 
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...
 
Introduction to Prompt Engineering (Focusing on ChatGPT)
Introduction to Prompt Engineering (Focusing on ChatGPT)Introduction to Prompt Engineering (Focusing on ChatGPT)
Introduction to Prompt Engineering (Focusing on ChatGPT)
 
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
 
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
 
George Lever - eCommerce Day Chile 2024
George Lever -  eCommerce Day Chile 2024George Lever -  eCommerce Day Chile 2024
George Lever - eCommerce Day Chile 2024
 
Air breathing and respiratory adaptations in diver animals
Air breathing and respiratory adaptations in diver animalsAir breathing and respiratory adaptations in diver animals
Air breathing and respiratory adaptations in diver animals
 
Mathematics of Finance Presentation.pptx
Mathematics of Finance Presentation.pptxMathematics of Finance Presentation.pptx
Mathematics of Finance Presentation.pptx
 
If this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New NigeriaIf this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New Nigeria
 
Presentation on Engagement in Book Clubs
Presentation on Engagement in Book ClubsPresentation on Engagement in Book Clubs
Presentation on Engagement in Book Clubs
 
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdfThe workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
 

Part-of-speech Tagging for Web Search Queries Using a Large-scale Web Corpus

  • 1. ◯Atsushi Keyaki†, Jun Miyazaki† †: Tokyo Institute of Technology, Japan Part-­‐‑of-­‐‑speech  Tagging  for   Web  Search  Queries  using  a   Large-­‐‑scale  Web  Corpus SAC2017  IAR
  • 2. Objective •  Accurate part-of-speech (POS) tagging to Web queries o POS tags are beneficial in accurate IR •  Different search strategy per POS tag [1] •  Identifying unnecessary data with POS tags [2] o Example •  Query: “discovery channel” •  Doc: “Victim’s discovery is broadcasted by the channel” 2 [1]  Crestani  et  al.:  “Short  Queries,  Natural  Language  and  Spoken  Document              Retrieval:  Experiments  at  Glasgow  University”,  TREC-­‐‑6,  1998. [2]  Chowdhury  and  Mccabe:  “Improving  Information  Retrieval  Systems  using            Part  of  Speech  Tagging”,  Univ.  of  Maryland,  1993. POS  tag  mismatch  may  cause  false  positive TV  program  (proper  nouns) common  noun common noun
  • 3. Difficulty in query POS tagging
       •  Characteristics of Web queries
          o  Length is short (composed of a few words)
          o  Capitalization is missing
          o  Word order is fairly free
       Difficult to correctly identify POS tags with existing morphological analysis tools developed for natural language
       •  Solution of related work [3][4]
          o  Utilizing the results of sentence-level morphological analysis
             •  Sentences are based on natural language grammar
             •  Results of sentence-level morphological analysis are accurate
       Sentence: “We stayed at Rif Carlton.”
       Query: “rif carlton”
       [3] Bendersky et al.: "Structural Annotation of Search Queries Using Pseudo Relevance Feedback", CIKM2010.
       [4] K. Ganchev et al.: "Using Search-Logs to Improve Query Tagging", ACL2012.
  • 4. Difficulty in query POS tagging (continued)
       Sentence: “We stayed at Rif Carlton.”
          We: pronoun, stayed: verb, at: particle, Rif Carlton: proper noun
       Query: “rif carlton”
  • 5. Difficulty in query POS tagging (continued)
       Sentence: “We stayed at Rif Carlton.”
          We: pronoun, stayed: verb, at: particle, Rif Carlton: proper noun
       Query: “rif carlton”
          rif carlton: proper noun
  • 6. Difficulty in query POS tagging (continued)
       Sentence: “We stayed at Rif Carlton.”
          We: pronoun, stayed: verb, at: particle, Rif Carlton: proper noun
       Query: “rif carlton”
          rif carlton: proper noun
       The frequently assigned POS tag is employed
  • 7. Our approach
       •  Related studies
          o  Using sentence-level morphological analysis of
             •  Search results [3]
             •  Snippets from search logs [4]
          o  Considering only the frequency of assigned POS tags
          Only a small amount of highly relevant information is used
          User feedback/search logs are not always available
       •  Our approach
          o  Taking account of global statistics from a large corpus
             •  Easily available, covers the long tail
          o  Considering co-occurrence of query terms
       [3] Bendersky et al.: "Structural Annotation of Search Queries Using Pseudo Relevance Feedback", CIKM2010.
       [4] K. Ganchev et al.: "Using Search-Logs to Improve Query Tagging", ACL2012.
       April 5, 2017, SAC2017 IAR
  • 8. Preliminary investigation
       •  Morphological analysis of Web queries
          o  Queries
             •  TREC Web track topics (200 queries from 2009-2012)
                o  Oracle POS tags were annotated by three assessors (high agreement, kappa: 0.98)
                o  Assessors referred to the description (information need)
          o  Morphological analysis tool
             •  Stanford Log-linear Part-Of-Speech Tagger [5]
          o  Models
             •  Default model
             •  Caseless model
                o  Does not consider capitalization information during training
                o  Tries to solve the “Capitalization is missing” problem
       [5] Toutanova et al.: "Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network", NAACL 2003.
  • 9. Summary of error analysis
       •  Default model
          o  Only about half of the query terms were assigned correct POS tags
          o  Almost no proper nouns were identified
             •  72% of proper nouns were mistakenly tagged as common nouns
             •  Errors: “obama”, “india”, “ritz carlton”, “discovery channel”
       •  Caseless model
          o  Around 75% of query terms were assigned correct POS tags
          o  Many proper nouns were identified
             •  Some common nouns were mistakenly identified as proper nouns
       •  Errors caused by a partial grammatical rule
          o  “lower heart rate”: the verb “lower” is tagged as an adjective (rule: adjectives come before common nouns)
          o  “gs pay rate”: the common noun “pay” is tagged as a verb (rule: verbs come after a subject)
  • 10. Proposed POS tagging
       •  Summary of the error analysis
          o  Proper nouns and common nouns cannot be distinguished
             •  Problem 1: Capitalization is missing
          o  Grammatical rules are mistakenly applied
             •  Problem 2: Word order is fairly free
       •  Related studies
          o  Only a small amount of highly relevant information is used
             •  Problem 3: User feedback and user logs are not always available
       •  Approach
          o  Sol-P1: Sentence-level morphological analysis
          o  Sol-P2: A POS tagging method not based on word order
          o  Sol-P3: Large-scale Web corpus (easily available)
          o  Building the term-POS database (TPDB)
             •  Morphological analysis is applied offline
  • 11. Processing flow
       [Figure: processing flow]
       Offline: sentences in the large-scale Web corpus are morphologically analyzed and inserted into the TPDB
          S1: tA tB tC     →  tA/P1 tB/P2 tC/P3
          S2: tA tC tD     →  tA/P1 tC/P4 tD/P5
          S3: tC tE tA tF  →  tC/P3 tE/P1 tA/P2 tF/P1
          S4: tB tD        →  tB/P2 tD/P3
       Online: for the query “tA tC”, entries S1, S2, and S3 are retrieved from the TPDB and passed to the scoring method
          Term-POS pairs shown for the query: tA/P1 tC/P3, tA/P1 tC/P4
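The offline/online split in the processing flow can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slides do not describe the TPDB schema, so the in-memory dictionary, the inverted index, and the names `build_tpdb` and `retrieve` are assumptions made for the example. The toy sentences S1 to S4 are the ones in the figure.

```python
from collections import defaultdict

# Tagged sentences from the processing-flow figure: (term, POS tag) pairs.
TAGGED_SENTENCES = {
    "S1": [("tA", "P1"), ("tB", "P2"), ("tC", "P3")],
    "S2": [("tA", "P1"), ("tC", "P4"), ("tD", "P5")],
    "S3": [("tC", "P3"), ("tE", "P1"), ("tA", "P2"), ("tF", "P1")],
    "S4": [("tB", "P2"), ("tD", "P3")],
}


def build_tpdb(tagged_sentences):
    """Offline step (hypothetical layout): keep the tagged entries and an
    inverted index from each term to the ids of entries containing it."""
    index = defaultdict(set)
    for sentence_id, pairs in tagged_sentences.items():
        for term, _pos in pairs:
            index[term].add(sentence_id)
    return {"entries": dict(tagged_sentences), "index": index}


def retrieve(tpdb, query_terms):
    """Online step: sorted ids of entries that contain every query term."""
    id_sets = [tpdb["index"].get(term, set()) for term in query_terms]
    if not id_sets:
        return []
    return sorted(set.intersection(*id_sets))


tpdb = build_tpdb(TAGGED_SENTENCES)
print(retrieve(tpdb, ["tA", "tC"]))
```

For the query "tA tC" this retrieves S1, S2, and S3, matching the figure; S4 is skipped because it contains neither term.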
  • 12. Scoring for POS tagging
       •  Design principles
          o  POS tags that appear frequently in the corpus are assigned to queries
          o  POS tags of a sentence are emphasized when the sentence contains more kinds of query terms
             •  Co-occurrence of query terms is a useful clue
       •  Steps of scoring
          o  Retrieving entries that contain query terms from the TPDB
          o  Breaking the query down into pairs of query terms
             •  Query: “tA tB tC” → {tA tB} {tA tC} {tB tC}
          o  Counting entries per term-POS pair for each query term pair
             •  Query term pair: {tA tB}
          o  Scoring with the three proposed methods

       Term-POS pair    freq.   normalized freq.
       tA/P1 tB/P2      5       0.33 (5/15)
       tA/P1 tB/P3      3       0.20 (3/15)
       tA/P2 tB/P4      7       0.47 (7/15)
       freq.: number of entries containing both term-POS tags (e.g., tA/P1 and tB/P2)
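The counting step above can be sketched like this. It is an illustrative sketch under assumptions: entries are modeled as lists of (term, POS) tuples, and the function names `count_term_pos_pairs` and `normalize` are made up for the example. The synthetic entries are chosen only so that the counts reproduce the 5 / 3 / 7 table on the slide.

```python
from collections import Counter
from itertools import combinations


def count_term_pos_pairs(entries, query_terms):
    """For each pair of query terms, count the entries in which a given
    term-POS combination co-occurs."""
    counts = {pair: Counter() for pair in combinations(query_terms, 2)}
    for entry in entries:
        tags = {}
        for term, pos in entry:
            tags.setdefault(term, set()).add(pos)
        for (t1, t2), counter in counts.items():
            for p1 in tags.get(t1, ()):
                for p2 in tags.get(t2, ()):
                    counter[((t1, p1), (t2, p2))] += 1
    return counts


def normalize(counter):
    """Divide each frequency by the total for its query term pair."""
    total = sum(counter.values())
    return {key: freq / total for key, freq in counter.items()} if total else {}


# Synthetic entries that reproduce the slide's table for the pair {tA tB}.
entries = (
    [[("tA", "P1"), ("tB", "P2")]] * 5
    + [[("tA", "P1"), ("tB", "P3")]] * 3
    + [[("tA", "P2"), ("tB", "P4")]] * 7
)
freqs = count_term_pos_pairs(entries, ["tA", "tB"])[("tA", "tB")]
for key, share in normalize(freqs).items():
    print(key, freqs[key], round(share, 2))
```

Normalizing within each query term pair gives 5/15, 3/15, and 7/15, the values shown in the table.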
  • 13. Three proposed methods
       •  MaxFreq
          o  The most frequently appearing POS tag (highest freq.) is assigned
       •  MostLikelihood
          o  The POS tag with the highest normalized freq. is assigned
          o  MaxFreq may be affected by frequently appearing terms
       •  AllCombi
          o  The POS tag with the highest sum of term-POS frequencies is assigned
          o  MaxFreq and MostLikelihood focus only on the POS tag with the highest frequency/normalized frequency
          o  More diversified context, including the long tail, can be considered

       Query: “tA tB tC”
       Pair     Term-POS pair    freq.   normalized freq.
       tA:tB    tA/P1 tB/P2      5       0.33
                tA/P1 tB/P3      3       0.20
                tA/P2 tB/P4      7       0.47
       tA:tC    tA/P1 tC/P2      3       0.43
                tA/P3 tC/P3      4       0.57
       tB:tC    tB/P1 tC/P2      5       0.5
                tB/P2 tC/P2      5       0.5
  • 14. Three proposed methods (continued)
       MaxFreq assigns tA/P2 (highest freq.: 7, from tA/P2 tB/P4)
  • 15. Three proposed methods (continued)
  • 16. Three proposed methods (continued)
       MostLikelihood assigns tA/P3 (highest normalized freq.: 0.57, from tA/P3 tC/P3)
  • 17. Three proposed methods (continued)
       AllCombi assigns tA/P1
  • 18. Three proposed methods (continued)
       AllCombi assigns tA/P1 (sum of freq.: 5 + 3 + 3 = 11)
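The three scoring methods can be compared on the toy table for the query "tA tB tC". The sketch below is one reading of the slides, not the authors' code: the `TABLE` layout and the function names are assumptions, candidates for a term are pooled over every query term pair that contains it, and ties are broken by taking the first candidate encountered (the slides do not specify tie-breaking).

```python
from collections import defaultdict

# freq. table from the slide:
# query term pair -> {(POS of first term, POS of second term): freq}
TABLE = {
    ("tA", "tB"): {("P1", "P2"): 5, ("P1", "P3"): 3, ("P2", "P4"): 7},
    ("tA", "tC"): {("P1", "P2"): 3, ("P3", "P3"): 4},
    ("tB", "tC"): {("P1", "P2"): 5, ("P2", "P2"): 5},
}


def candidates(table, term):
    """Yield (pos, freq, normalized_freq) for every row that involves `term`."""
    for (t1, t2), rows in table.items():
        if term not in (t1, t2):
            continue
        total = sum(rows.values())
        for (p1, p2), freq in rows.items():
            pos = p1 if term == t1 else p2
            yield pos, freq, freq / total


def max_freq(table, term):
    """MaxFreq: POS tag of the row with the highest raw frequency."""
    return max(candidates(table, term), key=lambda c: c[1])[0]


def most_likelihood(table, term):
    """MostLikelihood: POS tag of the row with the highest normalized freq."""
    return max(candidates(table, term), key=lambda c: c[2])[0]


def all_combi(table, term):
    """AllCombi: POS tag with the highest sum of frequencies over all rows."""
    sums = defaultdict(int)
    for pos, freq, _norm in candidates(table, term):
        sums[pos] += freq
    return max(sums, key=sums.get), dict(sums)


print(max_freq(TABLE, "tA"))         # P2 (freq. 7)
print(most_likelihood(TABLE, "tA"))  # P3 (normalized freq. 0.57)
print(all_combi(TABLE, "tA"))        # P1 with sum 5 + 3 + 3 = 11
```

On this table the three methods disagree for tA: MaxFreq picks P2, MostLikelihood picks P3, and AllCombi picks P1 because P1 accumulates frequency across several lower-ranked rows. That is the "more diversified context" effect the slide describes.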
  • 19. Experiment
       •  Datasets
          o  TREC Web track topics
             •  200 queries from 2009-2012
          o  MS-251
             •  Microsoft search log used in related studies [3][4]
             •  Skipped because the trend is the same
       •  Large-scale Web corpus
          o  ClueWeb09 Category B
             •  50 million Web documents
       •  Evaluated methods
          o  Proposed methods: MaxFreq, MostLikelihood, AllCombi
          o  Existing methods: Stanford, Caseless, SingleFreq
             •  SingleFreq: the most frequently appearing POS tag is assigned
       [3] Bendersky et al.: "Structural Annotation of Search Queries Using Pseudo Relevance Feedback", CIKM2010.
       [4] K. Ganchev et al.: "Using Search-Logs to Improve Query Tagging", ACL2012.
  • 20. POS-tagged Web track topics
       •  AllCombi: the highest precision for all terms, common nouns, and proper nouns
          o  Good at judging nouns
          o  Considering more diversified context is useful
             •  Global statistics from a large-scale Web corpus are useful
       •  MaxFreq and MostLikelihood: the highest among the proposed methods for common nouns, verbs, and adjectives
       •  Every proposed method significantly outperformed Caseless

       Precision        All query terms   Common noun   Proper noun   Verb    Adjective   Sign test vs. Caseless
       MaxFreq          .814              .825          .833          .769    .647        p < 0.05
       MostLikelihood   .814              .825          .833          .769    .647        p < 0.05
       AllCombi         .821              .825          .860          .714    .629        p < 0.01
       Caseless         .763              .789          .751          .733    .690
       SingleFreq       .702              .775          .670          .533    .581
       Stanford         .547              .550          1.0           .722    .451
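The significance column refers to a sign test against Caseless. As a rough sketch of how such a test can be computed, the function below implements a two-sided exact sign test from win/loss counts. The slides do not say whether the comparison unit was a query or a query term, nor whether the test was one- or two-sided, so those choices, the name `sign_test_p_value`, and the example counts are all assumptions.

```python
from math import comb


def sign_test_p_value(wins, losses):
    """Two-sided exact sign test.

    wins:   items where the proposed method is correct and the baseline is not
    losses: items where the baseline is correct and the proposed method is not
    Ties (both correct or both wrong) are assumed to be dropped beforehand.
    """
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts, for illustration only.
print(sign_test_p_value(8, 2))
```

With 8 wins and 2 losses the p-value is 0.109375, which would not be significant at the 0.05 level; the counts behind the reported p-values are not given in the slides.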
  • 21. Effect of the proposed method
       •  AllCombi correctly identified many query terms
       •  Some errors caused by partial grammatical rules still remain
       •  Negative effects of the proposed method
          o  “president” in the corpus is often identified as a proper noun
             •  Need to normalize term weights

       [Table: per-query results, Stanford vs. AllCombi; the cell contents are not preserved in the transcript]
       Queries: “obama”, “india”, “rif carlton”, “lower heart rate”, “gs pay rate”, “president united states”
  • 22. Conclusion
       •  POS tagging of Web queries
          o  Results of sentence-level morphological analysis
          o  Large-scale Web corpus
          o  Three proposed scoring methods
       •  Experiments
          o  Considering more diversified context is useful
          o  The best proposed method differs by POS tag
          o  Outperformed existing tools and existing studies
       •  Future work
          o  Combining the proposed methods may improve accuracy
          o  Database schema design for fast POS tagging
  • 23. Default model
       POS tags           Precision   Recall
       Common noun        .550        .985
       Proper noun        1.0         .010
       Verb               .722        .867
       Adjective          .451        .958
       All query terms    .547        .547

       •  Nearly half of the query terms were assigned correct POS tags
       •  Almost no proper nouns were identified
          o  72% of proper nouns were mistakenly tagged as common nouns
          o  Errors: “obama”, “india”, “ritz carlton”, “discovery channel”
       •  Errors caused by a partial grammatical rule
          o  “lower heart rate”: the verb “lower” is tagged as an adjective (rule: adjectives come before common nouns)
          o  “gs pay rate”: the common noun “pay” is tagged as a verb (rule: verbs come after a subject)
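Per-tag precision and recall like those in the table can be computed from aligned oracle and predicted tags. The sketch below is generic, not the authors' evaluation script: the function name and the tag labels in the example are assumptions, and the tiny example is made up to show the same pattern as the Default model's proper-noun row (high precision, low recall).

```python
from collections import Counter


def precision_recall_by_tag(gold, predicted):
    """Return {tag: (precision, recall)} for aligned gold/predicted tag lists.

    A value is None when its denominator is zero (the tag was never
    predicted, or never appears in the gold annotation)."""
    true_pos, pred_count, gold_count = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        gold_count[g] += 1
        pred_count[p] += 1
        if g == p:
            true_pos[g] += 1
    result = {}
    for tag in set(gold_count) | set(pred_count):
        precision = true_pos[tag] / pred_count[tag] if pred_count[tag] else None
        recall = true_pos[tag] / gold_count[tag] if gold_count[tag] else None
        result[tag] = (precision, recall)
    return result


# Made-up example: one of two proper nouns is tagged as a common noun.
gold = ["proper", "proper", "common", "verb"]
pred = ["common", "proper", "common", "verb"]
print(precision_recall_by_tag(gold, pred))
```

When every query term receives exactly one tag, precision and recall over all terms both reduce to plain accuracy, which is consistent with the identical values in the "All query terms" row.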
  • 24. Caseless model
       •  Precision and recall improved overall
       •  Many proper nouns were identified
          o  31% of proper nouns were mistakenly tagged as common nouns
          o  Precision decreased
       •  The harm of partial grammatical rules still exists
          o  “discovery channel store” (common noun tagged as proper noun)

       POS tags           Precision   Recall
       Common noun        .789        .769
       Proper noun        .751        .640
       Verb               .733        .733
       Adjective          .690        .833
       All query terms    .763        .763
  • 25. MS-251
       •  The trend of the proposed methods is the same
          o  The ratio of POS tags affected the order
             •  AllCombi: good at judging nouns
             •  MaxFreq, MostLikelihood: good at judging verbs and adjectives
          o  The proposed methods are better than [4]

       Method                    Precision
       MaxFreq                   .890
       MostLikelihood            .895
       AllCombi                  .893
       The best method in [4]    .858
       [4] K. Ganchev et al.: "Using Search-Logs to Improve Query Tagging", ACL2012.