Automated building of
taxonomies for search
engines
Boris Galitsky & Greg Makowski
What is a scalable way to
automatically build
taxonomies of entities to
improve search relevance?
Taxonomy construction starts from
seed entities and mines the
web for new entities associated
with them.
To form these new entities, machine
learning of syntactic parse trees
(syntactic generalization) is applied:
it finds commonalities between
various search results for existing
entities on the web.
Taxonomy and syntactic
generalization are applied to
relevance improvement in search
and to text similarity assessment in
a commercial setting; evaluation
results show a substantial
contribution from both sources.
Automated customer service rep.
Q: Can you reactivate my card which I
am trying to use in Nepal?
A: We value you as a customer… We
will cancel your card… New
card will be mailed to your
California address …
A child with severe form of autism
Q: Can you give your candy to my
daughter who is hungry now
and is about to cry?
A: No, my mom told me not to feed
babies. Its wrapper is nice
and blue. I need to wash my
hands before I eat it … …
Entities need to
make sense together
Why ontologies are
needed for search
Human and automated agents have
difficulty processing texts if the
required ontologies are missing
Knowing how entities are connected
would improve search results
The condition “active paddling” is ignored or
misinterpreted, although Google knows that
it is a valid combination (‘paddling’ can be
‘active’)
• In the above example “white water
rafting in Oregon with active
paddling with kids”, active is
meaningless without paddling.
• So if the system cannot find answers
with ‘active paddling’, it should try
‘paddling’, but should not try
‘active’ without ‘paddling’.
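The drop-order constraint above can be sketched as a validity check on relaxed keyword sets. This is a minimal sketch; the modifier-to-head map is a hypothetical hand-built input, not something the slides derive:

```python
def is_valid_relaxation(keywords, modifier_of):
    """A relaxed keyword set is invalid if it keeps a modifier ('active')
    while dropping the head it depends on ('paddling')."""
    return all(head in keywords
               for mod, head in modifier_of.items() if mod in keywords)
```

With deps = {'active': 'paddling'}, dropping 'active' alone or keeping the full pair is valid, while keeping 'active' without 'paddling' is rejected.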
Difficulty in building
taxonomies
Building, tuning and managing taxonomies and ontologies is
rather costly since a lot of manual operations are required.
A number of studies have proposed automated building of
taxonomies based on linguistic resources and/or statistical
machine learning (Kerschberg et al 2003, Liu & Birnbaum
2008, Kozareva et al 2009).
However, most of these approaches have not found practical
applications due to:
– insufficient accuracy of resultant search,
– limited expressiveness of representations of queries of
real users,
– high cost associated with manual construction of
linguistic resources and their limited adjustability.
The main challenge in building a taxonomy tree is to make it
as deep as possible to incorporate longer chains of
relationships, so more specific and more complicated
questions can be answered.
• It is based on an initial set of key entities (a
seed) for a given vertical knowledge domain.
• This seed is then automatically extended by
mining web documents which include a
meaning of the current taxonomy node.
• This node is further extended by entities
which are the results of inductive learning of
commonalities between these documents.
• These commonalities are extracted using
an operation of syntactic generalization,
which finds the common parts of the syntactic
parse trees of a set of documents obtained
for the current taxonomy node.
We propose
automated taxonomy
building mechanism
Therefore, an automated or semi-automated
approach is required for practical applications
Contribution
The proposed taxonomy learning algorithm aims to improve
vertical search relevance and is evaluated in a
number of search-related tasks. The contribution of this
study is three-fold:
• Propose and implement a mechanism for using
taxonomy trees for the deterministic classification of
answers as relevant and irrelevant.
• Implement an algorithm to automate the building of
such a taxonomy for a vertical domain, given a seed of
key entities.
• Design a domain-independent linguistic engine for
finding commonalities/similarities between texts, based
on parse trees, to support the first two items.
A number of currently available general-purpose
resources, such as DBpedia,
Freebase, and Yago, assist entity-related
searches but are insufficient to filter out
irrelevant answers that concern a certain
activity with an entity and its multiple
parameters. A set of vertical ontologies, such
as last.fm for artists, is also helpful for
entity-based searches in vertical domains;
however, their taxonomy trees are rather
shallow, and their usability for recognizing
irrelevant answers is limited.
For a query q with keywords {a b c} and an arbitrary relevant answer A, we define that the query is about b,
is-about(q, {b}), if queries {a b} and {b c} are relevant or marginally relevant to A, and {a c} is irrelevant to
A.
Our definition of query understanding, which is rather narrow, is the ability to say which keywords in the
query are essential (such as b in the above example), so that without them the other query terms
become meaningless. Also, an answer which does not contain b is irrelevant to a query which includes
b.
• For example, is-about({machine, learning, algorithm}, {machine, learning}), is-about({machine,
learning, algorithm}, {algorithm}), is-about({machine, learning, algorithm}, {learning, algorithm}), but
not is-about({machine, learning, algorithm}, {machine})
• For query {a b c d}, if b is essential (is-about({a b c d}, {b})), c can also be essential when b is in the
query, such that {a b c}, {b c d}, {b c} are relevant, even {a b}, {b d} are (marginally) relevant, but {a d}
is not (is-about({a b c d}, {b c})).
Hence for a query {abcd} and two answers (snippets) {bcd…efg} and {acd…efg}, the former is relevant
and the latter is not.
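The filtering consequence of is-about can be sketched directly: an answer qualifies only if it contains every essential query keyword. A minimal sketch, with the essential set {b, c} from the example assumed as given input:

```python
def answer_relevant(answer_words, essential):
    """An answer is relevant only if it contains every essential
    query keyword (the is-about core of the query)."""
    return essential <= set(answer_words)
```

For query {a b c d} with essential core {b, c}, the snippet {b c d … e f g} passes and {a c d … e f g} does not, matching the example above.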
Must-occur keywords
Achieving relevance using a taxonomy is based on a totally different mechanism than a conventional
TF∗IDF-based search.
In the latter, the importance of a term is based on its frequency of occurrence.
For an NL query (not a Boolean query), any term can be omitted in the search result if the rest of the terms
give an acceptable relevance score.
In the case of a Boolean query, this is true for each of its conjunctive members. In taxonomy-based
search, however, we know which terms should occur in the answer and which terms must occur there;
otherwise the search result becomes irrelevant.
Let us consider the totality of keywords in a domain D; these keywords occur in questions and answers in
this domain. There is always a hierarchy among these keywords: some are more important than
others in the sense of the is-about relation.
• Keyword tax is more important than deduction, individual, return, business.
• is-about({tax deduction}, {tax}) but not is-about({tax deduction}, {deduction}), since without the context of
tax, the keyword deduction is ambiguous;
• is-about({individual, tax, return}, {tax, return}) but not is-about({individual, tax, return}, {individual}),
since individual acquires its sense as an adjective only in the context of tax.
• At the same time, the above keywords are more important than partial cases of the situations they
denote, such as submission deadline;
• is-about({individual, tax, return, submission, deadline}, {individual, tax, return}) but not
is-about({individual, tax, return, submission, deadline}, {submission, deadline}),
because submission deadline may refer to something totally different.
Hierarchy of must-occur keywords
We introduce a partial order on the set of subsets of
keywords K1, K2 ∈ 2^D:
K1 > K2 iff is-about(K1 ∪ K2, K1) but not is-about(K1 ∪ K2, K2).
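Given any implementation of is-about, the partial order itself is mechanical. A sketch, where the is-about judgments are a hypothetical hand-coded table built from the tax example (in practice they would come from relevance testing against a search engine):

```python
# Hypothetical hand-coded is-about judgments for the tax example.
IS_ABOUT = {(frozenset({'tax', 'deduction'}), frozenset({'tax'}))}

def is_about(query, keywords):
    return (frozenset(query), frozenset(keywords)) in IS_ABOUT

def more_important(k1, k2):
    """K1 > K2 iff is-about(K1 ∪ K2, K1) and not is-about(K1 ∪ K2, K2)."""
    union = k1 | k2
    return is_about(union, k1) and not is_about(union, k2)
```

With this table, {tax} > {deduction} holds but {deduction} > {tax} does not, matching the hierarchy above.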
We say that a path Tp covers a query Q if the set of keywords
for the nodes of Tp is a super-set of Q. If multiple paths cover
a query Q, producing different intersections Q ∩ Tp, then this
query has multiple meanings in the domain; for each such
meaning a separate set of acceptable answers is expected.
An answer ai ∈ A is acceptable if it includes all essential
(according to is-about) keywords from the query Q, as found in
the taxonomy path Tp ∈ T. For any taxonomy path Tp which
covers the question Q (the intersection of their keywords is not
empty), these intersection keywords must be in the
acceptable answer ai:
∀ Tp ∈ T: Tp ∩ Q ≠ ∅ ⇒ ai ⊇ Tp ∩ Q.
Answer acceptable given
taxonomy
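The acceptability condition can be sketched over keyword sets. A sketch in which taxonomy paths are given as flat keyword sets rather than trees, and multiwords are single tokens:

```python
def acceptable(answer, query, taxonomy_paths):
    """An answer is acceptable iff, for every taxonomy path that
    intersects the query, all intersection keywords occur in it."""
    return all((path & query) <= answer
               for path in taxonomy_paths
               if path & query)
```

On the tax example below, A1 (which mentions tax, return, and extension of time to file) passes this test against the path tax - file-return - extension-of-time, and A2 (about a 'pdf' file extension) fails it.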
For a question
(Q) "When can I file extension of time for my tax
return?"
let us imagine two answers:
• (A1) "You need to file form 1234 to request a 4
month extension of time to file your tax return"
• (A2) "You need to download file with extension
'pdf', print and complete it to do your taxes".
We expect the closest taxonomy path to be:
(T) tax - file-return - extension-of-time.
tax is the main entity, file-return is expected to be in the
seed, and extension-of-time would be the learned
entity; so A1 matches the taxonomy and is an
acceptable answer, while A2 is not.
A question and
two answers
Relevance verification algorithm
input: query Q
output: the best answer abest and the set of acceptable answers Aa
1) For a query Q, obtain a set of candidate answers A by available means (using
keywords, an internal index, or an external search engine API);
2) Find a taxonomy path Tp which covers the maximal number of terms in Q, along with
the other paths which cover Q, to form a set P = {Tp1, Tp2, …}.
Until an acceptable answer is found:
3) Compute the set Tp ∩ Q.
For each answer ai ∈ A:
4) Compute ai ∩ (Tp ∩ Q) and test whether all essential words from the query
which exist in Tp are also in the answer (acceptability test);
5) Compute the similarity score of Q with each ai;
6) Compute the best answer abest and the set of acceptable answers Aa.
If no acceptable answer is found, return to 2) for the next path from P.
7) Return abest and the set of acceptable answers Aa, if available.
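Steps 1)–7) can be sketched over keyword sets, with Jaccard similarity standing in for the syntactic generalization score of step 5. All names here are illustrative, not the authors' implementation:

```python
def jaccard(q, a):
    """Stand-in similarity score (the deck uses syntactic generalization)."""
    return len(q & a) / len(q | a)

def relevance_verification(query, candidates, taxonomy_paths):
    """Return (best answer, acceptable answers) per steps 1-7."""
    # 2) order paths by how many query terms they cover
    paths = sorted(taxonomy_paths, key=lambda p: len(p & query), reverse=True)
    for path in paths:
        essential = path & query                    # 3) Tp ∩ Q
        acceptable = [a for a in candidates         # 4) acceptability test
                      if essential <= a]
        if acceptable:                              # 5)-6) rank by similarity
            best = max(acceptable, key=lambda a: jaccard(query, a))
            return best, acceptable
    return None, []                                 # no acceptable answer found
```

Answers that drop a must-occur keyword from the best-covering path are filtered before any similarity ranking happens, which is the point of the algorithm.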
Providing multiple
answers as a result of
default reasoning
Facts Si comprise the query representation
(occurrences of words in a query).
Default rules establish the
meanings of words based on the
other words and the meanings
that have already been established.
• Successful & closed process: extension @S1, @S2, … ⇒ answer 1
• Successful & closed process: extension @S3, @S1, … ⇒ answer 2
• Either unsuccessful or non-closed process: no extension
Using default logic to handle
ambiguity in search
Building extensions of default theory for each
meaning
A simplified step 1 of ontology learning
Currently available: tax – deduct
1) Get search results for currently available expressions
2) Select attributes based on their linguistic occurrence (shown in
yellow)
3) Find common attributes
(commonalities between
search results, shown in
red, like ‘overlook’).
4) Extend the taxonomy
path by adding newly
acquired attribute
Tax-deduct-overlook
Step 2 of ontology learning (more
details)
Currently available taxonomy path: tax – deduct - overlook
1) Get search results
2) Select attributes based on their linguistic occurrence (modifiers of
entities from the current taxonomy path)
3) Find common expressions between search results as syntactic
generalization, like ‘PRP-mortgage’
4) Extend the taxonomy path
by adding newly acquired
attribute
Tax-deduct-overlook-mortgage,
Tax-deduct-overlook-no_itemize
…
Step 3 of ontology learning
Currently available taxonomy path: tax – deduct – overlook-
mortgage
1) Get search results
2) Perform syntactic generalization, finding common maximal parse
sub-trees excluding the current taxonomy path
3) If there is nothing in common any more, this is a taxonomy leaf (stop
growing the current path).
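The three learning steps repeat until generalization yields nothing new. A minimal sketch with the web search stubbed out; CANNED, web_search, and common_attributes are hypothetical stand-ins for the Yahoo API call and syntactic generalization of search results:

```python
def common_attributes(snippets, path):
    """Words shared by all snippets, minus the current path (a word-level
    stand-in for syntactic generalization of search results)."""
    shared = set.intersection(*(set(s) for s in snippets))
    return sorted(shared - set(path))

def grow_path(path, web_search, max_depth=6):
    """Extend one taxonomy path until nothing is in common (leaf)."""
    while len(path) < max_depth:
        snippets = web_search(path)               # step 1: search current path
        new = common_attributes(snippets, path)   # steps 2-3: commonalities
        if not new:
            break                                 # leaf: stop growing
        path = path + [new[0]]                    # extend with learned attribute
    return path
```

Starting from the seed path tax - deduct, shared occurrences of 'overlook' across search results extend the path, and the loop stops once results no longer share anything beyond the path itself.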
Possible learning results
(taxonomy fragment)
If a keyword is in a query, and in the
closest taxonomy path, it HAS TO BE
in the answer.
Query: can I deduct tax on
mortgage escrow account?
Closest taxonomy path:
tax – deduct – overlook – mortgage – escrow_account
Then the keywords/multiwords
have to be in the answer:
{deduct, tax, mortgage,
escrow_account}
Wrong answers
sell_hobby => [[deductions, collection], [making, collection],
[sales, business, collection], [collectibles, collection], [loss,
hobby, collection], [item, collection], [selling, business,
collection], [pay, collection], [stamp, collection], [deduction,
collection], [car, collection], [sell, business, collection], [loss,
collection]]
benefit => [[office, child, parent], [credit, child, parent],
[credits, child, parent], [support, child, parent], [making, child,
parent], [income, child, parent], [resides, child, parent],
[taxpayer, child, parent], [passed, child, parent], [claiming,
child, parent], [exclusion, child, parent], [surviving, benefits,
child, parent], [reporting, child, parent], …]
hardship => [[apply, undue], [taxpayer, undue], [irs, undue],
[help, undue], [deductions, undue], [credits, undue], [cause,
undue], [means, required, undue], [court, undue], …]
Taxonomy fragment
Improving the precision of text similarity:
articles, blogs, tweets, images and
videos
We verify whether an image
belongs here, based on
its caption.
Using syntactic
generalization to assess
relevance
Generalizing two sentences and
its application
Improvement of search relevance by checking the syntactic similarity between the query
and sentences in search hits. Syntactic similarity is measured via generalization.
Such syntactic similarity is important when a search query contains keywords
which form a phrase, a domain-specific expression, or an idiom, such as “shot to
shot time” or “high number of shots in a short amount of time”.
Based on syntactic similarity, search results can be re-sorted according to the
obtained similarity score.
Based on generalization, we can distinguish meaningful (informative) from
meaningless (uninformative) opinions, having collected respective datasets.
Meaningful
sentence to
be shown as
search result
Not very meaningful
sentence to be shown,
even if matches the
search query
Generalizing sentences & phrases
noun phrase [ [JJ-* NN-zoom NN-* ], [JJ-digital NN-camera ]]
About ZOOM and DIGITAL CAMERA
verb phrase [ [VBP-* ADJP-* NN-zoom NN-camera ], [VB-* NN-zoom IN-* NN-camera ]]
To do something with ZOOM –…- CAMERA
prepositional phrase [ [IN-* NN-camera ], [IN-for NN-* ]]
With/for/to/in CAMERA, FOR something
Obtain parse trees. Group by sub-trees for each phrase type
Extend list of phrases by paraphrasing (semantically equivalent expressions)
For every phrase type
For each pair of tree lists, perform pair-wise generalization
For a pair of trees, perform alignment
For a pair of words (nodes), generalize them
Remove more general trees (if less general exist) from the resultant list
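The innermost step, generalizing a pair of words (nodes), can be sketched on (POS, lemma) pairs. This is a sketch of the rule the examples here use: a shared lemma survives, a differing lemma becomes the wildcard '*', and distinct parts of speech generalize to nothing:

```python
def generalize_nodes(n1, n2):
    """Generalize two (POS, lemma) nodes: the POS must agree; the lemma
    survives only when shared, otherwise it becomes '*'."""
    (pos1, w1), (pos2, w2) = n1, n2
    if pos1 != pos2:
        return None  # different parts of speech generalize to nothing
    return (pos1, w1 if w1 == w2 else '*')
```

So NN-zoom with NN-zoom yields NN-zoom, NN-zoom with NN-camera yields NN-*, and NN-zoom with VB-use yields nothing.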
VP [VB-use DT-the JJ-digital NN-zoom IN-of DT-this NN-camera IN-for VBG-filming NNS-insects ] +
VP [VB-get JJ-short NN-focus NN-zoom NN-lens IN-for JJ-digital NN-camera ]
=
[VB-* JJ-* NN-zoom NN-* IN-for NNS-* ]
score = score(NN) + score(PREP) + 3*score(<POS*>)
Meaning:
“Do-something with some-kind-of ZOOM something FOR
something-else”
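The phrase-level generalization behind examples like the one above can be sketched as an LCS-style alignment over POS tags, keeping a lemma only when both phrases share it. A simplified sketch, not the authors' full parse-tree alignment:

```python
def generalize_phrases(p1, p2):
    """Align two phrases (lists of (POS, lemma)) by longest common
    subsequence over POS tags; matched nodes keep a shared lemma,
    otherwise the '*' wildcard."""
    m, n = len(p1), len(p2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]  # LCS table over POS tags
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if p1[i][0] == p2[j][0]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    out, i, j = [], 0, 0
    while i < m and j < n:                      # walk the alignment
        if p1[i][0] == p2[j][0]:
            lemma = p1[i][1] if p1[i][1] == p2[j][1] else '*'
            out.append((p1[i][0], lemma))
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out
```

On a reduced pair (VB-use NN-zoom IN-for NN-camera vs. VB-get NN-zoom IN-for NN-lens) this yields VB-* NN-zoom IN-for NN-*, the same shape of result as the VP example above.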
Generalizing phrases
Deriving a meaning by generalization
Generalization: from words to
phrases to sentences to paragraphs
Syntactic generalization
helps with microtext when
ontology use is limited
Learning similarity between syntactic
trees
1. Obtain parsing tree for each sentence. For each word (tree node)
we have lemma, part of speech and form of word information, as
well as an arc to the other node.
2. Split sentences into sub-trees which are phrases for each type:
verb, noun, prepositional and others; these sub-trees are
overlapping. The sub-trees are coded so that information about
occurrence in the full tree is retained.
3. All sub-trees are grouped by phrase types.
4. Extend the list of phrases by adding equivalence
transformations; generalize each pair of sub-trees of both
sentences for each phrase type.
5. For each pair of sub-trees yield the alignment, and then generalize
each node for this alignment. For the obtained set of trees
(generalization results), calculate the score.
6. For each pair of sub-trees for phrases, select the set of
generalizations with highest score (least general).
7. Form the sets of generalizations for each phrase type whose
elements are sets of generalizations for this type.
8. Filter the list of generalization results: for the list of
generalizations for each phrase type, exclude more general
elements from the lists of generalizations for a given pair of phrases.
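Steps 3–6 (group sub-trees by phrase type, generalize pairwise, keep the highest-scoring, i.e. least general, results) can be sketched as follows. The POS weights and the positional toy generalizer are hypothetical stand-ins for the authors' score(NN), score(PREP) etc. and for full sub-tree alignment:

```python
WEIGHT = {'NN': 1.0, 'VB': 0.8, 'JJ': 0.4, 'IN': 0.3}  # hypothetical weights

def gen_score(gen):
    """A concrete lemma scores full weight; a '*' placeholder scores less."""
    return sum(WEIGHT.get(pos, 0.2) * (1.0 if lemma != '*' else 0.3)
               for pos, lemma in gen)

def toy_generalize(a, b):
    """Positional stand-in for sub-tree generalization."""
    return [(p1, w1 if w1 == w2 else '*')
            for (p1, w1), (p2, w2) in zip(a, b) if p1 == p2]

def best_generalizations(groups1, groups2):
    """groups: dict phrase-type -> list of phrases [(POS, lemma), ...]."""
    out = {}
    for ptype in groups1.keys() & groups2.keys():    # step 3: by phrase type
        gens = [g for a in groups1[ptype] for b in groups2[ptype]
                if (g := toy_generalize(a, b))]      # step 4: pairwise
        if gens:
            top = max(gen_score(g) for g in gens)    # steps 5-6: least general
            out[ptype] = [g for g in gens if gen_score(g) == top]
    return out
```

The key behavior is that an exact match (NN-zoom vs. NN-zoom) outscores a wildcard match (NN-zoom vs. NN-lens), so the less general result survives the filter.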
Generalization of semantic role
expressions
Generalization algorithm
Evaluation

Method of text similarity      Full-size  Abstracts    Blog      Comments  Images  Videos
assessment / media             news       of articles  postings
Frequencies of terms
in documents                   29.3%      26.1%        31.4%     32.0%     24.1%   25.2%
Syntactic generalization       17.8%      18.4%        20.8%     27.1%     20.1%   19.0%
Taxonomy-based                 45.0%      41.7%        44.9%     52.3%     44.8%   43.1%
Hybrid (taxonomy + syntactic)  13.2%      13.6%        15.5%     22.1%     18.2%   18.0%
Hybrid approach improves text
similarity/relevance assessment
Ordering of search results based on
generalization, taxonomy, and conventional
search engine
Classification of short texts
Evaluation in vertical search domain
Columns: relevancy of baseline Yahoo search; relevancy of baseline Bing search;
relevancy of re-sorting by generalization; relevancy of re-sorting by using taxonomy;
relevancy of re-sorting by using taxonomy and generalization; relevancy improvement
for the hybrid approach compared to baseline (averaged for Bing & Yahoo).
All relevancy values are percentages, averaged over 20 searches.

Query phrase sub-type          Yahoo  Bing   Gen.   Taxon. Tax.+gen. Improv.
3-4 word phrases
  noun phrase                  86.7   85.4   87.1   93.5   93.6      1.088
  verb phrase                  83.4   82.9   79.9   92.1   92.8      1.116
  how-to expression            76.7   78.2   79.5   93.4   93.3      1.205
  average                      82.3   82.2   82.2   93.0   93.2      1.134
5-10 word phrases
  noun phrase                  84.1   84.9   87.3   91.7   92.1      1.090
  verb phrase                  83.5   82.7   86.1   92.4   93.4      1.124
  how-to expression            82.0   82.9   82.1   88.9   91.6      1.111
  average                      83.2   83.5   85.2   91.0   92.4      1.108
2-3 sentences
  one verb one noun phrase     68.8   67.6   69.1   81.2   83.1      1.218
  both verb phrases            66.3   67.1   71.2   77.4   78.3      1.174
  one sentence of how-to type  66.1   68.3   73.2   79.2   80.9      1.204
  average                      67.1   67.7   71.2   79.3   80.8      1.199
The hybrid (taxonomy + generalization) re-sorting is the focus of this study.
The higher the complexity of the query, the stronger the contribution of the
hybrid system.
OpenNLP Contribution
There are four Java classes for building and running a taxonomy:
TaxonomyExtenderSearchResultFromYahoo.java performs web mining by
taking the current taxonomy path, submitting the formed keywords to the Yahoo API web
search, obtaining snippets and possibly fragments of webpages, and extracting
commonalities between them to add the next node to the taxonomy. Various
machine learning components for forming commonalities will be integrated in
future versions, maintaining hypotheses in various ways.
TaxoQuerySnapshotMatcher.java is used at run time to obtain a taxonomy-based
relevance score between a question and an answer.
TaxonomySerializer.java is used to write a taxonomy in a specified format: binary,
text or XML.
AriAdapter.java is used to import seed taxonomy data from a PROLOG
ontology; future versions of the taxonomy builder will support more seed formats
and options.
Related Work
• Mapping to First Order Logic representations with a general prover and without
using acquired rich knowledge sources
• Semantic entailment [de Salvo Braz et al 2005]
• Semantic Role Labeling: for each verb in a sentence, the goal is to identify all
constituents that fill a semantic role, and to determine their roles, such as Agent,
Patient or Instrument [Punyakanok et al 2005]
• Generic semantic inference framework that operates directly on syntactic trees.
New trees are inferred by applying entailment rules, which provide a unified
representation for varying types of inferences [Bar-Haim et al 2005]
• Generic paraphrase-based approach for a specific case such as relation extraction,
to obtain a generic configuration for relations between objects from text [Romano et
al 2006]
Conclusions
Ontologies are a more sensitive way to match
keywords (compared to bag-of-words and TF*IDF).
When text for indexing includes abbreviations and
acronyms, and we don’t ‘know’ all mappings,
semantic analysis should be tolerant to the omission of
some entities and still understand “what this text
fragment is about”.
Since we are unable to filter out noise “statistically”,
as most NLP environments do, we have to rely on
ontologies.
Syntactic generalization takes bag-of-words and
pattern-matching classes of approaches to the next
level, allowing unknown words to be treated systematically
as long as their part-of-speech information is
available from context.
We proposed a taxonomy building mechanism
for a vertical domain, extending an approach
where a taxonomy is formed based on:
- specific semantic rules,
- specific semantic templates or
- a limited corpus of texts.
Relying on web search engine API for taxonomy
construction, we are leveraging not only the
whole web universe of texts, but also the
meanings formed by search engines as a result
of learning from user search sessions.
When a user selects certain search results, a
web search engine acquires a set of
associations between entities in questions and
entities in answers. These associations are then
used by our taxonomy learning process to find
adequate parameters for the entities being learned
at the current taxonomy building step.
amitlee9823
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
amitlee9823
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
amitlee9823
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
amitlee9823
 

Recently uploaded (20)

CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 

Automated building of taxonomies for search engines

  • 1. Automated building of taxonomies for search engines. Boris Galitsky & Greg Makowski.
  • 2. What can be a scalable way to automatically build a taxonomy of entities to improve search relevance?
Taxonomy construction starts from seed entities and mines the web for new entities associated with them. To form these new entities, machine learning of syntactic parse trees (syntactic generalization) is applied: it forms commonalities between various search results for existing entities on the web. Taxonomy and syntactic generalization are applied to relevance improvement in search and to text similarity assessment in a commercial setting; evaluation results show a substantial contribution from both sources.
Why ontologies are needed for search: entities need to make sense together. Human and automated agents have difficulty processing texts if the required ontologies are missing.
Automated customer service rep.
Q: Can you reactivate my card, which I am trying to use in Nepal?
A: We value you as a customer… We will cancel your card… A new card will be mailed to your California address…
A child with a severe form of autism
Q: Can you give your candy to my daughter, who is hungry now and is about to cry?
A: No, my mom told me not to feed babies. Its wrapper is nice and blue. I need to wash my hands before I eat it…
  • 3. Knowing how entities are connected would improve search results.
The condition “active paddling” is ignored or misinterpreted, although Google knows that it is a valid combination (‘paddling’ can be ‘active’).
• In the example “white water rafting in Oregon with active paddling with kids”, ‘active’ is meaningless without ‘paddling’.
• So if the system cannot find answers with ‘active paddling’, it should try finding answers with ‘paddling’, but it should not try ‘active’ without ‘paddling’.
  • 4. Difficulty in building taxonomies
Building, tuning and managing taxonomies and ontologies is rather costly, since a lot of manual operations are required. A number of studies have proposed automated building of taxonomies based on linguistic resources and/or statistical machine learning (Kerschberg et al 2003, Liu & Birnbaum 2008, Kozareva et al 2009). However, most of these approaches have not found practical applications due to:
– insufficient accuracy of the resultant search,
– limited expressiveness of representations of queries of real users,
– high cost associated with manual construction of linguistic resources and their limited adjustability.
Therefore an automated or semi-automated approach is required for practical applications.
The main challenge in building a taxonomy tree is to make it as deep as possible, to incorporate longer chains of relationships, so that more specific and more complicated questions can be answered.
We propose an automated taxonomy building mechanism:
• It is based on an initial set of key entities (a seed) for a given vertical knowledge domain.
• This seed is then automatically extended by mining web documents which include a meaning of the current taxonomy node.
• This node is further extended by entities which are the results of inductive learning of commonalities between these documents.
• These commonalities are extracted using an operation of syntactic generalization, which finds the common parts of syntactic parse trees of a set of documents obtained for the current taxonomy node.
  • 5. Contribution
The proposed taxonomy learning algorithm aims to improve vertical search relevance and is evaluated in a number of search-related tasks. The contribution of this study is three-fold:
1. Propose and implement a mechanism for using taxonomy trees for the deterministic classification of answers as relevant and irrelevant.
2. Implement an algorithm to automate the building of such a taxonomy for a vertical domain, given a seed of key entities.
3. Design a domain-independent linguistic engine for finding commonalities/similarities between texts, based on parse trees, to support 1 and 2.
A number of currently available general-purpose resources, such as DBpedia, Freebase, and Yago, assist entity-related searches but are insufficient to filter out irrelevant answers that concern a certain activity with an entity and its multiple parameters. A set of vertical ontologies, such as last.fm for artists, are also helpful for entity-based searches in vertical domains; however, their taxonomy trees are rather shallow, and their usability for recognizing irrelevant answers is limited.
  • 6. Must-occur keywords
For a query q with keywords {a b c} and an arbitrary relevant answer A, we define that the query is about b, is-about(q, {b}), if queries {a b} and {b c} are relevant or marginally relevant to A, and {a c} is irrelevant to A. Our definition of query understanding, which is rather narrow, is the ability to say which keywords in the query are essential (such as b in the above example), so that without them the other query terms become meaningless. Also, an answer which does not contain b is irrelevant to the query which includes b.
• For example, is-about({machine, learning, algorithm}, {machine, learning}), is-about({machine, learning, algorithm}, {algorithm}), and is-about({machine, learning, algorithm}, {learning, algorithm}) hold, but is-about({machine, learning, algorithm}, {machine}) does not.
• For a query {a b c d}, if b is essential (is-about({a b c d}, {b})), then c can also be essential when b is in the query, such that {a b c}, {b c d}, {b c} are relevant, even {a b}, {b d} are (marginally) relevant, but {a d} is not (is-about({a b c d}, {b c})). Hence for a query {a b c d} and two answers (snippets) {b c d … e f g} and {a c d … e f g}, the former is relevant and the latter is not.
Achieving relevancy using a taxonomy is based on a totally different mechanism than conventional TF∗IDF-based search. In the latter, the importance of terms is based on the frequency of their occurrence. For an NL query (not a Boolean query), any term can be omitted in the search result if the rest of the terms give an acceptable relevancy score; in the case of a Boolean query, this is true for each of its conjunctive members. In taxonomy-based search, however, we know which terms should occur in the answer and which terms must occur there; otherwise, the search result becomes irrelevant.
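The three-keyword case of the is-about definition can be stated directly as a predicate. A minimal sketch (not the authors' implementation), assuming a hypothetical relevance oracle `relevant` that judges sub-queries against a fixed, imagined answer:

```python
# Sketch of the is-about test for a 3-keyword query {a, b, c}: the query is
# about b iff {a, b} and {b, c} stay (at least marginally) relevant to the
# answer while {a, c} is irrelevant.

def is_about(a, b, c, relevant):
    return relevant({a, b}) and relevant({b, c}) and not relevant({a, c})

# Hypothetical relevance oracle: a sub-query is relevant to our imagined
# answer only if it retains the keyword "learning".
relevant = lambda subquery: "learning" in subquery

print(is_about("machine", "learning", "algorithm", relevant))  # True: learning is essential
print(is_about("learning", "machine", "algorithm", relevant))  # False: machine is not
```

In practice the oracle would come from relevance judgments over real answers; the lambda above only encodes the slide's machine-learning example.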
  • 7. Hierarchy of must-occur keywords
Let us consider the totality of keywords in a domain D; these keywords occur in questions and answers in this domain. There is always a hierarchy among these keywords: some are always more important than others in the sense of the is-about relation.
• The keyword tax is more important than deduction, individual, return, business.
• is-about({tax, deduction}, {tax}) but not is-about({tax, deduction}, {deduction}), since without the context of tax the keyword deduction is ambiguous.
• is-about({individual, tax, return}, {tax, return}) but not is-about({individual, tax, return}, {individual}), since individual acquires its sense (as an adjective) only in the context of tax.
• At the same time, the above keywords are more important than partial cases of the situations they denote, such as submission deadline:
• is-about({individual, tax, return, submission, deadline}, {individual, tax, return}) but not is-about({individual, tax, return, submission, deadline}, {submission, deadline}), because submission deadline may refer to something totally different.
We introduce a partial order on the set of subsets of keywords K1, K2 ∈ 2^D: K1 > K2 iff is-about(K1 ∪ K2, K1) but not is-about(K1 ∪ K2, K2).
We say that a path Tp covers a query Q if the set of keywords for the nodes of Tp is a superset of Q. If multiple paths cover a query Q, producing different intersections Q ∩ Tp, then this query has multiple meanings in the domain; for each such meaning, a separate set of acceptable answers is expected.
  • 8. Answer acceptable given taxonomy
An answer ai ∈ A is acceptable if it includes all essential (according to is-about) keywords from the query Q, as found in the taxonomy path Tp ∈ T. For any taxonomy path Tp which covers the question Q (the intersection of their keywords is not empty), these intersection keywords must be in the acceptable answer ai:
∀ Tp ∈ T: Tp ∩ Q ≠ ∅ ⇒ ai ⊇ (Tp ∩ Q).
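The acceptability condition reduces to a set test once query, answer and taxonomy paths are represented as keyword sets. A minimal sketch; the tax-domain path and the two toy answers are invented for illustration:

```python
# Sketch of the taxonomy-based acceptability test: for every taxonomy path
# that covers part of the query, the covered keywords (Tp ∩ Q) must all
# occur in the answer. All inputs are plain keyword sets.

def acceptable(query, answer, taxonomy_paths):
    for path in taxonomy_paths:
        essential = path & query            # Tp ∩ Q
        if essential and not essential <= answer:
            return False                    # a must-occur keyword is missing
    return True

# Hypothetical taxonomy path for the tax-return example.
paths = [{"tax", "file", "return", "extension", "time"}]

q = {"file", "extension", "time", "tax", "return"}
a1 = {"file", "form", "request", "extension", "time", "tax", "return"}
a2 = {"download", "file", "extension", "pdf", "print", "taxes"}
print(acceptable(q, a1, paths))  # True: all essential keywords present
print(acceptable(q, a2, paths))  # False: 'tax', 'return', 'time' are missing
```

A real system would extract the keyword sets from parse trees and match lemmas rather than surface forms.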
  • 9. A question and two answers
For a question (Q) “When can I file an extension of time for my tax return?”, let us imagine two answers:
• (A1) “You need to file form 1234 to request a 4 month extension of time to file your tax return.”
• (A2) “You need to download a file with extension ‘pdf’, print and complete it to do your taxes.”
We expect the closest taxonomy path to be: (T) tax – file-return – extension-of-time. Here tax is a main entity, file-return we expect to be in the seed, and extension-of-time would be the learned entity. So A1 matches the taxonomy and is an acceptable answer, while A2 is not.
  • 10. Relevance verification algorithm
Input: query Q
Output: the best answer abest and the set of acceptable answers Aa
1) For a query Q, obtain a set of candidate answers A by available means (using keywords, an internal index, or an external index of a search engine’s API);
2) Find a path of the taxonomy Tp which covers the maximal number of terms in Q, along with the other paths which cover Q, to form a set P = {Tp1, Tp2, …}.
Until an acceptable answer is found:
3) Compute the set Tp ∩ Q. For each answer ai ∈ A:
4) Compute ai ∩ (Tp ∩ Q) and test whether all essential words from the query which exist in Tp are also in the answer (acceptability test);
5) Compute the similarity score of Q with each ai;
6) Compute the best answer abest and the set of acceptable answers Aa. If no acceptable answer is found, return to 2) for the next path from P;
7) Return abest and the set of acceptable answers Aa, if available.
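The steps above can be sketched as follows. Candidate retrieval and the similarity scorer are hypothetical stubs (a real system would use a search API and the syntactic generalization score); only the taxonomy logic is shown:

```python
# Sketch of the relevance verification loop: try covering taxonomy paths in
# order of coverage, keep answers that pass the acceptability test, and pick
# the best by a (stubbed) similarity score.

def covering_paths(query, taxonomy_paths):
    # Paths ordered by how many query terms they cover (step 2).
    return sorted((p for p in taxonomy_paths if p & query),
                  key=lambda p: len(p & query), reverse=True)

def verify(query, candidates, taxonomy_paths, score):
    for path in covering_paths(query, taxonomy_paths):
        essential = path & query                         # step 3: Tp ∩ Q
        accepted = [a for a in candidates
                    if essential <= a["words"]]          # step 4: acceptability
        if accepted:
            best = max(accepted, key=lambda a: score(query, a))  # steps 5-6
            return best, accepted                        # step 7
    return None, []                                      # no acceptable answer

score = lambda q, a: len(q & a["words"])  # stub similarity score

paths = [{"tax", "deduct", "overlook", "mortgage"}]
candidates = [
    {"id": "A1", "words": {"deduct", "tax", "mortgage", "property"}},
    {"id": "A2", "words": {"deduct", "income", "donation"}},
]
best, accepted = verify({"deduct", "tax", "mortgage"}, candidates, paths, score)
print(best["id"])  # A1
```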
  • 11. Using default logic to handle ambiguity in search
Providing multiple answers as a result of default reasoning, by building extensions of a default theory for each meaning:
• Facts Si, comprising the query representation (occurrences of words in a query);
• Default rules, establishing the meanings of words based on the other words and the meanings that have already been established.
Successful & closed process: extension @S1, @S2, … → answer 1
Successful & closed process: extension @S3, @S1, … → answer 2
Either unsuccessful or non-closed process: no extension
  • 12. A simplified step 1 of ontology learning
Currently available taxonomy path: tax – deduct
1) Get search results for currently available expressions;
2) Select attributes based on their linguistic occurrence (shown in yellow);
3) Find common attributes (commonalities between search results, shown in red, like ‘overlook’);
4) Extend the taxonomy path by adding the newly acquired attribute: tax – deduct – overlook.
  • 13. Step 2 of ontology learning (more details)
Currently available taxonomy path: tax – deduct – overlook
1) Get search results;
2) Select attributes based on their linguistic occurrence (modifiers of entities from the current taxonomy path);
3) Find common expressions between search results via syntactic generalization, like ‘PRP-mortgage’;
4) Extend the taxonomy path by adding newly acquired attributes: tax – deduct – overlook – mortgage, tax – deduct – overlook – no_itemize, …
  • 14. Step 3 of ontology learning
Currently available taxonomy path: tax – deduct – overlook – mortgage
1) Get search results;
2) Perform syntactic generalization, finding common maximal parse sub-trees, excluding the current taxonomy path;
3) If there is nothing in common any more, this is a taxonomy leaf (stop growing the current path).
Possible learning results (taxonomy fragment).
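A single extension step of this loop might look as below. The web search is replaced by canned snippets, and full parse-tree generalization is approximated by plain word overlap, so this only illustrates the control flow of steps 1–3, not the authors' implementation:

```python
# Sketch of one taxonomy-extension step: mine "search results" for the
# current path, keep words common to all snippets (minus the path itself),
# and either grow the path or declare a leaf.

STOPWORDS = {"the", "a", "of", "to", "your", "on", "is", "and", "for"}

def common_attributes(snippets, current_path):
    """Words shared by all snippets, excluding stopwords and the path."""
    word_sets = [set(s.lower().split()) - STOPWORDS for s in snippets]
    common = set.intersection(*word_sets)
    return common - set(current_path)

def extend_path(path, search):
    snippets = search(path)                # stub for a web search API call
    attrs = common_attributes(snippets, path)
    if not attrs:                          # nothing in common: a taxonomy leaf
        return [path]
    return [path + [a] for a in sorted(attrs)]

# Canned "search results" for the path tax - deduct - overlook.
search = lambda path: [
    "commonly overlooked tax deductions on mortgage interest",
    "overlook a deduction for mortgage points on your tax return",
]
print(extend_path(["tax", "deduct", "overlook"], search))
# [['tax', 'deduct', 'overlook', 'mortgage']]
```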
  • 15. If a keyword is in a query, and in the closest taxonomy path, it HAS TO BE in the answer.
Query: “Can I deduct tax on mortgage escrow account?”
Closest taxonomy path: tax – deduct – overlook – mortgage – escrow_account
Then these keywords/multiwords have to be in the answer: {deduct, tax, mortgage escrow_account}
Wrong answers
  • 16. Taxonomy fragment
sell_hobby => [[deductions, collection], [making, collection], [sales, business, collection], [collectibles, collection], [loss, hobby, collection], [item, collection], [selling, business, collection], [pay, collection], [stamp, collection], [deduction, collection], [car, collection], [sell, business, collection], [loss, collection]]
benefit => [[office, child, parent], [credit, child, parent], [credits, child, parent], [support, child, parent], [making, child, parent], [income, child, parent], [resides, child, parent], [taxpayer, child, parent], [passed, child, parent], [claiming, child, parent], [exclusion, child, parent], [surviving, benefits, child, parent], [reporting, child, parent], …]
hardship => [[apply, undue], [taxpayer, undue], [irs, undue], [help, undue], [deductions, undue], [credits, undue], [cause, undue], [means, required, undue], [court, undue], …]
  • 17. Improving the precision of text similarity: articles, blogs, tweets, images and videos
Using syntactic generalization to assess relevance: we verify whether an image belongs here based on its caption.
  • 18. Generalizing two sentences and its application
Search relevance is improved by checking syntactic similarity between the query and sentences in search hits; this syntactic similarity is measured via generalization. Such similarity is important when a search query contains keywords which form a phrase, a domain-specific expression, or an idiom, such as “shot to shot time” or “high number of shots in a short amount of time”. Based on the syntactic similarity, search results can be re-sorted using the obtained similarity score.
Based on generalization, we can also distinguish meaningful (informative) from meaningless (uninformative) opinions, having collected the respective datasets: a meaningful sentence should be shown as a search result, while a sentence that is not very meaningful should not be shown, even if it matches the search query.
  • 19. Generalizing sentences & phrases: deriving a meaning by generalization
Generalization proceeds from words to phrases to sentences to paragraphs. Syntactic generalization helps with microtext when ontology use is limited.
Generalizing phrases:
• noun phrase [[JJ-* NN-zoom NN-*], [JJ-digital NN-camera]] — about ZOOM and DIGITAL CAMERA
• verb phrase [[VBP-* ADJP-* NN-zoom NN-camera], [VB-* NN-zoom IN-* NN-camera]] — to do something with ZOOM – … – CAMERA
• prepositional phrase [[IN-* NN-camera], [IN-for NN-*]] — with/for/to/in CAMERA, FOR something
Procedure: obtain parse trees; group sub-trees by phrase type; extend the list of phrases by paraphrasing (semantically equivalent expressions); for every phrase type and each pair of tree lists, perform pair-wise generalization; for a pair of trees, perform alignment; for a pair of words (nodes), generalize them; remove more general trees (if less general ones exist) from the resultant list.
Example:
VP [VB-use DT-the JJ-digital NN-zoom IN-of DT-this NN-camera IN-for VBG-filming NNS-insects]
+ VP [VB-get JJ-short NN-focus NN-zoom NN-lens IN-for JJ-digital NN-camera]
= [VB-* JJ-* NN-zoom NN-* IN-for NNS-*]
score = score(NN) + score(PREP) + 3*score(<POS*>)
Meaning: “Do-something with some-kind-of ZOOM something FOR something-else”
  • 20. Learning similarity between syntactic trees: generalization algorithm
1. Obtain a parsing tree for each sentence. For each word (tree node) we have its lemma, part of speech and word form, as well as an arc to the other node.
2. Split sentences into sub-trees which are phrases of each type: verb, noun, prepositional and others; these sub-trees are overlapping. The sub-trees are coded so that information about their occurrence in the full tree is retained.
3. Group all sub-trees by phrase type.
4. Extend the list of phrases by adding equivalence transformations.
5. Generalize each pair of sub-trees of both sentences for each phrase type: for each pair of sub-trees, yield the alignment, then generalize each node for this alignment. For the obtained set of trees (generalization results), calculate the score.
6. For each pair of sub-trees for phrases, select the set of generalizations with the highest score (least general).
7. Form the sets of generalizations for each phrase type, whose elements are sets of generalizations for this type.
8. Filter the list of generalization results: for the list of generalizations for each phrase type, exclude more general elements from the lists of generalizations for a given pair of phrases.
Generalization of semantic role expressions
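A toy version of node-level generalization with alignment can be sketched as follows. The greedy LCS-style walk and the scoring weights are simplifications introduced for illustration, not the authors' implementation:

```python
# Sketch of pairwise phrase generalization over POS-tagged tokens
# (pos, lemma): matching lemmas are kept, matching parts of speech with
# different lemmas become a 'POS-*' wildcard, and the two phrases are
# aligned with a greedy longest-common-subsequence walk.

from functools import lru_cache

def gen_node(a, b):
    (pos_a, lem_a), (pos_b, lem_b) = a, b
    if pos_a != pos_b:
        return None                         # different POS: no generalization
    return (pos_a, lem_a if lem_a == lem_b else "*")

def generalize(p, q):
    p, q = tuple(p), tuple(q)

    @lru_cache(maxsize=None)
    def walk(i, j):
        if i == len(p) or j == len(q):
            return ()
        g = gen_node(p[i], q[j])
        if g is not None:                   # greedy: take the match
            return (g,) + walk(i + 1, j + 1)
        left, right = walk(i + 1, j), walk(i, j + 1)
        return left if len(left) >= len(right) else right

    return list(walk(0, 0))

def score(gen, weights={"NN": 1.0, "VB": 0.8, "JJ": 0.5}, wildcard=0.3):
    # Assumed weights: shared lemmas count fully, POS wildcards much less.
    return sum(wildcard if lem == "*" else weights.get(pos, 0.5)
               for pos, lem in gen)

p = [("VB", "use"), ("JJ", "digital"), ("NN", "zoom")]
q = [("VB", "get"), ("JJ", "short"), ("NN", "zoom")]
g = generalize(p, q)
print(g)  # [('VB', '*'), ('JJ', '*'), ('NN', 'zoom')]
```

The result keeps the shared entity (zoom) and abstracts the rest, matching the spirit of the slide 19 example; a full implementation would align whole parse sub-trees rather than flat token lists.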
  • 21. Evaluation: the hybrid approach improves text similarity/relevance assessment
(Ordering of search results based on generalization, taxonomy, and a conventional search engine; classification of short texts.)
Method of text similarity assessment | Full-size news articles | Abstracts of articles | Blog postings | Comments | Images | Videos
Frequencies of terms in documents | 29.3% | 26.1% | 31.4% | 32.0% | 24.1% | 25.2%
Syntactic generalization | 17.8% | 18.4% | 20.8% | 27.1% | 20.1% | 19.0%
Taxonomy-based | 45.0% | 41.7% | 44.9% | 52.3% | 44.8% | 43.1%
Hybrid (taxonomy + syntactic) | 13.2% | 13.6% | 15.5% | 22.1% | 18.2% | 18.0%
  • 22. Evaluation in a vertical search domain
All relevancy figures are in %, averaged over 20 searches. Columns: baseline Yahoo | baseline Bing | re-sorting by generalization | re-sorting by taxonomy | re-sorting by taxonomy + generalization | relevancy improvement for the hybrid approach vs. baseline (averaged over Bing & Yahoo).
3-4 word phrases:
noun phrase | 86.7 | 85.4 | 87.1 | 93.5 | 93.6 | 1.088
verb phrase | 83.4 | 82.9 | 79.9 | 92.1 | 92.8 | 1.116
how-to expression | 76.7 | 78.2 | 79.5 | 93.4 | 93.3 | 1.205
average | 82.3 | 82.2 | 82.2 | 93.0 | 93.2 | 1.134
5-10 word phrases:
noun phrase | 84.1 | 84.9 | 87.3 | 91.7 | 92.1 | 1.090
verb phrase | 83.5 | 82.7 | 86.1 | 92.4 | 93.4 | 1.124
how-to expression | 82.0 | 82.9 | 82.1 | 88.9 | 91.6 | 1.111
average | 83.2 | 83.5 | 85.2 | 91.0 | 92.4 | 1.108
2-3 sentences:
one verb one noun phrases | 68.8 | 67.6 | 69.1 | 81.2 | 83.1 | 1.218
both verb phrases | 66.3 | 67.1 | 71.2 | 77.4 | 78.3 | 1.174
one sentence of how-to type | 66.1 | 68.3 | 73.2 | 79.2 | 80.9 | 1.204
average | 67.1 | 67.7 | 71.2 | 79.3 | 80.8 | 1.199
This is the focus of this study: the higher the complexity of the query, the stronger the contribution of the hybrid system.
  • 23. OpenNLP contribution
There are four Java classes for building and running the taxonomy:
• TaxonomyExtenderSearchResultFromYahoo.java performs web mining: it takes the current taxonomy path, submits the formed keywords to the Yahoo API web search, obtains snippets (and possibly fragments of web pages), and extracts commonalities between them to add the next node to the taxonomy. Various machine learning components for forming commonalities will be integrated in future versions, maintaining hypotheses in various ways.
• TaxoQuerySnapshotMatcher.java is used in real time to obtain a taxonomy-based relevance score between a question and an answer.
• TaxonomySerializer.java is used to write the taxonomy in a specified format: binary, text or XML.
• AriAdapter.java is used to import seed taxonomy data from a PROLOG ontology; future versions of the taxonomy builder will support more seed formats and options.
  • 24. Related work
• Mapping to first-order logic representations with a general prover, without using acquired rich knowledge sources.
• Semantic entailment (de Salvo Braz et al 2005).
• Semantic role labeling: for each verb in a sentence, the goal is to identify all constituents that fill a semantic role, and to determine their roles, such as Agent, Patient or Instrument (Punyakanok et al 2005).
• A generic semantic inference framework that operates directly on syntactic trees; new trees are inferred by applying entailment rules, which provide a unified representation for varying types of inferences (Bar-Haim et al 2005).
• A generic paraphrase-based approach for a specific case such as relation extraction, to obtain a generic configuration for relations between objects from text (Romano et al 2006).
  • 25. Conclusions
Ontologies are a more sensitive way to match keywords than bag-of-words and TF*IDF. When the text to be indexed includes abbreviations and acronyms, and we do not ‘know’ all the mappings, semantic analysis should be tolerant to omissions of some entities and still understand “what this text fragment is about”. Since we are unable to filter out noise “statistically”, as most NLP environments do, we have to rely on ontologies. Syntactic generalization takes the bag-of-words and pattern-matching classes of approaches to the next level, allowing unknown words to be treated systematically as long as their part-of-speech information is available from context.
We proposed a taxonomy building mechanism for a vertical domain, extending approaches where a taxonomy is formed based on:
– specific semantic rules,
– specific semantic templates, or
– a limited corpus of texts.
By relying on a web search engine API for taxonomy construction, we leverage not only the whole web universe of texts, but also the meanings formed by search engines as a result of learning from user search sessions. When a user selects certain search results, a web search engine acquires a set of associations between entities in questions and entities in answers. These associations are then used by our taxonomy learning process to find adequate parameters for the entities being learned at the current taxonomy building step.