14. Michael Oakes (UoW) Natural Language Processing for Translation
(from EBMT examples)
Michael P. Oakes
University of Wolverhampton
Birmingham Winter School, 2013
Finding Out About (FOA)
“Finding Out About”, by Richard Belew (2000),
Cambridge University Press.
Finding Out About (FOA) - research activities that
allow a decision maker to draw on others’ knowledge,
especially the WWW = “Information Retrieval”.
A library (or WWW) contains many books (“documents”)
on many topics. The authors are typically people far
from our time or place, using language similar but not
identical to our own. We must FOA a topic of special
interest, looking only for those things which are
relevant to our search. This basic skill is a fundamental
part of an academic’s job.
FOA has 3 phases:
asking a question
constructing an answer
assessing an answer
1. Asking a question
Users of a search engine may be aware of a specific
gap in their knowledge, and are motivated to fill it
(meta-cognition about ignorance). They may not be able
to articulate their knowledge gap. Forming a clearly
posed question is the hardest part of answering it! This
common cognitive state is the user’s information need.
Users try to take their ill-defined, internal cognitive state
and turn it into an external expression of their question.
This external expression is called the query, and the
language in which it is constructed the query language.
2. Constructing an answer
A human question-answerer might consider:
can they translate the user’s ill-formed query into a
do they know the answer themselves?
are they able to verbalise this answer in terms the user will understand?
can they provide the necessary background knowledge
for the user to understand the answer itself?
Current search engines are slightly more limited in
scope. The search engine has available to it only a
pre-existing set of “canned” texts (although this may be very
large), and its response is limited to identifying one or
more of these passages and presenting them to the user.
3. Assessing the answer
We would normally give feedback to a human answerer,
e.g. “That isn’t what I meant”, “Let me ask it another
way”, “That helps, but I still have this problem” or “What
does that have to do with anything?”. So we “close the
loop” when the user provides an assessment of how
relevant they found the answer provided. In an
automatic system this is relevance feedback - the user
reacts to each retrieved document as “relevant”, “not
relevant” or “neutral”.
See fig. 1.4: the three steps in a computerised,
algorithmic context, i.e. information retrieval.
The fundamental operation performed by a search
engine is a match between descriptive features
mentioned by users in their queries, and documents
sharing those features. By far the most important kind
of feature is the keyword.
Keywords are linguistic atoms - typically words, pieces
of words, or phrases - used to characterise the subject
or content of the document.
They are pivotal because they must bridge the gap
between the users’ characterisation of their information
need (queries) and the characterisation of the
documents’ topical focus against which these will be matched.
Contrast natural language queries with bag-of-words
Keywords as document descriptors
Keywords are also used as document descriptors.
Indexing is the process of associating one or
more keywords with each document.
The vocabulary used can either be controlled
or uncontrolled (also known as closed or
open). If we organise a conference, and ask
the authors of each paper to index it manually
using only terms on a fixed list of potential
keywords, this is a closed vocabulary.
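As a minimal sketch of indexing, the function below associates keywords with documents and optionally enforces a closed vocabulary. The names `build_index` and `CONTROLLED_VOCABULARY`, the keyword list and the two toy papers are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical closed vocabulary: the fixed list of keywords authors may use.
CONTROLLED_VOCABULARY = {"stemming", "indexing", "translation", "retrieval"}

def build_index(docs, vocabulary=None):
    """Map each keyword to the set of document ids it describes.

    docs: dict of doc_id -> list of keywords proposed for that document.
    vocabulary: if given, keywords outside it are dropped (closed vocabulary);
                if None, any keyword is accepted (open vocabulary).
    """
    index = defaultdict(set)
    for doc_id, keywords in docs.items():
        for keyword in keywords:
            keyword = keyword.lower()
            if vocabulary is None or keyword in vocabulary:
                index[keyword].add(doc_id)
    return dict(index)

# Toy conference papers with author-supplied keywords.
papers = {
    "paper1": ["Stemming", "Retrieval", "football"],
    "paper2": ["Translation", "Retrieval"],
}
closed_index = build_index(papers, CONTROLLED_VOCABULARY)  # "football" is rejected
open_index = build_index(papers)                           # "football" is kept
```

With the closed vocabulary, both papers are found under "retrieval", while the off-list keyword "football" only survives in the open index.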
Query syntax. A typical search engine
query consists of 2 to 3 words.
Queries defined only as sets of keywords
are simple queries - most search engines
use this “bag of words” approach.
Other possibilities exist e.g. Boolean
operators and / or / not e.g. “neural
networks AND speech recognition”.
Verb(subject, object) triples e.g. “aspirin
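The contrast between a simple “bag of words” query and a Boolean query can be sketched as below. This is an illustration only: the toy collection `DOCS` and the helper names are assumptions, and real search engines rank documents in far more sophisticated ways.

```python
# Toy collection (made up for illustration): each document is a set of keywords.
DOCS = {
    "d1": {"neural", "networks", "speech", "recognition"},
    "d2": {"neural", "networks", "vision"},
    "d3": {"speech", "synthesis"},
}

def bag_of_words_search(query, docs):
    """Rank documents by the number of query keywords they contain.

    Word order and operators are ignored: the query is just a set of words.
    """
    terms = set(query.lower().split())
    scores = {doc_id: len(terms & words) for doc_id, words in docs.items()}
    matching = [doc_id for doc_id, score in scores.items() if score > 0]
    # Highest overlap first; ties broken by document id.
    return sorted(matching, key=lambda doc_id: (-scores[doc_id], doc_id))

def docs_with_all(terms, docs):
    """Return the documents containing every one of the given keywords."""
    return {doc_id for doc_id, words in docs.items() if set(terms) <= words}

# Bag of words: any overlap counts, best overlap first.
ranked = bag_of_words_search("neural networks speech recognition", DOCS)

# Boolean AND / OR / NOT map onto set intersection / union / difference.
neural_networks = docs_with_all(["neural", "networks"], DOCS)
speech_recognition = docs_with_all(["speech", "recognition"], DOCS)
and_result = neural_networks & speech_recognition
or_result = neural_networks | speech_recognition
not_result = neural_networks - speech_recognition
```

The bag-of-words query returns every document sharing at least one keyword, whereas the Boolean AND query keeps only the document that satisfies both halves of the expression.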
Document length. Longer documents can
discuss more topics, and hence be associated
with more keywords, and thus are more likely
to be retrieved.
This means we must normalise documents’
indices in some way to compensate for differing lengths.
We also assume that the smallest unit of text
with appreciable “aboutness” is the paragraph,
and larger documents are constructed of a
number of paragraphs.
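To see why normalisation matters, the sketch below compares a raw keyword-match count with one divided by document length. Dividing by the token count is only one simple choice, assumed here for illustration; the function names and example texts are hypothetical.

```python
def raw_score(query_terms, doc_tokens):
    """Count the tokens in the document that match any query term."""
    return sum(1 for token in doc_tokens if token in query_terms)

def normalised_score(query_terms, doc_tokens):
    """Raw match count divided by document length (assumed normalisation)."""
    if not doc_tokens:
        return 0.0
    return raw_score(query_terms, doc_tokens) / len(doc_tokens)

query = {"stemming", "recall"}

# A short, focused document and a long one that mentions the topic in passing.
short_doc = "stemming improves recall".split()
long_doc = ("stemming improves recall "
            + "and many other topics are discussed here " * 5
            + "stemming again").split()

raw_short, raw_long = raw_score(query, short_doc), raw_score(query, long_doc)
norm_short = normalised_score(query, short_doc)
norm_long = normalised_score(query, long_doc)
```

The raw count favours the long document (3 matches against 2), while the normalised score favours the short one (2/3 against 3/40).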
Stemming aims to remove surface markings
(such as number) to reveal a root form.
Using a token’s root form as an index term can
give robust retrieval even when the query
contains the plural CARS while the document
contains the singular CAR.
Linguists distinguish inflectional morphology
(plurals, third person singular, past tense, -ing)
from derivational morphology (e.g. teach
(verb), teacher (noun)). Weak vs. strong stemming.
Example stemming rules
(.*)SSES → \1SS: Perl-like syntax to say that
strings ending in –SSES should be transformed
by taking the stem (characters before –SSES)
and adding only the two characters –SS.
A complete stemmer contains many such rules
(60 in Lovins’ set), and a regime for handling
conflicts when multiple rules match the same
token, e.g. longest match, rule order.
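A minimal sketch of a rule-based stemmer with longest-match conflict handling follows. The four rules are illustrative only and are not Lovins’ actual rule set; `RULES` and `stem` are hypothetical names.

```python
import re

# Illustrative suffix rules (NOT Lovins' set): (suffix, replacement).
RULES = [
    ("SSES", "SS"),
    ("IES", "I"),
    ("SS", "SS"),
    ("S", ""),
]

def stem(token):
    """Apply the matching rule with the longest suffix; return the stem."""
    token = token.upper()
    best = None  # (regex match, suffix, replacement) of the longest rule so far
    for suffix, replacement in RULES:
        # "(.*)" captures the stem, i.e. the characters before the suffix.
        match = re.fullmatch("(.*)" + re.escape(suffix), token)
        if match and (best is None or len(suffix) > len(best[1])):
            best = (match, suffix, replacement)
    if best is None:
        return token  # no rule applies
    match, _, replacement = best
    return match.group(1) + replacement

stems = [stem(word) for word in ["CARESSES", "CARS", "CARESS", "CAR"]]
```

CARESSES matches the SSES, SS-free and S rules at once; longest match picks SSES and yields CARESS. Likewise CARESS is protected by the SS rule, which outranks the shorter S rule.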
Pros and Cons of Stemming
Reduces the size of the keyword vocabulary, allowing
compression of the index files of 10 – 50%.
Increases recall – a query on FOOTBALL now also
finds documents on FOOTBALLER(S), FOOTBALLING.
Reduces precision – stripping away morphological
features may obscure differences in word meanings.
For example, GRAVITY has two senses (earth’s pull,
seriousness). GRAVITATION can only refer to earth’s
pull – but if we stem it to GRAVITY, it could mean either.
How well are we doing (1)?
Evaluation of search engines is notoriously
difficult. However, we have two measures of
objective assessment. The first step is to
focus on a particular query.
We identify the set of documents Rel that are
determined to be relevant to it (subjective!).
A perfect search engine would retrieve all and
only the documents in Rel.
See fig. 1.10
Clearly, the number of documents that were designated
both relevant and retrieved, Retr ∩ Rel, will be a key
measure of success.
But we must compare the size of the set
|Retr ∩ Rel| to something.
If we were very concerned that the search engine
retrieve every relevant document (e.g. every prior ruling
relevant to a judicial case) , we should compare the
intersection to the number of documents marked as relevant.
This measure is known as recall:
R = |Retr ∩ Rel| / |Rel|
However, we might instead be worried about how much of
what we see is relevant (search engine users want a lot of
relevant hits on the first page), so an equally reasonable
standard of comparison is precision, the proportion of
retrieved documents which are in fact relevant:
P = |Retr ∩ Rel| / |Retr|
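Both measures follow directly from set operations on Retr and Rel. The sketch below assumes documents are identified by simple string ids; the function name and the example sets are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |Retr ∩ Rel| / |Retr|
    recall    = |Retr ∩ Rel| / |Rel|
    An empty denominator is treated as a score of 0.0 (a convention assumed here).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

retr = {"d1", "d2", "d3", "d4"}   # what the engine returned
rel = {"d2", "d4", "d7"}          # what the assessor judged relevant
p, r = precision_recall(retr, rel)
```

Here two of the four retrieved documents are relevant (precision 2/4), and two of the three relevant documents were found (recall 2/3).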
[Figures: retrieved versus relevant documents; high-recall retrieval; high-precision retrieval]
Conclusion: IR for short documents like
Prior annotation of the corpus of past translations (Clifton and
Teahan, 2004, QA systems)
Similarity measures for short segments of text: stemming to
increase recall, then document expansion, Kullback-Leibler
Divergence as a similarity measure (Metzler et al., 2007).
Tao Tao et al. (2006). Need for short text matching:
query/image caption; sponsored search: query/ad keyword;
query reformulation: query/query similarity.
Document expansion. Rohit Gupta expands the documents
with all entries in the PPDB Paraphrase database: lexical,
phrasal and syntactic.
Relevance feedback? (“more like this”).
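As a rough sketch of Kullback-Leibler divergence used as a (dis)similarity measure between two short texts, the code below compares smoothed unigram distributions. The add-one smoothing, the whitespace tokenisation and the function names are assumptions made for illustration and are not necessarily the setup used by Metzler et al. (2007).

```python
import math
from collections import Counter

def unigram_distribution(tokens, vocabulary, alpha=1.0):
    """Smoothed word probabilities over a shared vocabulary (add-alpha, assumed)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocabulary)
    return {word: (counts[word] + alpha) / total for word in vocabulary}

def kl_divergence(text_a, text_b):
    """KL(P_a || P_b) in nats; 0 means identical distributions, larger means less similar."""
    tokens_a = text_a.lower().split()
    tokens_b = text_b.lower().split()
    vocabulary = set(tokens_a) | set(tokens_b)
    p = unigram_distribution(tokens_a, vocabulary)
    q = unigram_distribution(tokens_b, vocabulary)
    return sum(p[word] * math.log(p[word] / q[word]) for word in vocabulary)

query = "cheap flights paris"
close = kl_divergence(query, "cheap flights to paris")
far = kl_divergence(query, "hotel deals in rome")
```

Note that KL divergence is not symmetric, so `kl_divergence(a, b)` and `kl_divergence(b, a)` generally differ; smoothing is needed because unsmoothed zero probabilities would make the divergence undefined.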