14. Michael Oakes (UoW) Natural Language Processing for Translation



  1. 1. Information Retrieval (from EBMT examples). Michael P. Oakes, University of Wolverhampton. Birmingham Winter School, 2013
  2. 2. Finding Out About (FOA)
     • “Finding Out About”, by Richard Belew (2000), Cambridge University Press.
     • Finding Out About (FOA): research activities that allow a decision maker to draw on others’ knowledge, especially via the WWW. This is “Information Retrieval”.
     • A library (or the WWW) contains many books (“documents”) on many topics. The authors are typically people far from our time or place, using language similar but not identical to our own. We must FOA a topic of special interest, looking only for those things which are relevant to our search. This basic skill is a fundamental part of an academic’s job:
  3. 3. FOA has 3 phases:
     • asking a question
     • constructing an answer
     • assessing an answer
  4. 4. 1. Asking a question
     • Users of a search engine may be aware of a specific gap in their knowledge, and are motivated to fill it (meta-cognition about ignorance), but they may not be able to articulate that gap. Forming a clearly posed question is the hardest part of answering it! This common cognitive state is the user’s information need. Users try to take their ill-defined, internal cognitive state and turn it into an external expression of their question. This external expression is called the query, and the language in which it is constructed is the query language.
  5. 5. 2. Constructing an answer
     • A human question-answerer might consider:
         • can they translate the user’s ill-formed query into a better one?
         • do they know the answer themselves?
         • are they able to verbalise this answer in terms the user will understand?
         • can they provide the necessary background knowledge for the user to understand the answer itself?
     • Current search engines are slightly more limited in scope. The search engine has available to it only a preexisting set of “canned” texts (although this may be very large), and its response is limited to identifying one or more of these passages and presenting them to the user.
  6. 6. 3. Assessing the answer
     • We would normally give feedback to a human answerer, e.g. “That isn’t what I meant”, “Let me ask it another way”, “That helps, but I still have this problem” or “What does that have to do with anything?”. So we “close the loop” when the user provides an assessment of how relevant they found the answer provided. In an automatic system this is relevance feedback: the user reacts to each retrieved document as “relevant”, “not relevant” or “neutral”.
     • See fig. 1.4: the three steps in a computerised, algorithmic context, i.e. information retrieval.
  7. 7. Keywords
     • The fundamental operation performed by a search engine is a match between descriptive features mentioned by users in their queries, and documents sharing those features. By far the most important kind of features are keywords.
     • Keywords are linguistic atoms (typically words, pieces of words, or phrases) used to characterise the subject or content of the document.
     • They are pivotal because they must bridge the gap between the users’ characterisation of their information need (queries) and the characterisation of the documents’ topical focus against which these will be matched.
     • Contrast natural language queries with bag-of-words queries.
  8. 8. Keywords as document descriptors
     • Keywords are also used as document descriptors.
     • Indexing is the process of associating one or more keywords with each document.
     • The vocabulary used can either be controlled or uncontrolled (also known as closed or open). If we organise a conference, and ask the authors of each paper to index it manually using only terms on a fixed list of potential keywords, this is a closed vocabulary.
  9. 9. Query syntax
     • A typical search engine query consists of 2 to 3 words.
     • Queries defined only as sets of keywords are simple queries. Most search engines use this “bag of words” approach.
     • Other possibilities exist, e.g. the Boolean operators AND / OR / NOT, as in “neural networks AND speech recognition”.
     • Verb(subject, object) triples, e.g. “aspirin treats blood_clotting”.
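
The difference between a bag-of-words query and a Boolean query on the query syntax slide can be sketched as follows. This is only an illustration under stated assumptions: the three-document index and the function names `bag_of_words_match` and `boolean_and_match` are invented here, and each phrase is treated as a single keyword.

```python
# Toy index (hypothetical data): each document is described by a set of keywords.
INDEX = {
    "d1": {"neural networks", "speech recognition"},
    "d2": {"neural networks", "image processing"},
    "d3": {"speech recognition", "hidden markov models"},
}

def bag_of_words_match(query_terms, index):
    """Simple query: rank documents by the number of query keywords they share."""
    wanted = set(query_terms)
    scores = {doc: len(keywords & wanted) for doc, keywords in index.items()}
    # Highest score first; ties broken by document name for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc for doc, score in ranked if score > 0]

def boolean_and_match(query_terms, index):
    """Boolean AND query: keep only documents that contain every query keyword."""
    wanted = set(query_terms)
    return sorted(doc for doc, keywords in index.items() if wanted <= keywords)

query = ["neural networks", "speech recognition"]
print(bag_of_words_match(query, INDEX))  # any shared keyword counts
print(boolean_and_match(query, INDEX))   # all keywords are required
```

The bag-of-words query returns every document sharing at least one keyword, best match first, whereas the AND query returns only the document containing both.
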
  10. 10. Document length
      • Longer documents can discuss more topics, and hence be associated with more keywords, and thus are more likely to be retrieved.
      • This means we must normalise documents’ indices in some way to compensate for differing lengths.
      • We also assume that the smallest unit of text with appreciable “aboutness” is the paragraph, and larger documents are constructed of a number of paragraphs.
  11. 11. Stemming
      • Stemming aims to remove surface markings (such as number) to reveal a root form.
      • Using a token’s root form as an index term can give robust retrieval even when the query contains the plural CARS while the document contains the singular CAR.
      • Linguists distinguish inflectional morphology (plurals, third person singular, past tense, -ing) from derivational morphology (e.g. teach (verb), teacher (noun)). Weak vs. strong stemming.
  12. 12. Example stemming rules
      • (.*)SSES → \1SS: Perl-like syntax to say that strings ending in -SSES should be transformed by taking the stem (the characters before -SSES) and adding only the two characters -SS.
      • (.*)IES → \1Y
      • A complete stemmer contains many such rules (60 in Lovins’ set), and a regime for handling conflicts when multiple rules match the same token, e.g. longest match or rule order.
  13. 13. Pros and Cons of Stemming
      • Reduces the size of the keyword vocabulary, allowing the index files to be compressed by 10-50%.
      • Increases recall: a query on FOOTBALL now also finds documents on FOOTBALLER(S), FOOTBALLING.
      • Reduces precision: stripping away morphological features may obscure differences in word meanings. For example, GRAVITY has two senses (earth’s pull, seriousness). GRAVITATION can only refer to earth’s pull, but if we stem it to GRAVITY, it could mean either.
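
The two rewrite rules from the stemming slides can be tried out with a minimal sketch like the one below. It is not a complete stemmer: conflicts are resolved by rule order only, the function name `stem` is chosen here for illustration, and the third rule (strip a final -S that does not follow another S) is an assumption added so that the CARS / CAR example works.

```python
import re

# Rules are tried in order; the first one that matches wins (rule-order regime).
RULES = [
    (re.compile(r"^(.*)SSES$"), r"\1SS"),   # from the slide: -SSES -> -SS
    (re.compile(r"^(.*)IES$"), r"\1Y"),     # from the slide: -IES -> -Y
    (re.compile(r"^(.*[^S])S$"), r"\1"),    # assumption: plain plural -S is dropped
]

def stem(token):
    """Return the upper-cased token after applying the first matching rule."""
    token = token.upper()
    for pattern, replacement in RULES:
        if pattern.match(token):
            return pattern.sub(replacement, token)
    return token  # no rule applies: the token is its own stem

for word in ["caresses", "ponies", "cars", "car", "caress"]:
    print(word, "->", stem(word))
```

Because CARS and CAR reduce to the same index term, a query containing one retrieves documents containing the other, which is the recall gain described above.
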
  14. 14. Calculating TF-IDF weighting
  15. 15. The Vector Space Model
  16. 16. The cosine similarity measure
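
The formulas on the TF-IDF, vector space and cosine slides were not captured in this transcript, so the sketch below uses one common textbook variant rather than the lecturer's own definitions. It assumes the weight of a term is its raw count in the document times log(N / df), where N is the number of documents and df the number of documents containing the term. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one sparse {term: weight} vector per document (weight = tf * log(N / df))."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        vectors.append({t: count * math.log(n_docs / df[t]) for t, count in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either has zero length."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = [
    "the cat sat on the mat",
    "the cat sat on the mat the cat sat on the mat",  # same text, twice as long
    "the dog chased the ball",
]
vecs = tf_idf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # same direction despite different lengths
print(cosine(vecs[0], vecs[2]))  # only "the" is shared, and its idf is zero
```

Dividing by the vector norms is one way to normalise for document length: the doubled document points in the same direction as the original, so the two are scored as equally similar to any query.
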
  17. 17. How well are we doing (1)?
      • Evaluation of search engines is notoriously difficult. However, we have two measures of objective assessment. The first step is to focus on a particular query.
      • We identify the set of documents Rel that are determined to be relevant to it (subjective!).
      • A perfect search engine would retrieve all and only the documents in Rel.
      • See fig. 1.10.
  18. 18. Recall
      • Clearly, the number of documents that were designated both relevant and retrieved, |Retr ∩ Rel|, will be a key measure of success.
      • But we must compare the size of the set |Retr ∩ Rel| to something.
      • If we were very concerned that the search engine retrieve every relevant document (e.g. every prior ruling relevant to a judicial case), we should compare the intersection to the number of documents marked as relevant, |Rel|.
      • This measure is known as recall: recall = |Retr ∩ Rel| / |Rel|
  19. 19. Precision
      • However, we might instead be worried about how much of what we see is relevant (search engine users want a lot of relevant hits on the first page), so an equally reasonable standard of comparison is precision, the proportion of retrieved documents which are in fact relevant:
      • P = |Retr ∩ Rel| / |Retr|
  20. 20. Retrieved versus Relevant Docs [Figure: the Retrieved and Relevant document sets, with two cases labelled “High Recall Retrieval” and “High Precision Retrieval”]
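
The recall and precision definitions above translate directly into set operations. The following is a minimal sketch: the document identifiers are made up, and returning 0.0 for an empty denominator is a convention chosen here, not something stated on the slides.

```python
def recall(retrieved, relevant):
    """|Retr ∩ Rel| / |Rel|: the share of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

def precision(retrieved, relevant):
    """|Retr ∩ Rel| / |Retr|: the share of retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

# Hypothetical judgements for one query.
rel = {"d1", "d2", "d3", "d4"}   # documents judged relevant
retr = {"d1", "d2", "d5"}        # documents the engine returned

print(recall(retr, rel))     # 2 of the 4 relevant documents were found
print(precision(retr, rel))  # 2 of the 3 returned documents are relevant
```

Retrieving the whole collection would push recall to 1.0 while precision collapses, which is why the two measures are reported together.
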
  21. 21. Conclusion: IR for short documents like translation examples
      • Prior annotation of the corpus of past translations (Clifton and Teahan, 2004, QA systems).
      • Similarity measures for short segments of text: stemming to increase recall, then document expansion, with Kullback-Leibler divergence as a similarity measure (Metzler et al., 2007).
      • Tao Tao et al. (2006). Need for short text matching: query / image caption; sponsored search: query / ad keyword; query reformulation: query / query similarity.
      • Document expansion: Rohit Gupta expands the documents with all entries in the PPDB paraphrase database (lexical, phrasal and syntactic).
      • Relevance feedback? (“more like this”).
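
To make the mention of Kullback-Leibler divergence concrete, here is a rough sketch of how it can compare two short text segments. It is not the method of Metzler et al. (2007), which is not described in these slides. It assumes unigram word distributions with add-one smoothing over the joint vocabulary, and the function names and example sentences are invented for illustration.

```python
import math
from collections import Counter

def unigram_dist(text, vocab, alpha=1.0):
    """Smoothed unigram distribution of `text` over `vocab` (add-alpha smoothing)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl_divergence(text_p, text_q, alpha=1.0):
    """KL(P || Q) between the smoothed unigram distributions of two texts."""
    vocab = set(text_p.lower().split()) | set(text_q.lower().split())
    p = unigram_dist(text_p, vocab, alpha)
    q = unigram_dist(text_q, vocab, alpha)
    # Smoothing keeps every q[w] above zero, so the logarithm is always defined.
    return sum(p[w] * math.log(p[w] / q[w]) for w in vocab)

source = "the house is small"
near = "the house is little"
far = "I like green apples"

print(kl_divergence(source, near))  # one word differs
print(kl_divergence(source, far))   # no words in common
```

Lower values mean more similar segments, and the measure is asymmetric, so KL(P || Q) generally differs from KL(Q || P). With segments this short, most of the signal comes from exact word overlap, which is one reason stemming and document expansion are suggested beforehand.
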