Published in: Technology
  • Should computers in science match computer science?
  • Would yield the phrases “automatic analysis”, “text analysis” and “scientific text”, but not “scientific analysis”.
  • Would yield the phrases “automatic analysis”, “text analysis” and “scientific text”, but also less desirable phrases such as “these texts” and “four texts”.
  • LOC is a locality. There is an opacity affecting the lung [and] an opacity affecting the trachea, with certainty [p] (probable)
  • My understanding of this system is as follows. When retrieving relative to a query (expressed in the same form as the document), we want to see whether d → q. It is quite likely that, using standard logical transformations, d won’t directly imply q; the writers of this system have therefore provided a series of logical transforms, each of which introduces an amount of uncertainty. Using these transformations, d will imply q with a certain probability.
  • Here we are transforming a document from saying “An opacity of the lung has a blurred contour” to “An opacity has a blurred contour”.
  • Further transformations are provided through knowledge of what words mean. Note that this only works in this limited domain, where there is little ambiguity in these words. In a wider domain, documents about cancer could be about the astrological sign.
  • NLP

    1. 1. Artificial intelligence & natural language processing Mark Sanderson Porto, 2000
    2. 2. Aims • To provide an outline of the attempts made at using NLP techniques in IR
    3. 3. Objectives • At the end of this lecture you will be able to – Outline a range of attempts to get NLP to work with IR systems – Idly speculate on why they failed – Describe the successful use of NLP in a limited domain
    4. 4. Why? • Seems an obvious area of investigation – Why is it not working?
    5. 5. Use of NLP • Syntactic – Parsing to identify phrases – Full syntactic structure comparison • Semantic – Building an understanding of a document’s content • Discourse – Exploiting document structure?
    6. 6. Syntactic • Parsing to identify phrases – The issues. – Explain how it’s done (a bit). – Is it worth it? • Other possibilities – Grammatical tagging – Full syntactic structure comparison • Explain how it’s done (a little bit). • Show results.
    7. 7. Simple phrase identification • High frequency terms could be good candidates. – Why? • Terms co-occurring more often than chance. – Within small number of words. – Surrounding simple terms. – Not surrounding punctuation.
    8. 8. Problems • Close words that aren’t phrases. • “the use of computers in science & technology” • Distant words that are phrases. • “preparation & evaluation of abstracts and extracts”
    9. 9. Parsing for phrases • Using parsers to identify noun phrases. • Make a phrase out of a head and the head of its modifiers. “automatic analysis of scientific text” [parse diagram: ADJ NOUN PREP ADJ NOUN, with NP and PP nodes]
    10. 10. Errors • Not a perfect rule by any means. – Need restrictions to eliminate bogus phrases. “automatic analysis of these four scientific texts” [parse diagram: ADJ NOUN PREP DET QUANT ADJ NOUN, with NP and PP nodes]
    11. 11. Do they work? • Fagan compared statistical with syntactic phrases; statistical won, but only just – J. Fagan (1987) Experiments in phrase indexing for document retrieval: a comparison of syntactic & nonsyntactic methods, in TR 87-868 - Department of Computer Science, Cornell University • More research has been conducted. – T. Strzalkowski (1995) Natural language information retrieval, in Information Processing & Management, Vol. 31, No. 3, pp 397-417
    12. 12. Check out TREC • Overview of the Seventh Text REtrieval Conference (TREC-7), E. Voorhees, D. Harman (National Institute of Standards and Technology) – Ad hoc track • Fairly even between statistical phrases, syntactic phrases and no phrases.
    13. 13. Grammatical tagging? • Tag document text with grammatical codes? – R. Garside (1987). The CLAWS word tagging system, in The computational analysis of English: a corpus based approach, R. Garside, G. Leech, G. Sampson Eds., Longman: 30-41. • Doesn’t appear to work – R. Sacks-Davis, P. Wallis, R. Wilkinson (1990). Using syntactic analysis in a document retrieval system that uses signature files, in Proceedings of 13th ACM SIGIR Conference: 179-191.
    14. 14. Syntactic structure comparison • Has been tried… – A. F. Smeaton & P. Sheridan (1991) Using morpho-syntactic language analysis in phrase matching, in Proceedings of RIAO ’91, Pages 414-429 • Method – Parse sentences into tree structures – When you get a phrase match • Look at linking syntactic operator. • Look at the residual tree structure that didn’t match • Does not appear to work
    15. 15. Semantic • Disambiguation – Given a word appearing in a certain context, disambiguators will tell you what sense it is. • IR system – Index document collections by senses rather than words – Ask the users what senses the query words are – Retrieve on senses
    16. 16. Disambiguation • Does it work? – No (well maybe) • M. Sanderson, Word sense disambiguation and information retrieval, in Proceedings of the 17th ACM SIGIR Conference, Pages 142-151, 1994 • M. Sanderson & C.J. van Rijsbergen, The impact on retrieval effectiveness of skewed frequency distributions, in ACM Transactions on Information Systems (TOIS) Vol. 17 No. 4, 1999, Pages 440-465.
    17. 17. Partial conclusions • NLP has yet to prove itself in IR – Agree – D.D. Lewis & K. Sparck-Jones (1996) Natural language processing for information retrieval, in Communications of the ACM (CACM) 1996 Vol. 39, No. 1, 92-101 – Sort of don’t agree – A. Smeaton (1992) Progress in the application of natural language processing to information retrieval tasks, in The Computer Journal, Vol. 35, No. 3.
    18. 18. Mark’s idle speculation • What people think is always going on [figure labels: Keywords, NLP]
    19. 19. Mark’s idle speculation • What’s usually actually going on [figure labels: Keywords, NLP]
    20. 20. Areas where NLP does work • Systems with the following ingredients. – Collection documents cover small domain. – Language use is limited in some manner. – User queries cover tight subject area. – Documents/queries very short • Image captions – LSI, pseudo-relevance feedback – People willing to spend money getting NLP to work
    21. 21. RIME & IOTA • From Grenoble – Y. Chiaramella & J. Nie (1990) A retrieval model based on an extended modal logic and its application to the RIME experimental approach, in Proceedings of the 13th SIGIR conference, Pages 25-43 • Medical record retrieval system • Some database’y parts • Free text descriptions of cases
    22. 22. Indexing • “an opacity affecting probably the lung and the trachea” [index tree, root to leaves: {[p], SGN}; {[and], SGN}; two {[bears-on], SGN} nodes, one over {[opacity], SGN} and {[lung], LOC}, the other over {[opacity], SGN} and {[trachea], LOC}] • LOC - localisation • SGN - observed sign
    23. 23. Retrieval • How do we match a user’s query to these structures? – Using transformations - bit like logic. [diagram: the tree {[bears-on], SGN} over {[opacity], SGN} and {[lung], LOC} ⇒ {[lung], LOC}, t; the same tree ⇒ {[opacity], SGN}, t] • t - uncertainty
    24. 24. Tree transformation [diagram: a tree with nodes {[has-for-value], SGN} (twice), {[bears-on], SGN}, {[opacity], SGN}, {[lung], LOC}, {[contour], SGN}, {[blurred], LOC} ⇒ a tree with nodes {[has-for-value], SGN}, t; {[has-for-value], SGN}; {[opacity], SGN}; {[contour], SGN}; {[blurred], LOC}]
    25. 25. Term transforms • Basic medical terms stored in a hierarchy. – Transformations possible again with uncertainty added. [hierarchy diagram with columns Level 1, Level 2, Level 3; terms: tumour, cancer, sarcoma, hygroma, kyste, polykystosis, pseudokyst, polyp, polyposis]
    26. 26. Isn’t this a bit slow? • Yes • Optimisation – Scan for potential documents. – Process them intensively. • Evaluation? – Not in that paper.
    27. 27. Not unique • SCISOR – P.S. Jacobs & L.F. Rau (1990) SCISOR: Extracting Information from On-line News, in Communications of the ACM (CACM), Vol. 33, No. 11, 88-97
    28. 28. Why do they work? • Because of the restrictions – Small subject domain. – Limited vocabulary. – Restricted type of question. • Compare with large scale IR system. – Keywords are good enough. – Long time to set up. – Hard to adapt to new domain.
    29. 29. Anything else for NLP? • Text Generation – IR system explaining itself?
    30. 30. Conclusions • By now, you will be able to – Outline a range of attempts to get NLP to work with IR systems – Idly speculate on why they failed – Describe the successful use of NLP in a limited domain
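
Slide 7’s idea of treating terms that co-occur more often than chance as candidate phrases can be sketched as follows. This is only an illustration, not the method of any paper cited in the lecture: the function name `candidate_phrases`, the window size, the count threshold and the use of pointwise mutual information as the “more often than chance” measure are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def candidate_phrases(text, window=3, min_count=2):
    """Score word pairs that co-occur within `window` words of each other."""
    # Split on punctuation first, so that no pair spans a punctuation mark.
    segments = [seg.lower().split() for seg in re.split(r"[.,;:!?()]", text)]
    word_counts = Counter(w for seg in segments for w in seg)
    total = sum(word_counts.values())

    pair_counts = Counter()
    for seg in segments:
        for i, left in enumerate(seg):
            for right in seg[i + 1 : i + 1 + window]:
                pair_counts[(left, right)] += 1

    scores = {}
    for (left, right), n in pair_counts.items():
        if n < min_count:
            continue  # too rare to say anything about chance
        # Assumed measure: pointwise mutual information, i.e. the observed
        # pair rate compared with the rate expected if words were independent.
        expected = (word_counts[left] / total) * (word_counts[right] / total)
        scores[(left, right)] = math.log2((n / total) / expected)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


text = ("Information retrieval systems index text. "
        "Information retrieval, however, is hard. Text is text.")
print(candidate_phrases(text))
```

On this toy text only the pair (“information”, “retrieval”) clears the threshold. The problems on slide 8 still apply: a window-based test like this will happily pair close words that are not phrases and miss distant words that are.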
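
The rule on slide 9 (make a phrase out of a head and the head of its modifiers) and the restriction mentioned on slide 10 can be sketched on hand-built parses. Everything here is a simplifying assumption: no real parser is called, the `Node` class and `head_modifier_phrases` function are hypothetical names, a prepositional phrase is represented directly by its head noun, and no stemming is applied.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One head word in a hand-built parse, with the heads of its modifiers."""
    word: str
    tag: str
    modifiers: List["Node"] = field(default_factory=list)


def head_modifier_phrases(head, skip_tags=frozenset()):
    """Pair each head with the head of each of its modifiers, recursively."""
    phrases = []
    for mod in head.modifiers:
        if mod.tag not in skip_tags:  # restriction that filters bogus phrases
            phrases.append(f"{mod.word} {head.word}")
        phrases.extend(head_modifier_phrases(mod, skip_tags))
    return phrases


# "automatic analysis of scientific text"
simple = Node("analysis", "NOUN", [
    Node("automatic", "ADJ"),
    Node("text", "NOUN", [Node("scientific", "ADJ")]),
])

# "automatic analysis of these four scientific texts"
noisy = Node("analysis", "NOUN", [
    Node("automatic", "ADJ"),
    Node("texts", "NOUN", [
        Node("these", "DET"), Node("four", "QUANT"), Node("scientific", "ADJ"),
    ]),
])

print(head_modifier_phrases(simple))
print(head_modifier_phrases(noisy))
print(head_modifier_phrases(noisy, skip_tags={"DET", "QUANT"}))
```

Because “scientific” modifies “text” rather than “analysis”, the first call never produces “scientific analysis”. The second call shows the unwanted “these texts” and “four texts”, and the third drops them by skipping determiners and quantifiers.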
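
The term transforms of slide 25 (moving through a hierarchy of medical terms, with uncertainty added at each step) might look something like the sketch below. It is not taken from the RIME paper. The parent links are one assumed reading of the slide’s hierarchy, whose layout did not survive in this transcript, and the per-step certainty of 0.8 and the name `generalisation_certainty` are purely hypothetical.

```python
# Assumed parent links (child -> parent); the slide's real layout is unknown.
PARENT = {
    "cancer": "tumour",
    "sarcoma": "cancer",
    "polyp": "tumour",
    "polyposis": "polyp",
}

# Hypothetical certainty retained by each one-level generalisation.
STEP_CERTAINTY = 0.8


def generalisation_certainty(doc_term, query_term):
    """Certainty that a document term implies a (more general) query term."""
    certainty = 1.0
    term = doc_term
    while term is not None:
        if term == query_term:
            return certainty
        term = PARENT.get(term)      # climb one level, or stop at the root
        certainty *= STEP_CERTAINTY  # each transformation adds uncertainty
    return 0.0  # the query term is not an ancestor of the document term


print(generalisation_certainty("sarcoma", "sarcoma"))
print(generalisation_certainty("sarcoma", "tumour"))
print(generalisation_certainty("tumour", "sarcoma"))
```

An exact match keeps full certainty, a match reached by climbing two levels keeps less, and a general document term does not imply a more specific query term at all. As the slide notes point out, this only makes sense in a narrow domain where the terms are unambiguous.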