The Role of Natural Language Processing in Information Retrieval


Published on

The Role of Natural Language Processing in Information Retrieval: Searching for Meaning in Text

Published in: Technology, Education
1 Comment
  • I just posted a whitepaper on SlideShare describing the differences between lemmatization and stemming which may be of interest:
    Are you sure you want to  Yes  No
    Your message goes here
No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide
  • Exercise: experiment with the 3 major search engines (Google, Bing, Yahoo) to see how they handle stop words. Try queries that are dominated by stop words, with and without quotes. How do they differ?
  • Exercise: experiment with the 3 major search engines (Google, Bing, Yahoo) to see how they handle stemming. Try queries that are dominated by inflected forms, with and without quotes. How do they differ?
  • Do as class exercise
  • Demo: ELIZA
  • Demo Google suggest
  • Demo Auto-correct & DYM
  • Demo: try Google Translate (or Babelfish) with sentences containing ambiguous terms or NEs, e.g. “I live in a new house in new york”. Use the speech synthesizer|de|I%20live%20in%20a%20new%20house%20in%20new%20york
  • Demo Yahoo search assist + “explore concepts”
  • Paired exercise: what is the most polysemous English word you can think of? How many senses do you think it has in an average dictionary? Think of some polysemous words (examples include “ball”, “bank”, “port” but there are many others). Use these as queries in a search engine and comment on what you notice about the results.
  • Paired Exercise: try each of above queries in Google, Yahoo, Bing. Now try to answer the same questions with a normal web search (by entering a set of query terms instead of a natural language question). Are the results of this search any better or worse?
  • Paired exercise: try previous queries on these sites
  • The Role of Natural Language Processing in Information Retrieval

    1. 1. The Role of Natural Language Processing in Information RetrievalSearching for Meaning in Text<br />Tony Russell-Rose, PhD<br />21-Mar-2011<br />
    2. 2. Contents<br />Introduction & Overview<br />Background, Terminology<br />NLP Fundamentals<br />Paradigms & Perspectives<br />NLP Tools & Techniques<br />NLP Applications<br />Text mining, Question Answering, etc.<br />Conclusions & Further Reading<br />21-Mar-2011<br />2<br />NLP & IR<br />
    3. 3. Introduction & Overview<br />The Role of Natural Language Processing in Information Retrieval<br />21-Mar-2011<br />3<br />NLP & IR<br />
    4. 4. Information Retrieval<br />Models, techniques & frameworks for satisfying an information need<br />Common paradigm: document retrieval<br />21-Mar-2011<br />4<br />NLP & IR<br />
    5. 5. Document Retrieval<br />Methods include Boolean, vector space, probabilistic<br />Rely on index terms<br />“bag of words”<br />Stop lists + stemming<br />But text is “unstructured”<br />Information may be “hidden”<br />21-Mar-2011<br />5<br />NLP & IR<br />
    6. 6. NLP – a solved problem?<br />As humans we do it effortlessly ... don’t we?<br />DRUNK GETS NINE YEARS IN VIOLIN CASE<br />PROSTITUTES APPEAL TO POPE <br />STOLEN PAINTING FOUND BY TREE <br />RED TAPE HOLDS UP NEW BRIDGE<br />DEER KILL 300,000<br />RESIDENTS CAN DROP OFF TREES<br />INCLUDE CHILDREN WHEN BAKING COOKIES <br />MINERS REFUSE TO WORK AFTER DEATH  <br />21-Mar-2011<br />6<br />NLP & IR<br />
    7. 7. Problems with Text (1)<br />Polysemy<br />One word maps to many concepts<br />e.g. bat<br />Synonymy<br />One concept maps to many words<br />21-Mar-2011<br />7<br />NLP & IR<br />
    8. 8. Problems with Text (2)<br />Word order<br />Venetian blind vs. Blind venetian<br />21-Mar-2011<br />8<br />NLP & IR<br />
    9. 9. Problems with Text (3)<br />Language is generative<br />Starbucks coffee is the best<br />The place I like most when I need to feed my caffeine addiction is the company from Seattle with branches everywhere<br />Many different ways to express a given idea<br />Synonymy, paraphrase, metaphor, etc.<br />21-Mar-2011<br />9<br />NLP & IR<br />
    10. 10. Problems with Text (4)<br /><ul><li>Frege's principle:
    11. 11. The meaning of a sentence is completely determined by the meaning of its symbols and the syntax used to combine them</li></ul>Language is a form of communication<br />All communication has a context:<br />time and place of utterance, the writer, the reader, their backgroundknowledge, intentions, assumptions re the reader’s knowledge/intentions, etc.<br />21-Mar-2011<br />10<br />NLP & IR<br />
    12. 12. Problems with Text (5)<br />Language is changing<br />I want to buy a mobile<br />Ill-formed input<br />“accomodation office”<br />Co-ordination, negation, etc.<br />This is not a talk about neuro-linguistic programming<br />Multi-linguality<br />Claudia Schiffer is on the cover of Elle<br />Also sarcasm, irony, slang, jargon, etc.<br />That was a wicked lecture<br />Yep – the coffee break was the best part<br />21-Mar-2011<br />11<br />NLP & IR<br />
    13. 13. Enter NLP / Text Analytics…<br />Text Analytics: a set of linguistic, analytical and predictive techniques to extract structure and meaning from unstructured documents<br />NLP: the academic term for Text Analytics<br />analogous to “search” vs. “IR”<br />Text Analytics ~= Natural Language Processing ~= Text Mining<br />Text Mining -> Scientific / technical context, automated processing<br />Text Analytics -> Business context, interactive apps<br />21-Mar-2011<br />12<br />NLP & IR<br />
    14. 14. NLP Fundamentals<br />The Role of Natural Language Processing in Information Retrieval<br />21-Mar-2011<br />13<br />NLP & IR<br />
    15. 15. NLP Perspectives<br />Computational Linguistics<br />Use of computational techniques to study linguistic phenomena<br />Cognitive science<br />Study of human information processing (perception, language, reasoning, etc.)<br />Computer science<br />Theoretical foundations of computation and practical techniques for implementation <br />Information science<br />Analysis, classification, manipulation, retrieval and dissemination of information<br />21-Mar-2011<br />14<br />NLP & IR<br />
    16. 16. NLP Paradigms<br />Symbolic approaches<br />Rule-based, hand coded (by linguists)<br />Knowledge-intensive<br />Statistical approaches<br />Shallow statistical models, trained on annotated corpora<br />Data-intensive<br />21-Mar-2011<br />15<br />NLP & IR<br />
    17. 17. NLP Fundamentals (word level)<br />Language is AMBIGUOUS<br />To find structure, we must remove ambiguity!<br />Lexical analysis (tokenisation)<br />The cat sat on the mat<br />I can’t tokenise this sentence<br />Stop word removal<br />No definitive list<br />The Who, The The, Take That…<br />To be or not to be<br />21-Mar-2011<br />16<br />NLP & IR<br />
    18. 18. NLP Fundamentals (word level)<br />Stemming<br />fishing, fished, fish, fisher -> fish<br />Lemmatization<br />Stemming + context<br />Passing -> pass + ING<br />Were -> be + PAST<br />Delegate = de-leg-ate (?)<br />Ratify = rat-ify (?)<br />Morphology (prefixes, suffixes, etc.)<br />Gebäudereinigungsfirmenangestellter -> Gebäude + Reinigung + Firma + Angestellter (building + cleaning + company + employee)<br />21-Mar-2011<br />17<br />NLP & IR<br />
    19. 19. NLP Fundamentals (sentence level)<br />Syntax (part of speech tagging)<br />book -> NOUN, VERB<br />that -> DETERMINER<br />flight -> NOUN<br />Book that flight -> VB DT NN<br />Ambiguity problem<br />Time flies like an arrow -> ?<br />Fruit flies like a banana -> ?<br />Eats shoots and leaves -> ?<br />21-Mar-2011<br />18<br />NLP & IR<br />
    20. 20. NLP Fundamentals (sentence level)<br />Parsing (grammar)<br />I saw a venetian blind<br />I saw a blind venetian<br />I saw the man on the hill with a telescope<br />Rugby is a game played by men with odd-shaped balls<br />Sentence boundary detection<br />Punctuation denotes the end of a sentence!<br />“But not always!”, said Fred...<br />21-Mar-2011<br />19<br />NLP & IR<br />
    21. 21. NLP Tools & Techniques<br />The Role of Natural Language Processing in Information Retrieval<br />21-Mar-2011<br />20<br />NLP & IR<br />
    22. 22. Early NLP Systems<br />ELIZA<br />Wiezenbaum 1966<br />Simple pattern matching<br />PARRY<br />Colby et al 1971<br />Pattern matching & planning<br />SHRDLU<br />Winograd 1972<br />Natural language understanding<br />Comprehensive grammar of English<br />21-Mar-2011<br />21<br />NLP & IR<br />
    23. 23. Word Prediction<br />Assistive technologies<br />Aurora Systems, TextHelp<br />Google, Bing, Yahoo query suggestions<br />21-Mar-2011<br />22<br />NLP & IR<br />
    24. 24. Spelling Correction<br />Auto-correct<br />Did You Mean<br />21-Mar-2011<br />23<br />NLP & IR<br />
    25. 25. Text Categorization<br />News agencies:<br />classify incoming news stories<br />Search engines:<br />classify queries, e.g. search Google for ‘LOG313’<br />Identifying spam emails, e.g.<br /><br />Routing email or documents to appropriate people<br />21-Mar-2011<br />24<br />NLP & IR<br />
    26. 26. Terminology Extraction<br />Differentiate between useful index terms and ‘noise’<br />Help lexicographers identify new terminology:<br />the C4 carbons of the nicotinamide rings<br />X-ray therapy vs. X-ray therapies<br />PAHO vs. Pan-American Health Organization<br />Term extraction systems process scientific papers to identify terminology, possibly comparing it with a known list<br />21-Mar-2011<br />25<br />NLP & IR<br />
    27. 27. Speech Recognition<br />Spoken Dialogue Systems<br />Directory enquiries (AT&T)<br />Railways (Germany, Switzerland, Italy)<br />Airlines (USA)<br />iPhone Voice Search<br />IBM Watson<br />21-Mar-2011<br />26<br />NLP & IR<br />
    28. 28. Named Entity Recognition<br />Identification of key concepts, <br />e.g. people, places, organisations, etc.<br />Also postcodes, temporal/numerical expressions, etc.<br />“Mexico has been trying to stage a recovery since the beginning of this year and it's always been getting ahead of itself in terms of fundamentals,” said Matthew Hickman of Lehman Brothers in New York.”<br />21-Mar-2011<br />27<br />NLP & IR<br />
    29. 29. Named Entity Recognition: Applications<br />Increase precision of IR<br />New companies in York vs. Companies in New York<br />Support navigation<br />SMART tags, Apture, etc.<br />Improve machine translation:<br />“I live in a new house in new york”<br />Speech synthesis, auto-summarisation, etc.<br />21-Mar-2011<br />28<br />NLP & IR<br />
    30. 30. Named Entity Recognition<br />Systems usually developed for specific domain<br />Are typically language and domain-specific<br />Accuracy > 90% for newswire text<br />Best system at MUC-7 scored 93.39% f-measure<br />human annotators scored 97.60% and 96.95%.<br />Lower for certain domains, e.g. life sciences (species, substances, proteins, genes, etc.) <br />extended naming conventions, extensive use of conjunction and disjunction, high occurrence of neologisms and abbreviations, etc.<br />21-Mar-2011<br />29<br />NLP & IR<br />
    31. 31. NER (Newssift Example)<br />21-Mar-2011<br />30<br />NLP & IR<br />
    32. 32. Machine Translation (before)<br />21-Mar-2011<br />31<br />NLP & IR<br />
    33. 33. Machine Translation (after)<br />21-Mar-2011<br />32<br />NLP & IR<br />
    34. 34. Summarisation<br />Section 3 discusses two information access applications (text mining and question answering) closely associated with NLP. <br />NLP Techniques<br />Named entity recognition <br />Information Extraction <br />Current document retrieval technologies could not identify information as specific as this within text. <br />Word Sense Disambiguation<br />Text Mining<br />Question Answering <br />“Wh” questions<br />List questions <br />Examples include those mentioned here: text mining, question answering and cross-language information retrieval. <br />Ponte and Croft (1998) “A Language Modeling Approach to Information Retrieval” SIGIR.<br /> <br />Sanderson, M. (1994) “Word sense disambiguation and information retrieval” Proceedings of the 17th ACM SIGIR Conference<br /> <br />Sanderson, M. (2000) “Retrieving with Good Sense” Information Retrieval 2(1):49-69.<br /> <br />Question Answering (Section 3.2).<br /> <br />Summarization<br />single-document vs. multi-document<br />MS Word<br />Bing<br />21-Mar-2011<br />33<br />NLP & IR<br />
    35. 35. Information Extraction<br /><ul><li>Identification of entities + relationships
    36. 36. Based on pre-defined structures, e.g.</li></ul>movements of company executives<br />victims of terrorist attacks<br />corporate mergers / acquisitions<br />interactions between genes and proteins<br />Example (from MUC-6):<br />Now, Mr. James is preparing to sail into the sunset, and Mr. Dooner is poised to rev up the engines to guide Interpublic Group's McCann-Erickson into the 21st century. Yesterday, McCann made official what had been widely anticipated: Mr. James, 57 years old, is stepping down as chief executive officer on July 1 and will retire as chairman at the end of the year. He will be succeeded by Mr. Dooner, 45.<br />21-Mar-2011<br />34<br />NLP & IR<br />
    37. 37. Information Extraction<br /><SUCCESSION_EVENT-2> :=<br /> SUCCESSION_ORG: <br /> <ORGANIZATION-1><br /> POST: "chairman"<br /> IN_AND_OUT: <IN_AND_OUT-4><br /> VACANCY_REASON: DEPART_WORKFORCE<br /> <br /><IN_AND_OUT-4> :=<br /> IO_PERSON: <PERSON-1><br /> NEW_STATUS: IN<br /> ON_THE_JOB: NO<br /> OTHER_ORG: <ORGANIZATION-1><br /> REL_OTHER_ORG: SAME_ORG<br /> <br /><ORGANIZATION-1> :=<br /> ORG_NAME: "McCann-Erickson"<br /> ORG_ALIAS: "McCann"<br /> ORG_TYPE: COMPANY<br /> <br /><PERSON-1> :=<br /> PER_NAME: "John J. Dooner Jr."<br /> PER_ALIAS: "John Dooner" "Dooner"<br />21-Mar-2011<br />35<br />NLP & IR<br />
    38. 38. Information Extraction<br />Can now use as metadata (for retrieval)<br />Or store in DB and query against it, e.g.<br />“show me all the events where a finance officer left a company to take up the position of CEO in another company”<br />21-Mar-2011<br />36<br />NLP & IR<br />
    39. 39. IE Example: Yahoo Correlator<br />21-Mar-2011<br />37<br />NLP & IR<br />
    40. 40. IE Example: Google Wonder Wheel<br />21-Mar-2011<br />38<br />NLP & IR<br />
    41. 41. IE Example: Yahoo Search Assist<br />21-Mar-2011<br />39<br />NLP & IR<br />
    42. 42. Information Extraction<br />But IE is difficult when concepts are spread across multiple sentences:<br />Pace American Group Inc. said it notified two top executives it intends to dismiss them because an internal investigation found evidence of “self-dealing” and “undisclosed financial relationships.” The executives are Don H. Pace, president and CEO; and Greg S. Kaplan, SVP and CFO.<br />Event & org in first S, but names in second S<br />Need to identify connection between “two top executives” and “the executives”.<br />21-Mar-2011<br />40<br />NLP & IR<br />
    43. 43. Information Extraction<br />Anaphora resolution relies on context (semantic / pragmatic knowledge):<br />We gave the bananas to the monkeys because they were hungry.<br />We gave the bananas to the monkeys because they were ripe.<br />We gave the bananas to the monkeys because they were here. <br />21-Mar-2011<br />41<br />NLP & IR<br />
    44. 44. WordNet<br />Semantic database for English <br />See also EuroWordNet<br />Based on synsets<br />car = automobile | railway car | elevator car | cable car<br />1stsynset = car, motorcar, machine, auto, automobile <br />Connected via relationships<br />E.g. hypernymy (a kind of)<br />separate hierarchies for nouns, verbs, adjectives and adverbs<br />21-Mar-2011<br />42<br />NLP & IR<br />
    45. 45. Word Sense Disambiguation<br />Resolve ambiguities due to polysemy<br />Often characterised by syntax, e.g.<br />Magnesium is a light metal (ADJ)<br />The light in the kitchen is quite bright (NOUN)<br />Can use a Part of Speech Tagger to resolve broad ambiguities<br />PoS tags = NN, NNP, NNS, NNPS, etc.<br />21-Mar-2011<br />43<br />NLP & IR<br />
    46. 46. Dictionary Definition of “set”<br />set<br />   /sɛt/ Show Spelled [set] Show IPA verb, set, set·ting, noun, adjective, interjection <br />–verb (used with object) 1. to put (something or someone) in a particular place: to set a vase on a table. <br />2. to place in a particular position or posture: Set thebaby on his feet. <br />3. to place in some relation to something or someone: We set a supervisor over the new workers. <br />4. to put into some condition: to set a house on fire. <br />5. to put or apply: to set fire to a house. <br />6. to put in the proper position: to set a chair back on its feet. <br />7. to put in the proper or desired order or condition for use: to set a trap. <br />8. to distribute or arrange china, silver, etc., for use on (a table): to set the table for dinner. <br />9. to place (the hair, especially when wet) on rollers, in clips, or the like, so that the hair will assume a particular style. <br />10. to put (a price or value) upon something: He set $7500 as the right amount for the car. The teacher sets a high value on neatness. <br />11. to fix the value of at a certain amount or rate; value: He set the car at $500. She sets neatness at a high value. <br />12. to post, station, or appoint for the purpose of performing some duty: to set spies on a person. <br />13. to determine or fix definitely: to set a time limit. <br />14. to resolve or decide upon: to set a wedding date. <br />15. to cause to pass into a given state or condition: to set one's mind at rest; to set a prisoner free. <br />16. to direct or settle resolutely or wishfully: to set one's mind to a task. <br />17. to present as a model; place before others as a standard: to set a good example. <br />18. to establish for others to follow: to set a fast pace. <br />19. to prescribe or assign, as a task. <br />20. to adjust (a mechanism) so as to control its performance. <br /><ul><li>21. to adjust the hands of (a clock or watch) according to a certain standard: I always set my watch by the clock in the library.
    47. 47. 22. to adjust (a timer, alarm of a clock, etc.) so as to sound when desired: He set the alarm for seven o'clock.
    48. 48. 23. to fix or mount (a gem or the like) in a frame or setting.
    49. 49. 24. to ornament or stud with gems or the like: a bracelet set with pearls.
    50. 50. 25. to cause to sit; seat: to set a child in a highchair.
    51. 51. 26. to put (a hen) on eggs to hatch them.
    52. 52. 27. to place (eggs) under a hen or in an incubator for hatching.
    53. 53. 28. to place or plant firmly: to set a flagpole in concrete.
    54. 54. 29. to put into a fixed, rigid, or settled state, as the face, muscles, etc.
    55. 55. 30. to fix at a given point or calibration: to set the dial on an oven; to set a micrometer.
    56. 56. 31. to tighten (often followed by up ): to set nuts well up.
    57. 57. 32. to cause to take a particular direction: to set one's course to the south.
    58. 58. 33. Surgery . to put (a broken or dislocated bone) back in position.
    59. 59. 34. (of a hunting dog) to indicate the position of (game) by standing stiffly and pointing with the muzzle.
    60. 60. 35. Music . a. to fit, as words to music.
    61. 61. b. to arrange for musical performance.
    62. 62. c. to arrange (music) for certain voices or instruments.
    63. 63. 36. Theater . a. to arrange the scenery, properties, lights, etc., on (a stage) for an act or scene.
    64. 64. b. to prepare (a scene) for dramatic performance.
    65. 65. 37. Nautical . to spread and secure (a sail) so as to catch the wind.
    66. 66. 38. Printing . a. to arrange (type) in the order required for printing.
    67. 67. b. to put together types corresponding to (copy); compose in type: to set an article.
    68. 68. 39. Baking . to put aside (a substance to which yeast has been added) in order that it may rise.
    69. 69. 40. to change into curd: to set milk with rennet.
    70. 70. 41. to cause (glue, mortar, or the like) to become fixed or hard.
    71. 71. 42. to urge, goad, or encourage to attack: to set the hounds on a trespasser.
    72. 72. 43. Bridge . to cause (the opposing partnership or their contract) to fall short: We set them two tricks at four spades. Only perfect defense could set four spades.
    73. 73. 44. to affix or apply, as by stamping: The king set his seal to the decree.
    74. 74. 45. to fix or engage (a fishhook) firmly into the jaws of a fish by pulling hard on the line once the fish has taken the bait.
    75. 75. 46. to sharpen or put a keen edge on (a blade, knife, razor, etc.) by honing or grinding.
    76. 76. 47. to fix the length, width, and shape of (yarn, fabric, etc.).
    77. 77. 48. Carpentry . to sink (a nail head) with a nail set.
    78. 78. 49. to bend or form to the proper shape, as a saw tooth or a spring.
    79. 79. 50. to bend the teeth of (a saw) outward from the blade alternately on both sides in order to make a cut wider than the blade itself.
    80. 80. –verb (used without object) 51. to pass below the horizon; sink: The sun sets early in winter.
    81. 81. 52. to decline; wane.
    82. 82. 53. to assume a fixed or rigid state, as the countenance or the muscles.
    83. 83. 54. (of the hair) to be placed temporarily on rollers, in clips, or the like, in order to assume a particular style: Long hair sets more easily than short hair.
    84. 84. 55. to become firm, solid, or permanent, as mortar, glue, cement, or a dye, due to drying or physical or chemical change.
    85. 85. 56. to sit on eggs to hatch them, as a hen.
    86. 86. 57. to hang or fit, as clothes.
    87. 87. 58. to begin to move; start (usually followed by forth, out, off,  etc.).
    88. 88. 59. (of a flower's ovary) to develop into a fruit.
    89. 89. 60. (of a hunting dog) to indicate the position of game. </li></ul>21-Mar-2011<br />44<br />NLP & IR<br />
    90. 90. Dictionary Definition of “set”<br />61. to have a certain direction or course, as a wind, current, or the like. <br />62. Nautical . (of a sail) to be spread so as to catch the wind. <br />63. Printing . (of type) to occupy a certain width: This copy sets to forty picas. <br />64. Nonstandard . sit: Come in and set a spell. <br />–noun 65. the act or state of setting or the state of being set. <br />66. a collection of articles designed for use together: a set of china; a chess set. <br />67. a collection, each member of which is adapted for a special use in a particular operation: a set of golf clubs; a set of carving knives. <br />68. a number, group, or combination of things of similar nature, design, or function: a set of ideas. <br />69. a series of volumes by one author, about one subject, etc. <br />70. a number, company, or group of persons associated by common interests, occupations, conventions, or status: a set of murderous thieves; the smart set. <br />71. the fit, as of an article of clothing: the set of his coat. <br />72. fixed direction, bent, or inclination: The set of his mind was obvious. <br />73. bearing or carriage: the set of one's shoulders. <br />74. the assumption of a fixed, rigid, or hard state, as by mortar or glue. <br />75. the arrangement of the hair in a particular style: How much does the beauty parlor charge for a shampoo and set? <br /><ul><li>76. a plate for holding a tool or die.
    91. 91. 77. an apparatus for receiving radio or television programs; receiver.
    92. 92. 78. Philately . a group of stamps that form a complete series.
    93. 93. 79. Tennis . a unit of a match, consisting of a group of not fewer than sixgames with a margin of at least two games between the winner and loser: He won the match in straight sets of 6–3, 6–4, 6–4.
    94. 94. 80. a construction representing a place or scene in which the action takes place in a stage, motion-picture, or television production.
    95. 95. 81. Machinery . a. the bending out of the points of alternate teeth of a saw in opposite directions.
    96. 96. b. a permanent deformation or displacement of an object or part.
    97. 97. c. a tool for giving a certain form to something, as a saw tooth.
    98. 98. 82. a chisel having a wide blade for dividing bricks.
    99. 99. 83. Horticulture . a young plant, or a slip, tuber, or the like, suitable for planting.
    100. 100. 84. Dance . a. the number of couples required to execute a quadrille or the like.
    101. 101. b. a series of movements or figures that make up a quadrille or the like.
    102. 102. 85. Music . a. a group of pieces played by a band, as in a night club, and followed by an intermission.
    103. 103. b. the period during which these pieces are played.
    104. 104. 86. Bridge . a failure to take the number of tricks specified by one's contract: Our being vulnerable made the set even more costly.
    105. 105. 87. Nautical . a. the direction of a wind, current, etc.
    106. 106. b. the form or arrangement of the sails, spars, etc., of a vessel.
    107. 107. c. suit ( def. 12 ) .
    108. 108. 88. Psychology . a temporary state of an organism characterized by a readiness to respond to certain stimuli in a specific way.
    109. 109. 89. Mining . a timber frame bracing or supporting the walls or roof of a shaft or stope.
    110. 110. 90. Carpentry . nail set.
    111. 111. 90. Carpentry . nail set.
    112. 112. 91. Mathematics . a collection of objects or elements classed together.
    113. 113. 92. Printing . the width of a body of type.
    114. 114. 93. sett ( def. 3 ) .
    115. 115. –adjective 94. fixed or prescribed beforehand: a set time; set rules.
    116. 116. 95. specified; fixed: The hall holds a set number of people.
    117. 117. 96. deliberately composed; customary: set phrases.
    118. 118. 97. fixed; rigid: a set smile.
    119. 119. 98. resolved or determined; habitually or stubbornly fixed: to be set in one's opinions.
    120. 120. 99. completely prepared; ready: Is everyone set?
    121. 121. –interjection 100. (in calling the start of a race): Ready! Set! Go! Also, get set! </li></ul>21-Mar-2011<br />45<br />NLP & IR<br />
    122. 122. Word Sense Disambiguation<br />Surely PoS tagging would help IR performance?<br />Could index docs using words AND PoS tags<br />Very sensitive to tagger performance<br />Does WSD improve IR performance?<br />Conflicting evidence …“Perfect” WSD engine may provide only marginal gain, e.g.<br />>76% accuracy needed for homonymy<br />>55% for polysemy<br />21-Mar-2011<br />46<br />NLP & IR<br />
    123. 123. NLP Applications<br />The Role of Natural Language Processing in Information Retrieval<br />21-Mar-2011<br />47<br />NLP & IR<br />
    124. 124. NLP Applications<br />Web & enterprise search<br />Text classification<br />Text summarisation<br />Machine translation<br />Human-computer interfaces (spelling correction etc.)<br />Speech recognition & synthesis<br />Natural language generation (e.g. in games)<br />Text mining<br />Question answering<br />21-Mar-2011<br />48<br />NLP & IR<br />
    125. 125. Finally becoming mainstream?<br />‘80% of corporate information is unstructured’<br />Entire value chain for some organisation (media / publishing etc.)<br />Retail / eCommerce: Product reviews <br />User generated content: blogs, forums, wikis<br />Voice of the Customer: social media + sentiment analysis<br />161 billion gigabytes of digital information in 2006<br />approximately 988 exabytes by 2010 <br />Audio / video still needs summaries & tags etc.<br />21-Mar-2011<br />49<br />NLP & IR<br />
    126. 126. Market Outlook<br />Combine component technologies<br />e.g. PoS tagging + NER, etc.<br />Healthy growth: massive expansion in social media<br />Lightweight NLP for buzz analysis, brand monitoring, etc.<br />Dominant markets:<br />Customer Experience (VoC), media/publishing, financial services & insurance, intelligence, life sciences, e-discovery<br />Solutions still not standardized<br />Need for self-service tuning & configuration<br />Partner ecosystem developing<br />Marketing services providers, platform vendors, CRM + call centre vendors, system integrators<br />21-Mar-2011<br />50<br />NLP & IR<br />
    127. 127. Text Mining<br />Analogy with Data Mining<br />Discover or infer new knowledge from unstructured text resources<br />A <-> B and B <-> C<br />Infer A <-> C?<br />e.g. link between migraine headaches and magnesium deficiency<br />Applications in life sciences, media/publishing, counter terrorism and competitive intelligence<br />21-Mar-2011<br />51<br />NLP & IR<br />
    128. 128. Text Mining<br />Levels of text mining, e.g. for bioscience:<br />Tagging entities: e.g. genes, proteins, diseases and chemical compounds<br />Information extraction: identify meaningful structural relations within the text (e.g. interactions between proteins)<br />Knowledge discovery: combine with other knowledge sources (such as external ontologies) to infer or discover new knowledge<br />21-Mar-2011<br />52<br />NLP & IR<br />
    129. 129. Question Answering<br />Going beyond the document retrieval paradigm<br />provide specific answers to specific questions<br />Introduced as a TREC track in 1999<br />21-Mar-2011<br />53<br />NLP & IR<br />
    130. 130. Question Answering<br />Yes/no questions <br />“Is George W. Bush the current president of the USA?”<br />“Is the Sea of Tranquility deep?”<br />“Wh” questions<br />“Who was the British Prime Minister before Margaret Thatcher?”<br />“When was the Battle of Hastings?”<br />List questions <br />“Which football teams have won the Champions League this decade?”<br />“Which roads lead to Rome?”<br />Instruction-based questions <br />“How do I cook lasagne?”,<br />“What is the best way to build a bridge?”<br />Explanation questions <br />“Why did World War I start?”<br />“How does a computer process floating point numbers?”<br />Commands <br />“Tell me the height of the Eiffel Tower.”<br />“Name all the Kings of England”<br />21-Mar-2011<br />54<br />NLP & IR<br />
    131. 131. Question Answering<br />Common use cases handled by major search engines<br />Google, Yahoo!, Bing, Baidu, ...<br />Specialist QA systems<br />Wolfram Alpha:<br />True Knowledge:<br />Powerset:<br />IBM Watson<br />21-Mar-2011<br />55<br />NLP & IR<br />
    132. 132. Question Answering<br />Basic approach:<br />question analysis<br />Predict type of answer expected<br />Create query<br />document retrieval<br />answer extraction<br />Based on output of (1)<br />Varies from regex to deep linguistic analysis<br />21-Mar-2011<br />56<br />NLP & IR<br />
    133. 133. Evaluation<br />How do we measure NLP accuracy?<br />Compare against human annotation<br />However:<br />Humans often disagree<br />Particularly for complex tasks<br />Comparison can be measured in many ways<br />“Bill Gates is chairman of Microsoft” -> “Gates” = person?<br />21-Mar-2011<br />57<br />NLP & IR<br />
    134. 134. Conclusions<br />The Role of Natural Language Processing in Information Retrieval<br />21-Mar-2011<br />58<br />NLP & IR<br />
    135. 135. Conclusions<br />NLP’s contribution to IR not always apparent<br />Overemphasis on precision & recall?<br />NLP provides benefits that go beyond quantitative metrics to support broader tasks & richer experiences<br />Insight, analysis, visualisation, etc.<br />NLP required for text mining, question answering & cross-language IR<br />Bag of words model is insufficient<br />Web becoming increasingly multi-lingual<br />21-Mar-2011<br />59<br />NLP & IR<br />
    136. 136. State of the Art<br />Commodity technologies:<br />Tokenisation, normalization, regular expression search<br />Mainstream, but room for improvement:<br />Lemmatization, spelling correction, speech synthesis<br />Mainstream but needs improvement:<br />PoS tagging, term extraction, summarization, speech recognition, text categorization, spoken dialogue systems<br />Almost there:<br />Information extraction, NL generation, simple speech translation<br />Don’t hold your breath:<br />Full machine translation, advanced dialogue systems<br />21-Mar-2011<br />60<br />NLP & IR<br />
    137. 137. Platforms & Toolkits<br />Many commercialvendors<br />Open source/access:<br />GATE (Sheffield University)<br />Open source, linguistic focus<br />RapidMiner<br />OpenNLP<br />OpenCalais<br />LingPipe<br />nltk<br />21-Mar-2011<br />61<br />NLP & IR<br />
    138. 138. Further Reading<br />A. Goker & J. Davies, Information Retrieval: Searching in the 21st Century. Wiley-Blackwell, 2009<br />D. Jurafsky and J. Martin, Speech and Language Processing. Prentice-Hall, 2nd edition, 2008.<br />Manning, C. and Schutze, H. Foundations of Statistical Language Processing. MIT Press, 1999. <br />21-Mar-2011<br />62<br />NLP & IR<br />
    139. 139. Thank you <br /> Tony Russell-Rose, PhD<br /> Vice-chair, BCS IRSG<br /> Chair, IEHF HCI Group<br />Email:<br />Blog:<br />LinkedIn:<br />Twitter: @tonygrr<br />