Your SlideShare is downloading. ×
The Role of Natural Language Processing in Information Retrieval
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

The Role of Natural Language Processing in Information Retrieval

10,467

Published on

The Role of Natural Language Processing in Information Retrieval: Searching for Meaning in Text

The Role of Natural Language Processing in Information Retrieval: Searching for Meaning in Text

Published in: Technology, Education
1 Comment
51 Likes
Statistics
Notes
  • I just posted a whitepaper on SlideShare describing the differences between lemmatization and stemming which may be of interest:

    http://www.slideshare.net/andrew_paulsen/high-quality-search-european-languages
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
No Downloads
Views
Total Views
10,467
On Slideshare
0
From Embeds
0
Number of Embeds
3
Actions
Shares
0
Downloads
31
Comments
1
Likes
51
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide
  • Exercise: experiment with the 3 major search engines (Google, Bing, Yahoo) to see how they handle stop words. Try queries that are dominated by stop words, with and without quotes. How do they differ?
  • Exercise: experiment with the 3 major search engines (Google, Bing, Yahoo) to see how they handle stemming. Try queries that are dominated by inflected forms, with and without quotes. How do they differ?
  • Do as class exercise
  • Demo: ELIZA http://nlp-addiction.com/eliza/
  • Demo Google suggest
  • Demo Auto-correct & DYM
  • Demo: try Google Translate (or Babelfish) with sentences containing ambiguous terms or NEs, e.g. “I live in a new house in new york”. Use the speech synthesizerhttp://translate.google.com/#en|de|I%20live%20in%20a%20new%20house%20in%20new%20york
  • Demo Yahoo search assist + “explore concepts”
  • Paired exercise: what is the most polysemous English word you can think of? How many senses do you think it has in an average dictionary? Think of some polysemous words (examples include “ball”, “bank”, “port” but there are many others). Use these as queries in a search engine and comment on what you notice about the results.
  • Paired Exercise: try each of above queries in Google, Yahoo, Bing. Now try to answer the same questions with a normal web search (by entering a set of query terms instead of a natural language question). Are the results of this search any better or worse?
  • Paired exercise: try previous queries on these sites
  • Transcript

    • 1. The Role of Natural Language Processing in Information RetrievalSearching for Meaning in Text
      Tony Russell-Rose, PhD
      21-Mar-2011
    • 2. Contents
      Introduction & Overview
      Background, Terminology
      NLP Fundamentals
      Paradigms & Perspectives
      NLP Tools & Techniques
      NLP Applications
      Text mining, Question Answering, etc.
      Conclusions & Further Reading
      21-Mar-2011
      2
      NLP & IR
    • 3. Introduction & Overview
      The Role of Natural Language Processing in Information Retrieval
      21-Mar-2011
      3
      NLP & IR
    • 4. Information Retrieval
      Models, techniques & frameworks for satisfying an information need
      Common paradigm: document retrieval
      21-Mar-2011
      4
      NLP & IR
    • 5. Document Retrieval
      Methods include Boolean, vector space, probabilistic
      Rely on index terms
      “bag of words”
      Stop lists + stemming
      But text is “unstructured”
      Information may be “hidden”
      21-Mar-2011
      5
      NLP & IR
    • 6. NLP – a solved problem?
      As humans we do it effortlessly ... don’t we?
      DRUNK GETS NINE YEARS IN VIOLIN CASE
      PROSTITUTES APPEAL TO POPE
      STOLEN PAINTING FOUND BY TREE
      RED TAPE HOLDS UP NEW BRIDGE
      DEER KILL 300,000
      RESIDENTS CAN DROP OFF TREES
      INCLUDE CHILDREN WHEN BAKING COOKIES
      MINERS REFUSE TO WORK AFTER DEATH  
      21-Mar-2011
      6
      NLP & IR
    • 7. Problems with Text (1)
      Polysemy
      One word maps to many concepts
      e.g. bat
      Synonymy
      One concept maps to many words
      21-Mar-2011
      7
      NLP & IR
    • 8. Problems with Text (2)
      Word order
      Venetian blind vs. Blind venetian
      21-Mar-2011
      8
      NLP & IR
    • 9. Problems with Text (3)
      Language is generative
      Starbucks coffee is the best
      The place I like most when I need to feed my caffeine addiction is the company from Seattle with branches everywhere
      Many different ways to express a given idea
      Synonymy, paraphrase, metaphor, etc.
      21-Mar-2011
      9
      NLP & IR
    • 10. Problems with Text (4)
      • Frege's principle:
      • 11. The meaning of a sentence is completely determined by the meaning of its symbols and the syntax used to combine them
      Language is a form of communication
      All communication has a context:
      time and place of utterance, the writer, the reader, their backgroundknowledge, intentions, assumptions re the reader’s knowledge/intentions, etc.
      21-Mar-2011
      10
      NLP & IR
    • 12. Problems with Text (5)
      Language is changing
      I want to buy a mobile
      Ill-formed input
      “accomodation office”
      Co-ordination, negation, etc.
      This is not a talk about neuro-linguistic programming
      Multi-linguality
      Claudia Schiffer is on the cover of Elle
      Also sarcasm, irony, slang, jargon, etc.
      That was a wicked lecture
      Yep – the coffee break was the best part
      21-Mar-2011
      11
      NLP & IR
    • 13. Enter NLP / Text Analytics…
      Text Analytics: a set of linguistic, analytical and predictive techniques to extract structure and meaning from unstructured documents
      NLP: the academic term for Text Analytics
      analogous to “search” vs. “IR”
      Text Analytics ~= Natural Language Processing ~= Text Mining
      Text Mining -> Scientific / technical context, automated processing
      Text Analytics -> Business context, interactive apps
      21-Mar-2011
      12
      NLP & IR
    • 14. NLP Fundamentals
      The Role of Natural Language Processing in Information Retrieval
      21-Mar-2011
      13
      NLP & IR
    • 15. NLP Perspectives
      Computational Linguistics
      Use of computational techniques to study linguistic phenomena
      Cognitive science
      Study of human information processing (perception, language, reasoning, etc.)
      Computer science
      Theoretical foundations of computation and practical techniques for implementation
      Information science
      Analysis, classification, manipulation, retrieval and dissemination of information
      21-Mar-2011
      14
      NLP & IR
    • 16. NLP Paradigms
      Symbolic approaches
      Rule-based, hand coded (by linguists)
      Knowledge-intensive
      Statistical approaches
      Shallow statistical models, trained on annotated corpora
      Data-intensive
      21-Mar-2011
      15
      NLP & IR
    • 17. NLP Fundamentals (word level)
      Language is AMBIGUOUS
      To find structure, we must remove ambiguity!
      Lexical analysis (tokenisation)
      The cat sat on the mat
      I can’t tokenise this sentence
      Stop word removal
      No definitive list
      The Who, The The, Take That…
      To be or not to be
      21-Mar-2011
      16
      NLP & IR
    • 18. NLP Fundamentals (word level)
      Stemming
      fishing, fished, fish, fisher -> fish
      Lemmatization
      Stemming + context
      Passing -> pass + ING
      Were -> be + PAST
      Delegate = de-leg-ate (?)
      Ratify = rat-ify (?)
      Morphology (prefixes, suffixes, etc.)
      Gebäudereinigungsfirmenangestellter -> Gebäude + Reinigung + Firma + Angestellter (building + cleaning + company + employee)
      21-Mar-2011
      17
      NLP & IR
    • 19. NLP Fundamentals (sentence level)
      Syntax (part of speech tagging)
      book -> NOUN, VERB
      that -> DETERMINER
      flight -> NOUN
      Book that flight -> VB DT NN
      Ambiguity problem
      Time flies like an arrow -> ?
      Fruit flies like a banana -> ?
      Eats shoots and leaves -> ?
      21-Mar-2011
      18
      NLP & IR
    • 20. NLP Fundamentals (sentence level)
      Parsing (grammar)
      I saw a venetian blind
      I saw a blind venetian
      I saw the man on the hill with a telescope
      Rugby is a game played by men with odd-shaped balls
      Sentence boundary detection
      Punctuation denotes the end of a sentence!
      “But not always!”, said Fred...
      21-Mar-2011
      19
      NLP & IR
    • 21. NLP Tools & Techniques
      The Role of Natural Language Processing in Information Retrieval
      21-Mar-2011
      20
      NLP & IR
    • 22. Early NLP Systems
      ELIZA
      Wiezenbaum 1966
      Simple pattern matching
      PARRY
      Colby et al 1971
      Pattern matching & planning
      SHRDLU
      Winograd 1972
      Natural language understanding
      Comprehensive grammar of English
      21-Mar-2011
      21
      NLP & IR
    • 23. Word Prediction
      Assistive technologies
      Aurora Systems, TextHelp
      Google, Bing, Yahoo query suggestions
      21-Mar-2011
      22
      NLP & IR
    • 24. Spelling Correction
      Auto-correct
      Did You Mean
      21-Mar-2011
      23
      NLP & IR
    • 25. Text Categorization
      News agencies:
      classify incoming news stories
      Search engines:
      classify queries, e.g. search Google for ‘LOG313’
      Identifying spam emails, e.g.
      http://www.paulgraham.com/spam.html
      Routing email or documents to appropriate people
      21-Mar-2011
      24
      NLP & IR
    • 26. Terminology Extraction
      Differentiate between useful index terms and ‘noise’
      Help lexicographers identify new terminology:
      the C4 carbons of the nicotinamide rings
      X-ray therapy vs. X-ray therapies
      PAHO vs. Pan-American Health Organization
      Term extraction systems process scientific papers to identify terminology, possibly comparing it with a known list
      21-Mar-2011
      25
      NLP & IR
    • 27. Speech Recognition
      Spoken Dialogue Systems
      Directory enquiries (AT&T)
      Railways (Germany, Switzerland, Italy)
      Airlines (USA)
      iPhone Voice Search
      IBM Watson
      21-Mar-2011
      26
      NLP & IR
    • 28. Named Entity Recognition
      Identification of key concepts,
      e.g. people, places, organisations, etc.
      Also postcodes, temporal/numerical expressions, etc.
      “Mexico has been trying to stage a recovery since the beginning of this year and it's always been getting ahead of itself in terms of fundamentals,” said Matthew Hickman of Lehman Brothers in New York.”
      21-Mar-2011
      27
      NLP & IR
    • 29. Named Entity Recognition: Applications
      Increase precision of IR
      New companies in York vs. Companies in New York
      Support navigation
      SMART tags, Apture, etc.
      Improve machine translation:
      “I live in a new house in new york”
      Speech synthesis, auto-summarisation, etc.
      21-Mar-2011
      28
      NLP & IR
    • 30. Named Entity Recognition
      Systems usually developed for specific domain
      Are typically language and domain-specific
      Accuracy > 90% for newswire text
      Best system at MUC-7 scored 93.39% f-measure
      human annotators scored 97.60% and 96.95%.
      Lower for certain domains, e.g. life sciences (species, substances, proteins, genes, etc.)
      extended naming conventions, extensive use of conjunction and disjunction, high occurrence of neologisms and abbreviations, etc.
      21-Mar-2011
      29
      NLP & IR
    • 31. NER (Newssift Example)
      21-Mar-2011
      30
      NLP & IR
    • 32. Machine Translation (before)
      21-Mar-2011
      31
      NLP & IR
    • 33. Machine Translation (after)
      21-Mar-2011
      32
      NLP & IR
    • 34. Summarisation
      Section 3 discusses two information access applications (text mining and question answering) closely associated with NLP.
      NLP Techniques
      Named entity recognition
      Information Extraction
      Current document retrieval technologies could not identify information as specific as this within text.
      Word Sense Disambiguation
      Text Mining
      Question Answering
      “Wh” questions
      List questions
      Examples include those mentioned here: text mining, question answering and cross-language information retrieval.
      Ponte and Croft (1998) “A Language Modeling Approach to Information Retrieval” SIGIR.
       
      Sanderson, M. (1994) “Word sense disambiguation and information retrieval” Proceedings of the 17th ACM SIGIR Conference
       
      Sanderson, M. (2000) “Retrieving with Good Sense” Information Retrieval 2(1):49-69.
       
      Question Answering (Section 3.2).
       
      Summarization
      single-document vs. multi-document
      MS Word
      Bing
      21-Mar-2011
      33
      NLP & IR
    • 35. Information Extraction
      • Identification of entities + relationships
      • 36. Based on pre-defined structures, e.g.
      movements of company executives
      victims of terrorist attacks
      corporate mergers / acquisitions
      interactions between genes and proteins
      Example (from MUC-6):
      Now, Mr. James is preparing to sail into the sunset, and Mr. Dooner is poised to rev up the engines to guide Interpublic Group's McCann-Erickson into the 21st century. Yesterday, McCann made official what had been widely anticipated: Mr. James, 57 years old, is stepping down as chief executive officer on July 1 and will retire as chairman at the end of the year. He will be succeeded by Mr. Dooner, 45.
      21-Mar-2011
      34
      NLP & IR
    • 37. Information Extraction
      <SUCCESSION_EVENT-2> :=
      SUCCESSION_ORG:
      <ORGANIZATION-1>
      POST: "chairman"
      IN_AND_OUT: <IN_AND_OUT-4>
      VACANCY_REASON: DEPART_WORKFORCE
       
      <IN_AND_OUT-4> :=
      IO_PERSON: <PERSON-1>
      NEW_STATUS: IN
      ON_THE_JOB: NO
      OTHER_ORG: <ORGANIZATION-1>
      REL_OTHER_ORG: SAME_ORG
       
      <ORGANIZATION-1> :=
      ORG_NAME: "McCann-Erickson"
      ORG_ALIAS: "McCann"
      ORG_TYPE: COMPANY
       
      <PERSON-1> :=
      PER_NAME: "John J. Dooner Jr."
      PER_ALIAS: "John Dooner" "Dooner"
      21-Mar-2011
      35
      NLP & IR
    • 38. Information Extraction
      Can now use as metadata (for retrieval)
      Or store in DB and query against it, e.g.
      “show me all the events where a finance officer left a company to take up the position of CEO in another company”
      21-Mar-2011
      36
      NLP & IR
    • 39. IE Example: Yahoo Correlator
      21-Mar-2011
      37
      NLP & IR
    • 40. IE Example: Google Wonder Wheel
      21-Mar-2011
      38
      NLP & IR
    • 41. IE Example: Yahoo Search Assist
      21-Mar-2011
      39
      NLP & IR
    • 42. Information Extraction
      But IE is difficult when concepts are spread across multiple sentences:
      Pace American Group Inc. said it notified two top executives it intends to dismiss them because an internal investigation found evidence of “self-dealing” and “undisclosed financial relationships.” The executives are Don H. Pace, president and CEO; and Greg S. Kaplan, SVP and CFO.
      Event & org in first S, but names in second S
      Need to identify connection between “two top executives” and “the executives”.
      21-Mar-2011
      40
      NLP & IR
    • 43. Information Extraction
      Anaphora resolution relies on context (semantic / pragmatic knowledge):
      We gave the bananas to the monkeys because they were hungry.
      We gave the bananas to the monkeys because they were ripe.
      We gave the bananas to the monkeys because they were here.
      21-Mar-2011
      41
      NLP & IR
    • 44. WordNet
      Semantic database for English
      See also EuroWordNet
      Based on synsets
      car = automobile | railway car | elevator car | cable car
      1stsynset = car, motorcar, machine, auto, automobile
      Connected via relationships
      E.g. hypernymy (a kind of)
      separate hierarchies for nouns, verbs, adjectives and adverbs
      21-Mar-2011
      42
      NLP & IR
    • 45. Word Sense Disambiguation
      Resolve ambiguities due to polysemy
      Often characterised by syntax, e.g.
      Magnesium is a light metal (ADJ)
      The light in the kitchen is quite bright (NOUN)
      Can use a Part of Speech Tagger to resolve broad ambiguities
      PoS tags = NN, NNP, NNS, NNPS, etc.
      21-Mar-2011
      43
      NLP & IR
    • 46. Dictionary Definition of “set”
      set
         /sɛt/ Show Spelled [set] Show IPA verb, set, set·ting, noun, adjective, interjection
      –verb (used with object) 1. to put (something or someone) in a particular place: to set a vase on a table.
      2. to place in a particular position or posture: Set thebaby on his feet.
      3. to place in some relation to something or someone: We set a supervisor over the new workers.
      4. to put into some condition: to set a house on fire.
      5. to put or apply: to set fire to a house.
      6. to put in the proper position: to set a chair back on its feet.
      7. to put in the proper or desired order or condition for use: to set a trap.
      8. to distribute or arrange china, silver, etc., for use on (a table): to set the table for dinner.
      9. to place (the hair, especially when wet) on rollers, in clips, or the like, so that the hair will assume a particular style.
      10. to put (a price or value) upon something: He set $7500 as the right amount for the car. The teacher sets a high value on neatness.
      11. to fix the value of at a certain amount or rate; value: He set the car at $500. She sets neatness at a high value.
      12. to post, station, or appoint for the purpose of performing some duty: to set spies on a person.
      13. to determine or fix definitely: to set a time limit.
      14. to resolve or decide upon: to set a wedding date.
      15. to cause to pass into a given state or condition: to set one's mind at rest; to set a prisoner free.
      16. to direct or settle resolutely or wishfully: to set one's mind to a task.
      17. to present as a model; place before others as a standard: to set a good example.
      18. to establish for others to follow: to set a fast pace.
      19. to prescribe or assign, as a task.
      20. to adjust (a mechanism) so as to control its performance.
      • 21. to adjust the hands of (a clock or watch) according to a certain standard: I always set my watch by the clock in the library.
      • 47. 22. to adjust (a timer, alarm of a clock, etc.) so as to sound when desired: He set the alarm for seven o'clock.
      • 48. 23. to fix or mount (a gem or the like) in a frame or setting.
      • 49. 24. to ornament or stud with gems or the like: a bracelet set with pearls.
      • 50. 25. to cause to sit; seat: to set a child in a highchair.
      • 51. 26. to put (a hen) on eggs to hatch them.
      • 52. 27. to place (eggs) under a hen or in an incubator for hatching.
      • 53. 28. to place or plant firmly: to set a flagpole in concrete.
      • 54. 29. to put into a fixed, rigid, or settled state, as the face, muscles, etc.
      • 55. 30. to fix at a given point or calibration: to set the dial on an oven; to set a micrometer.
      • 56. 31. to tighten (often followed by up ): to set nuts well up.
      • 57. 32. to cause to take a particular direction: to set one's course to the south.
      • 58. 33. Surgery . to put (a broken or dislocated bone) back in position.
      • 59. 34. (of a hunting dog) to indicate the position of (game) by standing stiffly and pointing with the muzzle.
      • 60. 35. Music . a. to fit, as words to music.
      • 61. b. to arrange for musical performance.
      • 62. c. to arrange (music) for certain voices or instruments.
      • 63. 36. Theater . a. to arrange the scenery, properties, lights, etc., on (a stage) for an act or scene.
      • 64. b. to prepare (a scene) for dramatic performance.
      • 65. 37. Nautical . to spread and secure (a sail) so as to catch the wind.
      • 66. 38. Printing . a. to arrange (type) in the order required for printing.
      • 67. b. to put together types corresponding to (copy); compose in type: to set an article.
      • 68. 39. Baking . to put aside (a substance to which yeast has been added) in order that it may rise.
      • 69. 40. to change into curd: to set milk with rennet.
      • 70. 41. to cause (glue, mortar, or the like) to become fixed or hard.
      • 71. 42. to urge, goad, or encourage to attack: to set the hounds on a trespasser.
      • 72. 43. Bridge . to cause (the opposing partnership or their contract) to fall short: We set them two tricks at four spades. Only perfect defense could set four spades.
      • 73. 44. to affix or apply, as by stamping: The king set his seal to the decree.
      • 74. 45. to fix or engage (a fishhook) firmly into the jaws of a fish by pulling hard on the line once the fish has taken the bait.
      • 75. 46. to sharpen or put a keen edge on (a blade, knife, razor, etc.) by honing or grinding.
      • 76. 47. to fix the length, width, and shape of (yarn, fabric, etc.).
      • 77. 48. Carpentry . to sink (a nail head) with a nail set.
      • 78. 49. to bend or form to the proper shape, as a saw tooth or a spring.
      • 79. 50. to bend the teeth of (a saw) outward from the blade alternately on both sides in order to make a cut wider than the blade itself.
      • 80. –verb (used without object) 51. to pass below the horizon; sink: The sun sets early in winter.
      • 81. 52. to decline; wane.
      • 82. 53. to assume a fixed or rigid state, as the countenance or the muscles.
      • 83. 54. (of the hair) to be placed temporarily on rollers, in clips, or the like, in order to assume a particular style: Long hair sets more easily than short hair.
      • 84. 55. to become firm, solid, or permanent, as mortar, glue, cement, or a dye, due to drying or physical or chemical change.
      • 85. 56. to sit on eggs to hatch them, as a hen.
      • 86. 57. to hang or fit, as clothes.
      • 87. 58. to begin to move; start (usually followed by forth, out, off,  etc.).
      • 88. 59. (of a flower's ovary) to develop into a fruit.
      • 89. 60. (of a hunting dog) to indicate the position of game.
      21-Mar-2011
      44
      NLP & IR
    • 90. Dictionary Definition of “set”
      61. to have a certain direction or course, as a wind, current, or the like.
      62. Nautical . (of a sail) to be spread so as to catch the wind.
      63. Printing . (of type) to occupy a certain width: This copy sets to forty picas.
      64. Nonstandard . sit: Come in and set a spell.
      –noun 65. the act or state of setting or the state of being set.
      66. a collection of articles designed for use together: a set of china; a chess set.
      67. a collection, each member of which is adapted for a special use in a particular operation: a set of golf clubs; a set of carving knives.
      68. a number, group, or combination of things of similar nature, design, or function: a set of ideas.
      69. a series of volumes by one author, about one subject, etc.
      70. a number, company, or group of persons associated by common interests, occupations, conventions, or status: a set of murderous thieves; the smart set.
      71. the fit, as of an article of clothing: the set of his coat.
      72. fixed direction, bent, or inclination: The set of his mind was obvious.
      73. bearing or carriage: the set of one's shoulders.
      74. the assumption of a fixed, rigid, or hard state, as by mortar or glue.
      75. the arrangement of the hair in a particular style: How much does the beauty parlor charge for a shampoo and set?
      • 76. a plate for holding a tool or die.
      • 91. 77. an apparatus for receiving radio or television programs; receiver.
      • 92. 78. Philately . a group of stamps that form a complete series.
      • 93. 79. Tennis . a unit of a match, consisting of a group of not fewer than sixgames with a margin of at least two games between the winner and loser: He won the match in straight sets of 6–3, 6–4, 6–4.
      • 94. 80. a construction representing a place or scene in which the action takes place in a stage, motion-picture, or television production.
      • 95. 81. Machinery . a. the bending out of the points of alternate teeth of a saw in opposite directions.
      • 96. b. a permanent deformation or displacement of an object or part.
      • 97. c. a tool for giving a certain form to something, as a saw tooth.
      • 98. 82. a chisel having a wide blade for dividing bricks.
      • 99. 83. Horticulture . a young plant, or a slip, tuber, or the like, suitable for planting.
      • 100. 84. Dance . a. the number of couples required to execute a quadrille or the like.
      • 101. b. a series of movements or figures that make up a quadrille or the like.
      • 102. 85. Music . a. a group of pieces played by a band, as in a night club, and followed by an intermission.
      • 103. b. the period during which these pieces are played.
      • 104. 86. Bridge . a failure to take the number of tricks specified by one's contract: Our being vulnerable made the set even more costly.
      • 105. 87. Nautical . a. the direction of a wind, current, etc.
      • 106. b. the form or arrangement of the sails, spars, etc., of a vessel.
      • 107. c. suit ( def. 12 ) .
      • 108. 88. Psychology . a temporary state of an organism characterized by a readiness to respond to certain stimuli in a specific way.
      • 109. 89. Mining . a timber frame bracing or supporting the walls or roof of a shaft or stope.
      • 110. 90. Carpentry . nail set.
      • 111. 90. Carpentry . nail set.
      • 112. 91. Mathematics . a collection of objects or elements classed together.
      • 113. 92. Printing . the width of a body of type.
      • 114. 93. sett ( def. 3 ) .
      • 115. –adjective 94. fixed or prescribed beforehand: a set time; set rules.
      • 116. 95. specified; fixed: The hall holds a set number of people.
      • 117. 96. deliberately composed; customary: set phrases.
      • 118. 97. fixed; rigid: a set smile.
      • 119. 98. resolved or determined; habitually or stubbornly fixed: to be set in one's opinions.
      • 120. 99. completely prepared; ready: Is everyone set?
      • 121. –interjection 100. (in calling the start of a race): Ready! Set! Go! Also, get set!
      21-Mar-2011
      45
      NLP & IR
    • 122. Word Sense Disambiguation
      Surely PoS tagging would help IR performance?
      Could index docs using words AND PoS tags
      Very sensitive to tagger performance
      Does WSD improve IR performance?
      Conflicting evidence …“Perfect” WSD engine may provide only marginal gain, e.g.
      >76% accuracy needed for homonymy
      >55% for polysemy
      21-Mar-2011
      46
      NLP & IR
    • 123. NLP Applications
      The Role of Natural Language Processing in Information Retrieval
      21-Mar-2011
      47
      NLP & IR
    • 124. NLP Applications
      Web & enterprise search
      Text classification
      Text summarisation
      Machine translation
      Human-computer interfaces (spelling correction etc.)
      Speech recognition & synthesis
      Natural language generation (e.g. in games)
      Text mining
      Question answering
      21-Mar-2011
      48
      NLP & IR
    • 125. Finally becoming mainstream?
      ‘80% of corporate information is unstructured’
      Entire value chain for some organisation (media / publishing etc.)
      Retail / eCommerce: Product reviews
      User generated content: blogs, forums, wikis
      Voice of the Customer: social media + sentiment analysis
      161 billion gigabytes of digital information in 2006
      approximately 988 exabytes by 2010
      Audio / video still needs summaries & tags etc.
      21-Mar-2011
      49
      NLP & IR
    • 126. Market Outlook
      Combine component technologies
      e.g. PoS tagging + NER, etc.
      Healthy growth: massive expansion in social media
      Lightweight NLP for buzz analysis, brand monitoring, etc.
      Dominant markets:
      Customer Experience (VoC), media/publishing, financial services & insurance, intelligence, life sciences, e-discovery
      Solutions still not standardized
      Need for self-service tuning & configuration
      Partner ecosystem developing
      Marketing services providers, platform vendors, CRM + call centre vendors, system integrators
      21-Mar-2011
      50
      NLP & IR
    • 127. Text Mining
      Analogy with Data Mining
      Discover or infer new knowledge from unstructured text resources
      A <-> B and B <-> C
      Infer A <-> C?
      e.g. link between migraine headaches and magnesium deficiency
      Applications in life sciences, media/publishing, counter terrorism and competitive intelligence
      21-Mar-2011
      51
      NLP & IR
    • 128. Text Mining
      Levels of text mining, e.g. for bioscience:
      Tagging entities: e.g. genes, proteins, diseases and chemical compounds
      Information extraction: identify meaningful structural relations within the text (e.g. interactions between proteins)
      Knowledge discovery: combine with other knowledge sources (such as external ontologies) to infer or discover new knowledge
      21-Mar-2011
      52
      NLP & IR
    • 129. Question Answering
      Going beyond the document retrieval paradigm
      provide specific answers to specific questions
      Introduced as a TREC track in 1999
      21-Mar-2011
      53
      NLP & IR
    • 130. Question Answering
      Yes/no questions
      “Is George W. Bush the current president of the USA?”
      “Is the Sea of Tranquility deep?”
      “Wh” questions
      “Who was the British Prime Minister before Margaret Thatcher?”
      “When was the Battle of Hastings?”
      List questions
      “Which football teams have won the Champions League this decade?”
      “Which roads lead to Rome?”
      Instruction-based questions
      “How do I cook lasagne?”,
      “What is the best way to build a bridge?”
      Explanation questions
      “Why did World War I start?”
      “How does a computer process floating point numbers?”
      Commands
      “Tell me the height of the Eiffel Tower.”
      “Name all the Kings of England”
      21-Mar-2011
      54
      NLP & IR
    • 131. Question Answering
      Common use cases handled by major search engines
      Google, Yahoo!, Bing, Baidu, ...
      Specialist QA systems
      Wolfram Alpha: http://www.wolframalpha.com
      True Knowledge: http://www.trueknowledge.com
      Powerset: http://www.powerset.com/
      IBM Watson
      21-Mar-2011
      55
      NLP & IR
    • 132. Question Answering
      Basic approach:
      question analysis
      Predict type of answer expected
      Create query
      document retrieval
      answer extraction
      Based on output of (1)
      Varies from regex to deep linguistic analysis
      21-Mar-2011
      56
      NLP & IR
    • 133. Evaluation
      How do we measure NLP accuracy?
      Compare against human annotation
      However:
      Humans often disagree
      Particularly for complex tasks
      Comparison can be measured in many ways
      “Bill Gates is chairman of Microsoft” -> “Gates” = person?
      21-Mar-2011
      57
      NLP & IR
    • 134. Conclusions
      The Role of Natural Language Processing in Information Retrieval
      21-Mar-2011
      58
      NLP & IR
    • 135. Conclusions
      NLP’s contribution to IR not always apparent
      Overemphasis on precision & recall?
      NLP provides benefits that go beyond quantitative metrics to support broader tasks & richer experiences
      Insight, analysis, visualisation, etc.
      NLP required for text mining, question answering & cross-language IR
      Bag of words model is insufficient
      Web becoming increasingly multi-lingual
      21-Mar-2011
      59
      NLP & IR
    • 136. State of the Art
      Commodity technologies:
      Tokenisation, normalization, regular expression search
      Mainstream, but room for improvement:
      Lemmatization, spelling correction, speech synthesis
      Mainstream but needs improvement:
      PoS tagging, term extraction, summarization, speech recognition, text categorization, spoken dialogue systems
      Almost there:
      Information extraction, NL generation, simple speech translation
      Don’t hold your breath:
      Full machine translation, advanced dialogue systems
      21-Mar-2011
      60
      NLP & IR
    • 137. Platforms & Toolkits
      Many commercialvendors
      Open source/access:
      GATE (Sheffield University)
      Open source, linguistic focus
      RapidMiner
      OpenNLP
      OpenCalais
      LingPipe
      nltk
      21-Mar-2011
      61
      NLP & IR
    • 138. Further Reading
      A. Goker & J. Davies, Information Retrieval: Searching in the 21st Century. Wiley-Blackwell, 2009
      D. Jurafsky and J. Martin, Speech and Language Processing. Prentice-Hall, 2nd edition, 2008.
      Manning, C. and Schutze, H. Foundations of Statistical Language Processing. MIT Press, 1999.
      21-Mar-2011
      62
      NLP & IR
    • 139. Thank you
      Tony Russell-Rose, PhD
      Vice-chair, BCS IRSG
      Chair, IEHF HCI Group
      Email: tgr@uxlabs.co.uk
      Blog: http://isquared.wordpress.com
      LinkedIn: http://www.linkedin.com/tonyrussellrose/
      Twitter: @tonygrr

    ×