• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content
The Role of Natural Language Processing in Information Retrieval
 

The Role of Natural Language Processing in Information Retrieval

on

  • 8,454 views

The Role of Natural Language Processing in Information Retrieval: Searching for Meaning in Text

The Role of Natural Language Processing in Information Retrieval: Searching for Meaning in Text

Statistics

Views

Total Views
8,454
Views on SlideShare
7,586
Embed Views
868

Actions

Likes
38
Downloads
31
Comments
1

6 Embeds 868

http://isquared.wordpress.com 842
url_unknown 10
http://www.linkedin.com 7
http://webcache.googleusercontent.com 5
http://translate.googleusercontent.com 3
http://www.twylah.com 1

Accessibility

Categories

Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel

11 of 1 previous next

  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
  • I just posted a whitepaper on SlideShare describing the differences between lemmatization and stemming which may be of interest:

    http://www.slideshare.net/andrew_paulsen/high-quality-search-european-languages
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment
  • Exercise: experiment with the 3 major search engines (Google, Bing, Yahoo) to see how they handle stop words. Try queries that are dominated by stop words, with and without quotes. How do they differ?
  • Exercise: experiment with the 3 major search engines (Google, Bing, Yahoo) to see how they handle stemming. Try queries that are dominated by inflected forms, with and without quotes. How do they differ?
  • Do as class exercise
  • Demo: ELIZA http://nlp-addiction.com/eliza/
  • Demo Google suggest
  • Demo Auto-correct & DYM
  • Demo: try Google Translate (or Babelfish) with sentences containing ambiguous terms or NEs, e.g. “I live in a new house in new york”. Use the speech synthesizerhttp://translate.google.com/#en|de|I%20live%20in%20a%20new%20house%20in%20new%20york
  • Demo Yahoo search assist + “explore concepts”
  • Paired exercise: what is the most polysemous English word you can think of? How many senses do you think it has in an average dictionary? Think of some polysemous words (examples include “ball”, “bank”, “port” but there are many others). Use these as queries in a search engine and comment on what you notice about the results.
  • Paired Exercise: try each of above queries in Google, Yahoo, Bing. Now try to answer the same questions with a normal web search (by entering a set of query terms instead of a natural language question). Are the results of this search any better or worse?
  • Paired exercise: try previous queries on these sites

The Role of Natural Language Processing in Information Retrieval The Role of Natural Language Processing in Information Retrieval Presentation Transcript

  • The Role of Natural Language Processing in Information RetrievalSearching for Meaning in Text
    Tony Russell-Rose, PhD
    21-Mar-2011
  • Contents
    Introduction & Overview
    Background, Terminology
    NLP Fundamentals
    Paradigms & Perspectives
    NLP Tools & Techniques
    NLP Applications
    Text mining, Question Answering, etc.
    Conclusions & Further Reading
    21-Mar-2011
    2
    NLP & IR
  • Introduction & Overview
    The Role of Natural Language Processing in Information Retrieval
    21-Mar-2011
    3
    NLP & IR
  • Information Retrieval
    Models, techniques & frameworks for satisfying an information need
    Common paradigm: document retrieval
    21-Mar-2011
    4
    NLP & IR
  • Document Retrieval
    Methods include Boolean, vector space, probabilistic
    Rely on index terms
    “bag of words”
    Stop lists + stemming
    But text is “unstructured”
    Information may be “hidden”
    21-Mar-2011
    5
    NLP & IR
  • NLP – a solved problem?
    As humans we do it effortlessly ... don’t we?
    DRUNK GETS NINE YEARS IN VIOLIN CASE
    PROSTITUTES APPEAL TO POPE
    STOLEN PAINTING FOUND BY TREE
    RED TAPE HOLDS UP NEW BRIDGE
    DEER KILL 300,000
    RESIDENTS CAN DROP OFF TREES
    INCLUDE CHILDREN WHEN BAKING COOKIES
    MINERS REFUSE TO WORK AFTER DEATH  
    21-Mar-2011
    6
    NLP & IR
  • Problems with Text (1)
    Polysemy
    One word maps to many concepts
    e.g. bat
    Synonymy
    One concept maps to many words
    21-Mar-2011
    7
    NLP & IR
  • Problems with Text (2)
    Word order
    Venetian blind vs. Blind venetian
    21-Mar-2011
    8
    NLP & IR
  • Problems with Text (3)
    Language is generative
    Starbucks coffee is the best
    The place I like most when I need to feed my caffeine addiction is the company from Seattle with branches everywhere
    Many different ways to express a given idea
    Synonymy, paraphrase, metaphor, etc.
    21-Mar-2011
    9
    NLP & IR
  • Problems with Text (4)
    • Frege's principle:
    • The meaning of a sentence is completely determined by the meaning of its symbols and the syntax used to combine them
    Language is a form of communication
    All communication has a context:
    time and place of utterance, the writer, the reader, their backgroundknowledge, intentions, assumptions re the reader’s knowledge/intentions, etc.
    21-Mar-2011
    10
    NLP & IR
  • Problems with Text (5)
    Language is changing
    I want to buy a mobile
    Ill-formed input
    “accomodation office”
    Co-ordination, negation, etc.
    This is not a talk about neuro-linguistic programming
    Multi-linguality
    Claudia Schiffer is on the cover of Elle
    Also sarcasm, irony, slang, jargon, etc.
    That was a wicked lecture
    Yep – the coffee break was the best part
    21-Mar-2011
    11
    NLP & IR
  • Enter NLP / Text Analytics…
    Text Analytics: a set of linguistic, analytical and predictive techniques to extract structure and meaning from unstructured documents
    NLP: the academic term for Text Analytics
    analogous to “search” vs. “IR”
    Text Analytics ~= Natural Language Processing ~= Text Mining
    Text Mining -> Scientific / technical context, automated processing
    Text Analytics -> Business context, interactive apps
    21-Mar-2011
    12
    NLP & IR
  • NLP Fundamentals
    The Role of Natural Language Processing in Information Retrieval
    21-Mar-2011
    13
    NLP & IR
  • NLP Perspectives
    Computational Linguistics
    Use of computational techniques to study linguistic phenomena
    Cognitive science
    Study of human information processing (perception, language, reasoning, etc.)
    Computer science
    Theoretical foundations of computation and practical techniques for implementation
    Information science
    Analysis, classification, manipulation, retrieval and dissemination of information
    21-Mar-2011
    14
    NLP & IR
  • NLP Paradigms
    Symbolic approaches
    Rule-based, hand coded (by linguists)
    Knowledge-intensive
    Statistical approaches
    Shallow statistical models, trained on annotated corpora
    Data-intensive
    21-Mar-2011
    15
    NLP & IR
  • NLP Fundamentals (word level)
    Language is AMBIGUOUS
    To find structure, we must remove ambiguity!
    Lexical analysis (tokenisation)
    The cat sat on the mat
    I can’t tokenise this sentence
    Stop word removal
    No definitive list
    The Who, The The, Take That…
    To be or not to be
    21-Mar-2011
    16
    NLP & IR
  • NLP Fundamentals (word level)
    Stemming
    fishing, fished, fish, fisher -> fish
    Lemmatization
    Stemming + context
    Passing -> pass + ING
    Were -> be + PAST
    Delegate = de-leg-ate (?)
    Ratify = rat-ify (?)
    Morphology (prefixes, suffixes, etc.)
    Gebäudereinigungsfirmenangestellter -> Gebäude + Reinigung + Firma + Angestellter (building + cleaning + company + employee)
    21-Mar-2011
    17
    NLP & IR
  • NLP Fundamentals (sentence level)
    Syntax (part of speech tagging)
    book -> NOUN, VERB
    that -> DETERMINER
    flight -> NOUN
    Book that flight -> VB DT NN
    Ambiguity problem
    Time flies like an arrow -> ?
    Fruit flies like a banana -> ?
    Eats shoots and leaves -> ?
    21-Mar-2011
    18
    NLP & IR
  • NLP Fundamentals (sentence level)
    Parsing (grammar)
    I saw a venetian blind
    I saw a blind venetian
    I saw the man on the hill with a telescope
    Rugby is a game played by men with odd-shaped balls
    Sentence boundary detection
    Punctuation denotes the end of a sentence!
    “But not always!”, said Fred...
    21-Mar-2011
    19
    NLP & IR
  • NLP Tools & Techniques
    The Role of Natural Language Processing in Information Retrieval
    21-Mar-2011
    20
    NLP & IR
  • Early NLP Systems
    ELIZA
    Wiezenbaum 1966
    Simple pattern matching
    PARRY
    Colby et al 1971
    Pattern matching & planning
    SHRDLU
    Winograd 1972
    Natural language understanding
    Comprehensive grammar of English
    21-Mar-2011
    21
    NLP & IR
  • Word Prediction
    Assistive technologies
    Aurora Systems, TextHelp
    Google, Bing, Yahoo query suggestions
    21-Mar-2011
    22
    NLP & IR
  • Spelling Correction
    Auto-correct
    Did You Mean
    21-Mar-2011
    23
    NLP & IR
  • Text Categorization
    News agencies:
    classify incoming news stories
    Search engines:
    classify queries, e.g. search Google for ‘LOG313’
    Identifying spam emails, e.g.
    http://www.paulgraham.com/spam.html
    Routing email or documents to appropriate people
    21-Mar-2011
    24
    NLP & IR
  • Terminology Extraction
    Differentiate between useful index terms and ‘noise’
    Help lexicographers identify new terminology:
    the C4 carbons of the nicotinamide rings
    X-ray therapy vs. X-ray therapies
    PAHO vs. Pan-American Health Organization
    Term extraction systems process scientific papers to identify terminology, possibly comparing it with a known list
    21-Mar-2011
    25
    NLP & IR
  • Speech Recognition
    Spoken Dialogue Systems
    Directory enquiries (AT&T)
    Railways (Germany, Switzerland, Italy)
    Airlines (USA)
    iPhone Voice Search
    IBM Watson
    21-Mar-2011
    26
    NLP & IR
  • Named Entity Recognition
    Identification of key concepts,
    e.g. people, places, organisations, etc.
    Also postcodes, temporal/numerical expressions, etc.
    “Mexico has been trying to stage a recovery since the beginning of this year and it's always been getting ahead of itself in terms of fundamentals,” said Matthew Hickman of Lehman Brothers in New York.”
    21-Mar-2011
    27
    NLP & IR
  • Named Entity Recognition: Applications
    Increase precision of IR
    New companies in York vs. Companies in New York
    Support navigation
    SMART tags, Apture, etc.
    Improve machine translation:
    “I live in a new house in new york”
    Speech synthesis, auto-summarisation, etc.
    21-Mar-2011
    28
    NLP & IR
  • Named Entity Recognition
    Systems usually developed for specific domain
    Are typically language and domain-specific
    Accuracy > 90% for newswire text
    Best system at MUC-7 scored 93.39% f-measure
    human annotators scored 97.60% and 96.95%.
    Lower for certain domains, e.g. life sciences (species, substances, proteins, genes, etc.)
    extended naming conventions, extensive use of conjunction and disjunction, high occurrence of neologisms and abbreviations, etc.
    21-Mar-2011
    29
    NLP & IR
  • NER (Newssift Example)
    21-Mar-2011
    30
    NLP & IR
  • Machine Translation (before)
    21-Mar-2011
    31
    NLP & IR
  • Machine Translation (after)
    21-Mar-2011
    32
    NLP & IR
  • Summarisation
    Section 3 discusses two information access applications (text mining and question answering) closely associated with NLP.
    NLP Techniques
    Named entity recognition
    Information Extraction
    Current document retrieval technologies could not identify information as specific as this within text.
    Word Sense Disambiguation
    Text Mining
    Question Answering
    “Wh” questions
    List questions
    Examples include those mentioned here: text mining, question answering and cross-language information retrieval.
    Ponte and Croft (1998) “A Language Modeling Approach to Information Retrieval” SIGIR.
     
    Sanderson, M. (1994) “Word sense disambiguation and information retrieval” Proceedings of the 17th ACM SIGIR Conference
     
    Sanderson, M. (2000) “Retrieving with Good Sense” Information Retrieval 2(1):49-69.
     
    Question Answering (Section 3.2).
     
    Summarization
    single-document vs. multi-document
    MS Word
    Bing
    21-Mar-2011
    33
    NLP & IR
  • Information Extraction
    • Identification of entities + relationships
    • Based on pre-defined structures, e.g.
    movements of company executives
    victims of terrorist attacks
    corporate mergers / acquisitions
    interactions between genes and proteins
    Example (from MUC-6):
    Now, Mr. James is preparing to sail into the sunset, and Mr. Dooner is poised to rev up the engines to guide Interpublic Group's McCann-Erickson into the 21st century. Yesterday, McCann made official what had been widely anticipated: Mr. James, 57 years old, is stepping down as chief executive officer on July 1 and will retire as chairman at the end of the year. He will be succeeded by Mr. Dooner, 45.
    21-Mar-2011
    34
    NLP & IR
  • Information Extraction
    <SUCCESSION_EVENT-2> :=
    SUCCESSION_ORG:
    <ORGANIZATION-1>
    POST: "chairman"
    IN_AND_OUT: <IN_AND_OUT-4>
    VACANCY_REASON: DEPART_WORKFORCE
     
    <IN_AND_OUT-4> :=
    IO_PERSON: <PERSON-1>
    NEW_STATUS: IN
    ON_THE_JOB: NO
    OTHER_ORG: <ORGANIZATION-1>
    REL_OTHER_ORG: SAME_ORG
     
    <ORGANIZATION-1> :=
    ORG_NAME: "McCann-Erickson"
    ORG_ALIAS: "McCann"
    ORG_TYPE: COMPANY
     
    <PERSON-1> :=
    PER_NAME: "John J. Dooner Jr."
    PER_ALIAS: "John Dooner" "Dooner"
    21-Mar-2011
    35
    NLP & IR
  • Information Extraction
    Can now use as metadata (for retrieval)
    Or store in DB and query against it, e.g.
    “show me all the events where a finance officer left a company to take up the position of CEO in another company”
    21-Mar-2011
    36
    NLP & IR
  • IE Example: Yahoo Correlator
    21-Mar-2011
    37
    NLP & IR
  • IE Example: Google Wonder Wheel
    21-Mar-2011
    38
    NLP & IR
  • IE Example: Yahoo Search Assist
    21-Mar-2011
    39
    NLP & IR
  • Information Extraction
    But IE is difficult when concepts are spread across multiple sentences:
    Pace American Group Inc. said it notified two top executives it intends to dismiss them because an internal investigation found evidence of “self-dealing” and “undisclosed financial relationships.” The executives are Don H. Pace, president and CEO; and Greg S. Kaplan, SVP and CFO.
    Event & org in first S, but names in second S
    Need to identify connection between “two top executives” and “the executives”.
    21-Mar-2011
    40
    NLP & IR
  • Information Extraction
    Anaphora resolution relies on context (semantic / pragmatic knowledge):
    We gave the bananas to the monkeys because they were hungry.
    We gave the bananas to the monkeys because they were ripe.
    We gave the bananas to the monkeys because they were here.
    21-Mar-2011
    41
    NLP & IR
  • WordNet
    Semantic database for English
    See also EuroWordNet
    Based on synsets
    car = automobile | railway car | elevator car | cable car
    1stsynset = car, motorcar, machine, auto, automobile
    Connected via relationships
    E.g. hypernymy (a kind of)
    separate hierarchies for nouns, verbs, adjectives and adverbs
    21-Mar-2011
    42
    NLP & IR
  • Word Sense Disambiguation
    Resolve ambiguities due to polysemy
    Often characterised by syntax, e.g.
    Magnesium is a light metal (ADJ)
    The light in the kitchen is quite bright (NOUN)
    Can use a Part of Speech Tagger to resolve broad ambiguities
    PoS tags = NN, NNP, NNS, NNPS, etc.
    21-Mar-2011
    43
    NLP & IR
  • Dictionary Definition of “set”
    set
       /sɛt/ Show Spelled [set] Show IPA verb, set, set·ting, noun, adjective, interjection
    –verb (used with object) 1. to put (something or someone) in a particular place: to set a vase on a table.
    2. to place in a particular position or posture: Set thebaby on his feet.
    3. to place in some relation to something or someone: We set a supervisor over the new workers.
    4. to put into some condition: to set a house on fire.
    5. to put or apply: to set fire to a house.
    6. to put in the proper position: to set a chair back on its feet.
    7. to put in the proper or desired order or condition for use: to set a trap.
    8. to distribute or arrange china, silver, etc., for use on (a table): to set the table for dinner.
    9. to place (the hair, especially when wet) on rollers, in clips, or the like, so that the hair will assume a particular style.
    10. to put (a price or value) upon something: He set $7500 as the right amount for the car. The teacher sets a high value on neatness.
    11. to fix the value of at a certain amount or rate; value: He set the car at $500. She sets neatness at a high value.
    12. to post, station, or appoint for the purpose of performing some duty: to set spies on a person.
    13. to determine or fix definitely: to set a time limit.
    14. to resolve or decide upon: to set a wedding date.
    15. to cause to pass into a given state or condition: to set one's mind at rest; to set a prisoner free.
    16. to direct or settle resolutely or wishfully: to set one's mind to a task.
    17. to present as a model; place before others as a standard: to set a good example.
    18. to establish for others to follow: to set a fast pace.
    19. to prescribe or assign, as a task.
    20. to adjust (a mechanism) so as to control its performance.
    • 21. to adjust the hands of (a clock or watch) according to a certain standard: I always set my watch by the clock in the library.
    • 22. to adjust (a timer, alarm of a clock, etc.) so as to sound when desired: He set the alarm for seven o'clock.
    • 23. to fix or mount (a gem or the like) in a frame or setting.
    • 24. to ornament or stud with gems or the like: a bracelet set with pearls.
    • 25. to cause to sit; seat: to set a child in a highchair.
    • 26. to put (a hen) on eggs to hatch them.
    • 27. to place (eggs) under a hen or in an incubator for hatching.
    • 28. to place or plant firmly: to set a flagpole in concrete.
    • 29. to put into a fixed, rigid, or settled state, as the face, muscles, etc.
    • 30. to fix at a given point or calibration: to set the dial on an oven; to set a micrometer.
    • 31. to tighten (often followed by up ): to set nuts well up.
    • 32. to cause to take a particular direction: to set one's course to the south.
    • 33. Surgery . to put (a broken or dislocated bone) back in position.
    • 34. (of a hunting dog) to indicate the position of (game) by standing stiffly and pointing with the muzzle.
    • 35. Music . a. to fit, as words to music.
    • b. to arrange for musical performance.
    • c. to arrange (music) for certain voices or instruments.
    • 36. Theater . a. to arrange the scenery, properties, lights, etc., on (a stage) for an act or scene.
    • b. to prepare (a scene) for dramatic performance.
    • 37. Nautical . to spread and secure (a sail) so as to catch the wind.
    • 38. Printing . a. to arrange (type) in the order required for printing.
    • b. to put together types corresponding to (copy); compose in type: to set an article.
    • 39. Baking . to put aside (a substance to which yeast has been added) in order that it may rise.
    • 40. to change into curd: to set milk with rennet.
    • 41. to cause (glue, mortar, or the like) to become fixed or hard.
    • 42. to urge, goad, or encourage to attack: to set the hounds on a trespasser.
    • 43. Bridge . to cause (the opposing partnership or their contract) to fall short: We set them two tricks at four spades. Only perfect defense could set four spades.
    • 44. to affix or apply, as by stamping: The king set his seal to the decree.
    • 45. to fix or engage (a fishhook) firmly into the jaws of a fish by pulling hard on the line once the fish has taken the bait.
    • 46. to sharpen or put a keen edge on (a blade, knife, razor, etc.) by honing or grinding.
    • 47. to fix the length, width, and shape of (yarn, fabric, etc.).
    • 48. Carpentry . to sink (a nail head) with a nail set.
    • 49. to bend or form to the proper shape, as a saw tooth or a spring.
    • 50. to bend the teeth of (a saw) outward from the blade alternately on both sides in order to make a cut wider than the blade itself.
    • –verb (used without object) 51. to pass below the horizon; sink: The sun sets early in winter.
    • 52. to decline; wane.
    • 53. to assume a fixed or rigid state, as the countenance or the muscles.
    • 54. (of the hair) to be placed temporarily on rollers, in clips, or the like, in order to assume a particular style: Long hair sets more easily than short hair.
    • 55. to become firm, solid, or permanent, as mortar, glue, cement, or a dye, due to drying or physical or chemical change.
    • 56. to sit on eggs to hatch them, as a hen.
    • 57. to hang or fit, as clothes.
    • 58. to begin to move; start (usually followed by forth, out, off,  etc.).
    • 59. (of a flower's ovary) to develop into a fruit.
    • 60. (of a hunting dog) to indicate the position of game.
    21-Mar-2011
    44
    NLP & IR
  • Dictionary Definition of “set”
    61. to have a certain direction or course, as a wind, current, or the like.
    62. Nautical . (of a sail) to be spread so as to catch the wind.
    63. Printing . (of type) to occupy a certain width: This copy sets to forty picas.
    64. Nonstandard . sit: Come in and set a spell.
    –noun 65. the act or state of setting or the state of being set.
    66. a collection of articles designed for use together: a set of china; a chess set.
    67. a collection, each member of which is adapted for a special use in a particular operation: a set of golf clubs; a set of carving knives.
    68. a number, group, or combination of things of similar nature, design, or function: a set of ideas.
    69. a series of volumes by one author, about one subject, etc.
    70. a number, company, or group of persons associated by common interests, occupations, conventions, or status: a set of murderous thieves; the smart set.
    71. the fit, as of an article of clothing: the set of his coat.
    72. fixed direction, bent, or inclination: The set of his mind was obvious.
    73. bearing or carriage: the set of one's shoulders.
    74. the assumption of a fixed, rigid, or hard state, as by mortar or glue.
    75. the arrangement of the hair in a particular style: How much does the beauty parlor charge for a shampoo and set?
    • 76. a plate for holding a tool or die.
    • 77. an apparatus for receiving radio or television programs; receiver.
    • 78. Philately . a group of stamps that form a complete series.
    • 79. Tennis . a unit of a match, consisting of a group of not fewer than sixgames with a margin of at least two games between the winner and loser: He won the match in straight sets of 6–3, 6–4, 6–4.
    • 80. a construction representing a place or scene in which the action takes place in a stage, motion-picture, or television production.
    • 81. Machinery . a. the bending out of the points of alternate teeth of a saw in opposite directions.
    • b. a permanent deformation or displacement of an object or part.
    • c. a tool for giving a certain form to something, as a saw tooth.
    • 82. a chisel having a wide blade for dividing bricks.
    • 83. Horticulture . a young plant, or a slip, tuber, or the like, suitable for planting.
    • 84. Dance . a. the number of couples required to execute a quadrille or the like.
    • b. a series of movements or figures that make up a quadrille or the like.
    • 85. Music . a. a group of pieces played by a band, as in a night club, and followed by an intermission.
    • b. the period during which these pieces are played.
    • 86. Bridge . a failure to take the number of tricks specified by one's contract: Our being vulnerable made the set even more costly.
    • 87. Nautical . a. the direction of a wind, current, etc.
    • b. the form or arrangement of the sails, spars, etc., of a vessel.
    • c. suit ( def. 12 ) .
    • 88. Psychology . a temporary state of an organism characterized by a readiness to respond to certain stimuli in a specific way.
    • 89. Mining . a timber frame bracing or supporting the walls or roof of a shaft or stope.
    • 90. Carpentry . nail set.
    • 90. Carpentry . nail set.
    • 91. Mathematics . a collection of objects or elements classed together.
    • 92. Printing . the width of a body of type.
    • 93. sett ( def. 3 ) .
    • –adjective 94. fixed or prescribed beforehand: a set time; set rules.
    • 95. specified; fixed: The hall holds a set number of people.
    • 96. deliberately composed; customary: set phrases.
    • 97. fixed; rigid: a set smile.
    • 98. resolved or determined; habitually or stubbornly fixed: to be set in one's opinions.
    • 99. completely prepared; ready: Is everyone set?
    • –interjection 100. (in calling the start of a race): Ready! Set! Go! Also, get set!
    21-Mar-2011
    45
    NLP & IR
  • Word Sense Disambiguation
    Surely PoS tagging would help IR performance?
    Could index docs using words AND PoS tags
    Very sensitive to tagger performance
    Does WSD improve IR performance?
    Conflicting evidence …“Perfect” WSD engine may provide only marginal gain, e.g.
    >76% accuracy needed for homonymy
    >55% for polysemy
    21-Mar-2011
    46
    NLP & IR
  • NLP Applications
    The Role of Natural Language Processing in Information Retrieval
    21-Mar-2011
    47
    NLP & IR
  • NLP Applications
    Web & enterprise search
    Text classification
    Text summarisation
    Machine translation
    Human-computer interfaces (spelling correction etc.)
    Speech recognition & synthesis
    Natural language generation (e.g. in games)
    Text mining
    Question answering
    21-Mar-2011
    48
    NLP & IR
  • Finally becoming mainstream?
    ‘80% of corporate information is unstructured’
    Entire value chain for some organisation (media / publishing etc.)
    Retail / eCommerce: Product reviews
    User generated content: blogs, forums, wikis
    Voice of the Customer: social media + sentiment analysis
    161 billion gigabytes of digital information in 2006
    approximately 988 exabytes by 2010
    Audio / video still needs summaries & tags etc.
    21-Mar-2011
    49
    NLP & IR
  • Market Outlook
    Combine component technologies
    e.g. PoS tagging + NER, etc.
    Healthy growth: massive expansion in social media
    Lightweight NLP for buzz analysis, brand monitoring, etc.
    Dominant markets:
    Customer Experience (VoC), media/publishing, financial services & insurance, intelligence, life sciences, e-discovery
    Solutions still not standardized
    Need for self-service tuning & configuration
    Partner ecosystem developing
    Marketing services providers, platform vendors, CRM + call centre vendors, system integrators
    21-Mar-2011
    50
    NLP & IR
  • Text Mining
    Analogy with Data Mining
    Discover or infer new knowledge from unstructured text resources
    A <-> B and B <-> C
    Infer A <-> C?
    e.g. link between migraine headaches and magnesium deficiency
    Applications in life sciences, media/publishing, counter terrorism and competitive intelligence
    21-Mar-2011
    51
    NLP & IR
  • Text Mining
    Levels of text mining, e.g. for bioscience:
    Tagging entities: e.g. genes, proteins, diseases and chemical compounds
    Information extraction: identify meaningful structural relations within the text (e.g. interactions between proteins)
    Knowledge discovery: combine with other knowledge sources (such as external ontologies) to infer or discover new knowledge
    21-Mar-2011
    52
    NLP & IR
  • Question Answering
    Going beyond the document retrieval paradigm
    provide specific answers to specific questions
    Introduced as a TREC track in 1999
    21-Mar-2011
    53
    NLP & IR
  • Question Answering
    Yes/no questions
    “Is George W. Bush the current president of the USA?”
    “Is the Sea of Tranquility deep?”
    “Wh” questions
    “Who was the British Prime Minister before Margaret Thatcher?”
    “When was the Battle of Hastings?”
    List questions
    “Which football teams have won the Champions League this decade?”
    “Which roads lead to Rome?”
    Instruction-based questions
    “How do I cook lasagne?”,
    “What is the best way to build a bridge?”
    Explanation questions
    “Why did World War I start?”
    “How does a computer process floating point numbers?”
    Commands
    “Tell me the height of the Eiffel Tower.”
    “Name all the Kings of England”
    21-Mar-2011
    54
    NLP & IR
  • Question Answering
    Common use cases handled by major search engines
    Google, Yahoo!, Bing, Baidu, ...
    Specialist QA systems
    Wolfram Alpha: http://www.wolframalpha.com
    True Knowledge: http://www.trueknowledge.com
    Powerset: http://www.powerset.com/
    IBM Watson
    21-Mar-2011
    55
    NLP & IR
  • Question Answering
    Basic approach:
    question analysis
    Predict type of answer expected
    Create query
    document retrieval
    answer extraction
    Based on output of (1)
    Varies from regex to deep linguistic analysis
    21-Mar-2011
    56
    NLP & IR
  • Evaluation
    How do we measure NLP accuracy?
    Compare against human annotation
    However:
    Humans often disagree
    Particularly for complex tasks
    Comparison can be measured in many ways
    “Bill Gates is chairman of Microsoft” -> “Gates” = person?
    21-Mar-2011
    57
    NLP & IR
  • Conclusions
    The Role of Natural Language Processing in Information Retrieval
    21-Mar-2011
    58
    NLP & IR
  • Conclusions
    NLP’s contribution to IR not always apparent
    Overemphasis on precision & recall?
    NLP provides benefits that go beyond quantitative metrics to support broader tasks & richer experiences
    Insight, analysis, visualisation, etc.
    NLP required for text mining, question answering & cross-language IR
    Bag of words model is insufficient
    Web becoming increasingly multi-lingual
    21-Mar-2011
    59
    NLP & IR
  • State of the Art
    Commodity technologies:
    Tokenisation, normalization, regular expression search
    Mainstream, but room for improvement:
    Lemmatization, spelling correction, speech synthesis
    Mainstream but needs improvement:
    PoS tagging, term extraction, summarization, speech recognition, text categorization, spoken dialogue systems
    Almost there:
    Information extraction, NL generation, simple speech translation
    Don’t hold your breath:
    Full machine translation, advanced dialogue systems
    21-Mar-2011
    60
    NLP & IR
  • Platforms & Toolkits
    Many commercialvendors
    Open source/access:
    GATE (Sheffield University)
    Open source, linguistic focus
    RapidMiner
    OpenNLP
    OpenCalais
    LingPipe
    nltk
    21-Mar-2011
    61
    NLP & IR
  • Further Reading
    A. Goker & J. Davies, Information Retrieval: Searching in the 21st Century. Wiley-Blackwell, 2009
    D. Jurafsky and J. Martin, Speech and Language Processing. Prentice-Hall, 2nd edition, 2008.
    Manning, C. and Schutze, H. Foundations of Statistical Language Processing. MIT Press, 1999.
    21-Mar-2011
    62
    NLP & IR
  • Thank you
    Tony Russell-Rose, PhD
    Vice-chair, BCS IRSG
    Chair, IEHF HCI Group
    Email: tgr@uxlabs.co.uk
    Blog: http://isquared.wordpress.com
    LinkedIn: http://www.linkedin.com/tonyrussellrose/
    Twitter: @tonygrr