The Role of Natural Language Processing in Information Retrieval
Searching for Meaning in Text
Tony Russell-Rose, PhD
21-Mar-2011

Contents
  Introduction & Overview
    Background, Terminology
  NLP Fundamentals
    Paradigms & Perspectives
  NLP Tools & Techniques
  NLP Applications
    Text mining, Question Answering, etc.
  Conclusions & Further Reading

Introduction & Overview

Information Retrieval
  Models, techniques & frameworks for satisfying an information need
  Common paradigm: document retrieval

Document Retrieval
  Methods include Boolean, vector space, probabilistic
  Rely on index terms
    “bag of words”
    Stop lists + stemming
  But text is “unstructured”
    Information may be “hidden”

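
The "bag of words" view above can be sketched as a tiny inverted index with Boolean AND retrieval. This is only an illustration: the three documents, the stop list and the function names are assumptions made up for the example, and stemming is left out.

```python
from collections import defaultdict

# Hypothetical toy collection; document ids and texts are invented for illustration.
DOCS = {
    1: "the cat sat on the mat",
    2: "the dog chased the cat",
    3: "a dog sat on the log",
}
STOP_WORDS = {"a", "the", "on"}  # an assumed stop list; there is no definitive one


def build_index(docs):
    """Map each index term to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            if term not in STOP_WORDS:
                index[term].add(doc_id)
    return index


def boolean_and(index, query):
    """Return ids of documents containing every non-stop query term."""
    terms = [t for t in query.lower().split() if t not in STOP_WORDS]
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


INDEX = build_index(DOCS)
print(sorted(boolean_and(INDEX, "cat sat")))
print(sorted(boolean_and(INDEX, "sat cat")))  # word order makes no difference
```

Because the index only records which terms occur in which documents, "cat sat" and "sat cat" retrieve the same set. That is the sense in which structure stays "hidden" from this kind of retrieval.
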
NLP – a solved problem?
  As humans we do it effortlessly ... don’t we?
    DRUNK GETS NINE YEARS IN VIOLIN CASE
    PROSTITUTES APPEAL TO POPE
    STOLEN PAINTING FOUND BY TREE
    RED TAPE HOLDS UP NEW BRIDGE
    DEER KILL 300,000
    RESIDENTS CAN DROP OFF TREES
    INCLUDE CHILDREN WHEN BAKING COOKIES
    MINERS REFUSE TO WORK AFTER DEATH

Problems with Text (1)
  Polysemy
    One word maps to many concepts
    e.g. bat
  Synonymy
    One concept maps to many words

Problems with Text (2)
  Word order
    Venetian blind vs. Blind venetian

Problems with Text (3)
  Language is generative
    Starbucks coffee is the best
    The place I like most when I need to feed my caffeine addiction is the company from Seattle with branches everywhere
  Many different ways to express a given idea
    Synonymy, paraphrase, metaphor, etc.

Problems with Text (4)
  Frege's principle:
    The meaning of a sentence is completely determined by the meaning of its symbols and the syntax used to combine them
  Language is a form of communication
  All communication has a context:
    time and place of utterance, the writer, the reader, their background knowledge, intentions, assumptions re the reader’s knowledge/intentions, etc.

Problems with Text (5)
  Language is changing
    I want to buy a mobile
  Ill-formed input
    “accomodation office”
  Co-ordination, negation, etc.
    This is not a talk about neuro-linguistic programming
  Multi-linguality
    Claudia Schiffer is on the cover of Elle
  Also sarcasm, irony, slang, jargon, etc.
    That was a wicked lecture
    Yep – the coffee break was the best part

Enter NLP / Text Analytics…
  Text Analytics: a set of linguistic, analytical and predictive techniques to extract structure and meaning from unstructured documents
  NLP: the academic term for Text Analytics
    analogous to “search” vs. “IR”
  Text Analytics ~= Natural Language Processing ~= Text Mining
    Text Mining -> Scientific / technical context, automated processing
    Text Analytics -> Business context, interactive apps

NLP Fundamentals

NLP Perspectives
  Computational Linguistics
    Use of computational techniques to study linguistic phenomena
  Cognitive science
    Study of human information processing (perception, language, reasoning, etc.)
  Computer science
    Theoretical foundations of computation and practical techniques for implementation
  Information science
    Analysis, classification, manipulation, retrieval and dissemination of information

NLP Paradigms
  Symbolic approaches
    Rule-based, hand coded (by linguists)
    Knowledge-intensive
  Statistical approaches
    Shallow statistical models, trained on annotated corpora
    Data-intensive

NLP Fundamentals (word level)
  Language is AMBIGUOUS
    To find structure, we must remove ambiguity!
  Lexical analysis (tokenisation)
    The cat sat on the mat
    I can’t tokenise this sentence
  Stop word removal
    No definitive list
    The Who, The The, Take That…
    To be or not to be

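
A minimal sketch of tokenisation followed by stop word removal, using the slide's own examples. The regular expression and the stop list are assumptions chosen for the illustration. Real tokenisers and stop lists differ, which is the point of "no definitive list".

```python
import re

# An assumed, deliberately small stop list; real lists vary by application.
STOP_WORDS = {"the", "a", "on", "to", "be", "or", "not", "who", "that", "take"}


def tokenise(text):
    """Lower-case the text and keep runs of letters, allowing one internal apostrophe."""
    return re.findall(r"[a-z]+(?:['’][a-z]+)?", text.lower())


def remove_stop_words(tokens):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]


print(tokenise("The cat sat on the mat"))
print(tokenise("I can't tokenise this sentence"))
print(remove_stop_words(tokenise("To be or not to be")))
```

With this stop list, "To be or not to be" and the band names "The Who" and "Take That" are reduced to nothing, so a query for them would match no index terms at all.
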
NLP Fundamentals (word level)
  Stemming
    fishing, fished, fish, fisher -> fish
  Lemmatization
    Stemming + context
    Passing -> pass + ING
    Were -> be + PAST
    Delegate = de-leg-ate (?)
    Ratify = rat-ify (?)
  Morphology (prefixes, suffixes, etc.)
    Gebäudereinigungsfirmenangestellter -> Gebäude + Reinigung + Firma + Angestellter (building + cleaning + company + employee)

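
The stemming example can be reproduced with a naive suffix stripper. This is a toy sketch, not the Porter algorithm or any published stemmer: the suffix list and the minimum stem length are assumptions picked so that the slide's example works.

```python
SUFFIXES = ("ing", "ed", "er", "s")  # assumed toy suffix list, checked in this order


def naive_stem(word):
    """Strip the first matching suffix, keeping a stem of at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print([naive_stem(w) for w in ["fishing", "fished", "fish", "fisher"]])
print(naive_stem("were"))    # irregular form: suffix stripping cannot reach "be"
print(naive_stem("number"))  # over-stemming: "number" is not a form of "numb"
```

The last two calls show why lemmatization needs more than suffix rules: "were" is left untouched, and "number" is wrongly cut down to "numb".
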
NLP Fundamentals (sentence level)
  Syntax (part of speech tagging)
    book -> NOUN, VERB
    that -> DETERMINER
    flight -> NOUN
    Book that flight -> VB DT NN
  Ambiguity problem
    Time flies like an arrow -> ?
    Fruit flies like a banana -> ?
    Eats shoots and leaves -> ?

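
To see how quickly the ambiguity grows, the sketch below lists every tag sequence that a small lexicon allows for a sentence. The lexicon is an assumption made up for this example (Penn Treebank style tags). A real tagger would go on to score these candidates and pick one, which is not shown here.

```python
from itertools import product

# Assumed toy lexicon: each word maps to the tags it could take.
LEXICON = {
    "book": ["NN", "VB"], "that": ["DT"], "flight": ["NN"],
    "time": ["NN", "VB"], "flies": ["NNS", "VBZ"], "like": ["IN", "VB"],
    "an": ["DT"], "arrow": ["NN"],
}


def candidate_taggings(sentence):
    """Return every combination of per-word tags permitted by the lexicon."""
    words = sentence.lower().split()
    options = [LEXICON[w] for w in words]
    return [list(zip(words, tags)) for tags in product(*options)]


print(len(candidate_taggings("Book that flight")))
print(len(candidate_taggings("Time flies like an arrow")))
```

Even with only two tags per ambiguous word, the number of candidate readings multiplies with each ambiguous word in the sentence.
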
NLP Fundamentals (sentence level)
  Parsing (grammar)
    I saw a venetian blind
    I saw a blind venetian
    I saw the man on the hill with a telescope
    Rugby is a game played by men with odd-shaped balls
  Sentence boundary detection
    Punctuation denotes the end of a sentence!
    “But not always!”, said Fred...

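
The "punctuation denotes the end of a sentence" rule is easy to write down and easy to break. The splitter below is a deliberately naive sketch (the rule itself is an assumption for illustration): it splits after ".", "!" or "?" when whitespace and a capital letter follow.

```python
import re


def naive_split(text):
    """Split after ., ! or ? when followed by whitespace and a capital letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())


print(naive_split("Punctuation denotes the end of a sentence! It usually works."))
print(naive_split("Mr. James is stepping down. He will be succeeded by Mr. Dooner."))
```

The second call splits after each "Mr.", so two sentences come back as four fragments. Abbreviations, quotations and ellipses all need extra knowledge beyond the punctuation mark itself.
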
NLP Tools & Techniques

Early NLP Systems
  ELIZA
    Weizenbaum 1966
    Simple pattern matching
  PARRY
    Colby et al 1971
    Pattern matching & planning
  SHRDLU
    Winograd 1972
    Natural language understanding
    Comprehensive grammar of English

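
"Simple pattern matching" in the ELIZA sense can be sketched in a few lines. The rules below are hypothetical and written for this example. They are not taken from Weizenbaum's original script, which also reflected pronouns and ranked keywords.

```python
import re

# Hypothetical rules in the spirit of ELIZA: a pattern and a reply template.
RULES = [
    (re.compile(r"\bi need (.+)", re.IGNORECASE), "Why do you need {0}?"),
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\bmy (mother|father)\b", re.IGNORECASE), "Tell me more about your {0}."),
]
DEFAULT_REPLY = "Please go on."


def respond(utterance):
    """Return the reply of the first rule whose pattern matches, else a stock phrase."""
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            return template.format(*match.groups())
    return DEFAULT_REPLY


print(respond("I need a coffee break."))
print(respond("Hello"))
```

Nothing here models meaning: the program echoes back whatever text follows the matched phrase.
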
Word Prediction
  Assistive technologies
    Aurora Systems, TextHelp
  Google, Bing, Yahoo query suggestions

Spelling Correction
  Auto-correct
  Did You Mean

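
One common way to build a "Did You Mean" suggestion is to pick the known word with the smallest edit distance to the input. The sketch below uses plain Levenshtein distance over a tiny assumed vocabulary. It illustrates the idea only and does not describe how any particular search engine does it.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


VOCABULARY = ["accommodation", "office", "information", "retrieval"]  # assumed word list


def did_you_mean(word, max_distance=2):
    """Suggest the closest vocabulary word, or None if the word is known or too far off."""
    if word in VOCABULARY:
        return None
    best = min(VOCABULARY, key=lambda v: edit_distance(word, v))
    return best if edit_distance(word, best) <= max_distance else None


print(did_you_mean("accomodation"))
print(did_you_mean("retreival"))
```

The `max_distance` threshold is an assumption. Without some cut-off, every unknown string would be "corrected" to something.
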
Text Categorization
  News agencies: classify incoming news stories
  Search engines: classify queries, e.g. search Google for ‘LOG313’
  Identifying spam emails, e.g. http://www.paulgraham.com/spam.html
  Routing email or documents to appropriate people

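
Spam filtering is a convenient way to show statistical text categorisation. The sketch below is a plain multinomial naive Bayes classifier with add-one smoothing, trained on four invented messages. It is a generic textbook formulation, not the specific scoring scheme described in the article linked above.

```python
import math
from collections import Counter

# Hypothetical training data, invented for illustration only.
TRAINING = [
    ("spam", "win cash now"),
    ("spam", "cheap cash offer now"),
    ("ham", "meeting agenda for monday"),
    ("ham", "lecture notes for the meeting"),
]


def train(examples):
    """Count words per label and documents per label."""
    word_counts = {}
    label_counts = Counter()
    for label, text in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Pick the label with the highest log-probability under add-one smoothing."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        score = math.log(label_counts[label] / total_docs)
        denominator = sum(counts.values()) + len(vocabulary)
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / denominator)
        scores[label] = score
    return max(scores, key=scores.get)


MODEL = train(TRAINING)
print(classify("cash offer", *MODEL))
print(classify("monday lecture", *MODEL))
```

The same code would route news stories or emails to any set of labels. Only the training examples change.
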
Terminology Extraction
  Differentiate between useful index terms and ‘noise’
  Help lexicographers identify new terminology:
    the C4 carbons of the nicotinamide rings
    X-ray therapy vs. X-ray therapies
    PAHO vs. Pan-American Health Organization
  Term extraction systems process scientific papers to identify terminology, possibly comparing it with a known list

Speech Recognition
  Spoken Dialogue Systems
    Directory enquiries (AT&T)
    Railways (Germany, Switzerland, Italy)
    Airlines (USA)
  iPhone Voice Search
  IBM Watson

Named Entity Recognition
  Identification of key concepts, e.g. people, places, organisations, etc.
  Also postcodes, temporal/numerical expressions, etc.
    “Mexico has been trying to stage a recovery since the beginning of this year and it's always been getting ahead of itself in terms of fundamentals,” said Matthew Hickman of Lehman Brothers in New York.

Named Entity Recognition: Applications
  Increase precision of IR
    New companies in York vs. Companies in New York
  Support navigation
    SMART tags, Apture, etc.
  Improve machine translation:
    “I live in a new house in new york”
  Speech synthesis, auto-summarisation, etc.

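
The "New companies in York" contrast can be illustrated with the simplest form of entity recognition, a gazetteer lookup that prefers the longest match. The gazetteer entries and labels are assumptions for the example. The systems evaluated at MUC were far more elaborate than this.

```python
# Assumed toy gazetteer; a real system would use much larger lists or a trained model.
GAZETTEER = {
    ("new", "york"): "LOCATION",
    ("york",): "LOCATION",
    ("lehman", "brothers"): "ORGANIZATION",
}
MAX_LEN = max(len(key) for key in GAZETTEER)


def tag_entities(text):
    """Greedy longest-match lookup of token sequences in the gazetteer."""
    tokens = text.split()
    entities, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower().strip(".,") for t in tokens[i:i + n])
            if key in GAZETTEER:
                entities.append((" ".join(tokens[i:i + n]).strip(".,"), GAZETTEER[key]))
                i += n
                break
        else:
            i += 1  # no entity starts at this token
    return entities


print(tag_entities("New companies in York"))
print(tag_entities("Companies in New York"))
```

A bag-of-words index sees the same three content words in both queries, while the lookup above separates York from New York. Because it ignores case, it would also tag "new york" in "I live in a new house in new york", but not the "new" in "new house".
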
Named Entity Recognition
  Systems usually developed for a specific domain
    Are typically language and domain-specific
  Accuracy > 90% for newswire text
    Best system at MUC-7 scored 93.39% f-measure
    human annotators scored 97.60% and 96.95%.
  Lower for certain domains, e.g. life sciences (species, substances, proteins, genes, etc.)
    extended naming conventions, extensive use of conjunction and disjunction, high occurrence of neologisms and abbreviations, etc.

NER (Newssift Example)

Machine Translation (before)

Machine Translation (after)

Summarisation
  Section 3 discusses two information access applications (text mining and question answering) closely associated with NLP.
  NLP Techniques
  Named entity recognition
  Information Extraction
  Current document retrieval technologies could not identify information as specific as this within text.
  Word Sense Disambiguation
  Text Mining
  Question Answering
  “Wh” questions
  List questions
  Examples include those mentioned here: text mining, question answering and cross-language information retrieval.
  Ponte and Croft (1998) “A Language Modeling Approach to Information Retrieval” SIGIR.
  Sanderson, M. (1994) “Word sense disambiguation and information retrieval” Proceedings of the 17th ACM SIGIR Conference
  Sanderson, M. (2000) “Retrieving with Good Sense” Information Retrieval 2(1):49-69.
  Question Answering (Section 3.2).
  Summarization
    single-document vs. multi-document
    MS Word
    Bing

Information Extraction
  Identification of entities + relationships
  Based on pre-defined structures, e.g.
    movements of company executives
    victims of terrorist attacks
    corporate mergers / acquisitions
    interactions between genes and proteins
  Example (from MUC-6):
    Now, Mr. James is preparing to sail into the sunset, and Mr. Dooner is poised to rev up the engines to guide Interpublic Group's McCann-Erickson into the 21st century. Yesterday, McCann made official what had been widely anticipated: Mr. James, 57 years old, is stepping down as chief executive officer on July 1 and will retire as chairman at the end of the year. He will be succeeded by Mr. Dooner, 45.

Information Extraction
<SUCCESSION_EVENT-2> :=
    SUCCESSION_ORG: <ORGANIZATION-1>
    POST: "chairman"
    IN_AND_OUT: <IN_AND_OUT-4>
    VACANCY_REASON: DEPART_WORKFORCE
<IN_AND_OUT-4> :=
    IO_PERSON: <PERSON-1>
    NEW_STATUS: IN
    ON_THE_JOB: NO
    OTHER_ORG: <ORGANIZATION-1>
    REL_OTHER_ORG: SAME_ORG
<ORGANIZATION-1> :=
    ORG_NAME: "McCann-Erickson"
    ORG_ALIAS: "McCann"
    ORG_TYPE: COMPANY
<PERSON-1> :=
    PER_NAME: "John J. Dooner Jr."
    PER_ALIAS: "John Dooner" "Dooner"

Information Extraction
  Can now use as metadata (for retrieval)
  Or store in DB and query against it, e.g.
    “show me all the events where a finance officer left a company to take up the position of CEO in another company”

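
Once events are extracted into a fixed structure, the example query becomes a simple filter. The record type below is a flattened, hypothetical simplification of a succession template, and the three records are invented for illustration. None of them come from the MUC-6 data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuccessionEvent:
    """A flattened, hypothetical stand-in for an extracted succession template."""
    person: str
    old_post: str
    old_org: str
    new_post: str
    new_org: str


# Invented records for illustration only.
EVENTS = [
    SuccessionEvent("A. Smith", "CFO", "Acme", "CEO", "Globex"),
    SuccessionEvent("B. Jones", "CFO", "Acme", "CEO", "Acme"),
    SuccessionEvent("C. Brown", "COO", "Initech", "CEO", "Globex"),
]


def finance_officers_who_became_ceo_elsewhere(events):
    """Events where a CFO left one company to become CEO of a different one."""
    return [
        e for e in events
        if e.old_post == "CFO" and e.new_post == "CEO" and e.old_org != e.new_org
    ]


for event in finance_officers_who_became_ceo_elsewhere(EVENTS):
    print(event.person, event.old_org, "->", event.new_org)
```

A keyword search cannot express the "in another company" condition. It only becomes checkable once the organisations are separate fields.
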
IE Example: Yahoo Correlator

IE Example: Google Wonder Wheel

IE Example: Yahoo Search Assist

Information Extraction
  But IE is difficult when concepts are spread across multiple sentences:
    Pace American Group Inc. said it notified two top executives it intends to dismiss them because an internal investigation found evidence of “self-dealing” and “undisclosed financial relationships.” The executives are Don H. Pace, president and CEO; and Greg S. Kaplan, SVP and CFO.
  Event & org in first sentence, but names in second sentence
  Need to identify connection between “two top executives” and “the executives”.

Information Extraction
  Anaphora resolution relies on context (semantic / pragmatic knowledge):
    We gave the bananas to the monkeys because they were hungry.
    We gave the bananas to the monkeys because they were ripe.
    We gave the bananas to the monkeys because they were here.

WordNet
  Semantic database for English
    See also EuroWordNet
  Based on synsets
    car = automobile | railway car | elevator car | cable car
    1st synset = car, motorcar, machine, auto, automobile
  Connected via relationships
    E.g. hypernymy (a kind of)
    separate hierarchies for nouns, verbs, adjectives and adverbs

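
The synset idea can be sketched with two small dictionaries: one from synset ids to their member words, one for "is a kind of" links. This is a hand-built fragment for illustration. The ids and the hypernym links are simplified assumptions, not an extract of the real WordNet database or its API.

```python
# A tiny hand-built fragment; ids and links are simplified assumptions.
SYNSETS = {
    "car.1": ["car", "motorcar", "machine", "auto", "automobile"],
    "car.2": ["car", "railway car"],
    "motor_vehicle.1": ["motor vehicle"],
    "vehicle.1": ["vehicle"],
}
HYPERNYM = {  # "is a kind of" links between synsets
    "car.1": "motor_vehicle.1",
    "motor_vehicle.1": "vehicle.1",
    "car.2": "vehicle.1",
}


def senses(word):
    """All synsets that list the word as a member."""
    return [sid for sid, lemmas in SYNSETS.items() if word in lemmas]


def hypernym_chain(synset_id):
    """Follow 'is a kind of' links upwards until none is left."""
    chain = []
    while synset_id in HYPERNYM:
        synset_id = HYPERNYM[synset_id]
        chain.append(synset_id)
    return chain


def synonyms(a, b):
    """Two words are synonyms here if some synset contains both."""
    return any(a in lemmas and b in lemmas for lemmas in SYNSETS.values())


print(senses("car"))
print(hypernym_chain("car.1"))
print(synonyms("auto", "automobile"), synonyms("automobile", "railway car"))
```

Polysemy shows up as one word in several synsets, and synonymy as several words in one synset.
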
Word Sense Disambiguation
  Resolve ambiguities due to polysemy
  Often characterised by syntax, e.g.
    Magnesium is a light metal (ADJ)
    The light in the kitchen is quite bright (NOUN)
  Can use a Part of Speech Tagger to resolve broad ambiguities
    PoS tags = NN, NNP, NNS, NNPS, etc.

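
When senses differ in part of speech, a tag is enough to choose between them. The sketch below assumes the sentences have already been tagged (the tags are written in by hand here) and uses a two-entry sense inventory invented for the example. Senses that share a part of speech need other evidence, which this does not attempt.

```python
# Hypothetical sense inventory keyed by (word, coarse part-of-speech tag).
SENSES = {
    ("light", "ADJ"): "of little weight",
    ("light", "NOUN"): "illumination",
}


def sense_of(target, tagged_sentence):
    """Return the sense gloss for the first occurrence of target, or None."""
    for word, tag in tagged_sentence:
        if word.lower() == target:
            return SENSES.get((target, tag))
    return None


# Tags assigned by hand for the two example sentences.
METAL = [("Magnesium", "NOUN"), ("is", "VERB"), ("a", "DET"),
         ("light", "ADJ"), ("metal", "NOUN")]
KITCHEN = [("The", "DET"), ("light", "NOUN"), ("in", "ADP"), ("the", "DET"),
           ("kitchen", "NOUN"), ("is", "VERB"), ("quite", "ADV"), ("bright", "ADJ")]

print(sense_of("light", METAL))
print(sense_of("light", KITCHEN))
```
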
Dictionary Definition of “set”
set /sɛt/ [set] verb, set, set·ting, noun, adjective, interjection
–verb (used with object)
1. to put (something or someone) in a particular place: to set a vase on a table.
2. to place in a particular position or posture: Set the baby on his feet.
3. to place in some relation to something or someone: We set a supervisor over the new workers.
4. to put into some condition: to set a house on fire.
5. to put or apply: to set fire to a house.
6. to put in the proper position: to set a chair back on its feet.
7. to put in the proper or desired order or condition for use: to set a trap.
8. to distribute or arrange china, silver, etc., for use on (a table): to set the table for dinner.
9. to place (the hair, especially when wet) on rollers, in clips, or the like, so that the hair will assume a particular style.
10. to put (a price or value) upon something: He set $7500 as the right amount for the car. The teacher sets a high value on neatness.
11. to fix the value of at a certain amount or rate; value: He set the car at $500. She sets neatness at a high value.
12. to post, station, or appoint for the purpose of performing some duty: to set spies on a person.
13. to determine or fix definitely: to set a time limit.
14. to resolve or decide upon: to set a wedding date.
15. to cause to pass into a given state or condition: to set one's mind at rest; to set a prisoner free.
16. to direct or settle resolutely or wishfully: to set one's mind to a task.
17. to present as a model; place before others as a standard: to set a good example.
18. to establish for others to follow: to set a fast pace.
19. to prescribe or assign, as a task.
20. to adjust (a mechanism) so as to control its performance.
21. to adjust the hands of (a clock or watch) according to a certain standard: I always set my watch by the clock in the library.
22. to adjust (a timer, alarm of a clock, etc.) so as to sound when desired: He set the alarm for seven o'clock.
23. to fix or mount (a gem or the like) in a frame or setting.
24. to ornament or stud with gems or the like: a bracelet set with pearls.
25. to cause to sit; seat: to set a child in a highchair.
26. to put (a hen) on eggs to hatch them.
27. to place (eggs) under a hen or in an incubator for hatching.
28. to place or plant firmly: to set a flagpole in concrete.
29. to put into a fixed, rigid, or settled state, as the face, muscles, etc.
30. to fix at a given point or calibration: to set the dial on an oven; to set a micrometer.
31. to tighten (often followed by up): to set nuts well up.
32. to cause to take a particular direction: to set one's course to the south.
33. Surgery. to put (a broken or dislocated bone) back in position.
34. (of a hunting dog) to indicate the position of (game) by standing stiffly and pointing with the muzzle.
35. Music. a. to fit, as words to music.
b. to arrange for musical performance.
c. to arrange (music) for certain voices or instruments.
36. Theater. a. to arrange the scenery, properties, lights, etc., on (a stage) for an act or scene.

b. to prepare (a scene) for dramatic performance.
37. Nautical. to spread and secure (a sail) so as to catch the wind.
38. Printing. a. to arrange (type) in the order required for printing.
b. to put together types corresponding to (copy); compose in type: to set an article.
39. Baking. to put aside (a substance to which yeast has been added) in order that it may rise.
40. to change into curd: to set milk with rennet.
41. to cause (glue, mortar, or the like) to become fixed or hard.
42. to urge, goad, or encourage to attack: to set the hounds on a trespasser.
43. Bridge. to cause (the opposing partnership or their contract) to fall short: We set them two tricks at four spades. Only perfect defense could set four spades.
44. to affix or apply, as by stamping: The king set his seal to the decree.
45. to fix or engage (a fishhook) firmly into the jaws of a fish by pulling hard on the line once the fish has taken the bait.
46. to sharpen or put a keen edge on (a blade, knife, razor, etc.) by honing or grinding.
47. to fix the length, width, and shape of (yarn, fabric, etc.).
48. Carpentry. to sink (a nail head) with a nail set.
49. to bend or form to the proper shape, as a saw tooth or a spring.
50. to bend the teeth of (a saw) outward from the blade alternately on both sides in order to make a cut wider than the blade itself.
–verb (used without object)
51. to pass below the horizon; sink: The sun sets early in winter.
53. to assume a fixed or rigid state, as the countenance or the muscles.
54. (of the hair) to be placed temporarily on rollers, in clips, or the like, in order to assume a particular style: Long hair sets more easily than short hair.
55. to become firm, solid, or permanent, as mortar, glue, cement, or a dye, due to drying or physical or chemical change.
56. to sit on eggs to hatch them, as a hen.
57. to hang or fit, as clothes.
58. to begin to move; start (usually followed by forth, out, off, etc.).
59. (of a flower's ovary) to develop into a fruit.
60. (of a hunting dog) to indicate the position of game.
  • 90.
    Dictionary Definition ofā€œsetā€61. to have a certain direction or course, as a wind, current, or the like. 62. Nautical . (of a sail) to be spread so as to catch the wind. 63. Printing . (of type) to occupy a certain width: This copy sets to forty picas. 64. Nonstandard . sit: Come in and set a spell. –noun 65. the act or state of setting or the state of being set. 66. a collection of articles designed for use together: a set of china; a chessĀ set. 67. a collection, each member of which is adapted for a special use in a particular operation: a set of golf clubs; a set of carving knives. 68. a number, group, or combination of things of similar nature, design, or function: a set of ideas. 69. a series of volumes by one author, about one subject, etc. 70. a number, company, or group of persons associated by common interests, occupations, conventions, or status: a set of murderous thieves; the smart set. 71. the fit, as of an article of clothing: the set of his coat. 72. fixed direction, bent, or inclination: The set of his mind was obvious. 73. bearing or carriage: the set of one's shoulders. 74. the assumptionĀ of a fixed, rigid, or hard state, as by mortar or glue. 75. the arrangement of the hair in a particular style: How much does the beautyĀ parlor charge for a shampoo and set? 76. a plate for holding a tool or die.
  • 91.
    77. an apparatus for receiving radio or television programs; receiver.
  • 92.
    78. Philately. a group of stamps that form a complete series.
  • 93.
    79. Tennis. a unit of a match, consisting of a group of not fewer than six games with a margin of at least two games between the winner and loser: He won the match in straight sets of 6–3, 6–4, 6–4.
  • 94.
    80. a construction representing a place or scene in which the action takes place in a stage, motion-picture, or television production.
  • 95.
    81. Machinery. a. the bending out of the points of alternate teeth of a saw in opposite directions.
  • 96.
    b. a permanent deformation or displacement of an object or part.
  • 97.
    c. a tool for giving a certain form to something, as a saw tooth.
  • 98.
    82. a chisel having a wide blade for dividing bricks.
  • 99.
    83. Horticulture. a young plant, or a slip, tuber, or the like, suitable for planting.
  • 100.
    84. Dance. a. the number of couples required to execute a quadrille or the like.
  • 101.
    b. a series of movements or figures that make up a quadrille or the like.
  • 102.
    85. Music. a. a group of pieces played by a band, as in a night club, and followed by an intermission.
  • 103.
    b. the period during which these pieces are played.
  • 104.
    86. Bridge. a failure to take the number of tricks specified by one's contract: Our being vulnerable made the set even more costly.
  • 105.
    87. Nautical. a. the direction of a wind, current, etc.
  • 106.
    b. the form or arrangement of the sails, spars, etc., of a vessel.
  • 107.
    c. suit (def. 12).
  • 108.
    88. Psychology. a temporary state of an organism characterized by a readiness to respond to certain stimuli in a specific way.
  • 109.
    89. Mining. a timber frame bracing or supporting the walls or roof of a shaft or stope.
  • 110.
  • 111.
  • 112.
    91. Mathematics. a collection of objects or elements classed together.
  • 113.
    92. Printing. the width of a body of type.
  • 114.
    93. sett (def. 3).
  • 115.
    –adjective 94. fixed or prescribed beforehand: a set time; set rules.
  • 116.
    95. specified; fixed: The hall holds a set number of people.
  • 117.
    96. deliberately composed; customary: set phrases.
  • 118.
    97. fixed; rigid: a set smile.
  • 119.
    98. resolved or determined; habitually or stubbornly fixed: to be set in one's opinions.
  • 120.
    99. completely prepared; ready: Is everyone set?
  • 121.
    –interjection 100. (in calling the start of a race): Ready! Set! Go! Also, get set!
  • 122.
    Word Sense Disambiguation
    Surely PoS tagging would help IR performance?
    Could index docs using words AND PoS tags
    Very sensitive to tagger performance
    Does WSD improve IR performance?
    Conflicting evidence …
    “Perfect” WSD engine may provide only marginal gain, e.g.
    >76% accuracy needed for homonymy
    >55% for polysemy
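The idea of indexing documents by words AND PoS tags can be sketched as follows. This is a minimal illustration, not a method from the talk: the three documents and their tags are hand-written assumptions, and the names `TAGGED_DOCS` and `build_indexes` are hypothetical. In practice the tags would come from an automatic tagger, so any tagging error goes straight into the index, which is why the approach is so sensitive to tagger performance.

```python
from collections import defaultdict

# Hypothetical hand-tagged documents: lists of (token, PoS tag) pairs.
# A real system would get these from a PoS tagger, errors included.
TAGGED_DOCS = {
    "d1": [("set", "VERB"), ("the", "DET"), ("table", "NOUN")],
    "d2": [("a", "DET"), ("set", "NOUN"), ("of", "ADP"),
           ("golf", "NOUN"), ("clubs", "NOUN")],
    "d3": [("a", "DET"), ("set", "ADJ"), ("time", "NOUN")],
}


def build_indexes(tagged_docs):
    """Build a plain word index and a (word, tag) index over the documents."""
    word_index = defaultdict(set)
    word_pos_index = defaultdict(set)
    for doc_id, tokens in tagged_docs.items():
        for word, tag in tokens:
            key = word.lower()
            word_index[key].add(doc_id)
            word_pos_index[(key, tag)].add(doc_id)
    return word_index, word_pos_index


word_index, word_pos_index = build_indexes(TAGGED_DOCS)

# Bag-of-words lookup cannot tell the three uses of "set" apart.
print(sorted(word_index["set"]))                # ['d1', 'd2', 'd3']
# The (word, tag) lookup returns only the noun usage.
print(sorted(word_pos_index[("set", "NOUN")]))  # ['d2']
```

Note that the tag only separates parts of speech. The many noun senses of "set" listed in the dictionary entry would still share one index key, which is where full word sense disambiguation would be needed.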
  • 123.
    NLP Applications
    The Role of Natural Language Processing in Information Retrieval
  • 124.
    NLP Applications
    Web & enterprise search
    Text classification
    Text summarisation
    Machine translation
    Human-computer interfaces (spelling correction etc.)
    Speech recognition & synthesis
    Natural language generation (e.g. in games)
    Text mining
    Question answering
  • 125.
    Finally becoming mainstream?
    ‘80% of corporate information is unstructured’
    Entire value chain for some organisations (media / publishing etc.)
    Retail / eCommerce: product reviews
    User generated content: blogs, forums, wikis
    Voice of the Customer: social media + sentiment analysis
    161 billion gigabytes of digital information in 2006
    approximately 988 exabytes by 2010
    Audio / video still needs summaries & tags etc.
  • 126.
    Market Outlook
    Combine component technologies
    e.g. PoS tagging + NER, etc.
    Healthy growth: massive expansion in social media
    Lightweight NLP for buzz analysis, brand monitoring, etc.
    Dominant markets:
    Customer Experience (VoC), media/publishing, financial services & insurance, intelligence, life sciences, e-discovery
    Solutions still not standardized
    Need for self-service tuning & configuration
    Partner ecosystem developing
    Marketing services providers, platform vendors, CRM + call centre vendors, system integrators
  • 127.
    Text Mining
    Analogy with Data Mining
    Discover or infer new knowledge from unstructured text resources
    A <-> B and B <-> C
    Infer A <-> C?
    e.g. link between migraine headaches and magnesium deficiency
    Applications in life sciences, media/publishing, counter terrorism and competitive intelligence
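The "A <-> B and B <-> C, infer A <-> C?" pattern can be sketched as a search for concepts that never co-occur with A directly but share intermediate terms with it. The sketch below is only an illustration under stated assumptions: the records are invented toy data rather than real bibliographic evidence, the intermediate terms are placeholders, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Toy "literature": each record is the set of concepts mentioned together
# in one hypothetical document. Invented for illustration only.
RECORDS = [
    {"migraine", "spreading depression"},
    {"migraine", "platelet aggregation"},
    {"magnesium", "spreading depression"},
    {"magnesium", "platelet aggregation"},
    {"migraine", "aura"},
]


def cooccurrence_graph(records):
    """Link every pair of concepts that appear in the same record."""
    graph = defaultdict(set)
    for concepts in records:
        for a, b in combinations(sorted(concepts), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def indirect_links(graph, a):
    """Return {C: shared B terms} for concepts C never seen directly with A."""
    candidates = defaultdict(set)
    for b in graph[a]:
        for c in graph[b]:
            if c != a and c not in graph[a]:
                candidates[c].add(b)
    return dict(candidates)


graph = cooccurrence_graph(RECORDS)
links = indirect_links(graph, "migraine")
for concept, via in links.items():
    print(concept, "via", sorted(via))
```

A candidate found this way is a hypothesis to be checked by a domain expert, not a discovered fact. Ranking candidates by the number of shared B terms is one simple way to prioritise them.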
  • 128.
    Text Mining
    Levels of text mining, e.g. for bioscience:
    Tagging entities: e.g. genes, proteins, diseases and chemical compounds
    Information extraction: identify meaningful structural relations within the text (e.g. interactions between proteins)
    Knowledge discovery: combine with other knowledge sources (such as external ontologies) to infer or discover new knowledge
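The first two levels can be sketched with very simple machinery: a lookup list for entity tagging and a verb pattern for relation extraction. This is a rough sketch under assumptions, not a description of any particular bioscience system: the gazetteer, the list of relation verbs and the function names are all hypothetical, and real systems rely on curated lexicons, trained taggers and parsers.

```python
import re

# Hypothetical gazetteer mapping surface forms to entity types.
GAZETTEER = {"BRCA1": "GENE", "p53": "PROTEIN", "MDM2": "PROTEIN"}

# Hypothetical list of verbs taken to signal an interaction.
RELATION_PATTERN = re.compile(r"\b(binds|inhibits|activates|interacts with)\b")


def tag_entities(sentence):
    """Level 1: return (name, type, start offset) for each gazetteer hit."""
    hits = []
    for name, entity_type in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", sentence):
            hits.append((name, entity_type, match.start()))
    return sorted(hits, key=lambda hit: hit[2])


def extract_interactions(sentence):
    """Level 2: find 'X <verb> Y' between neighbouring tagged entities."""
    entities = tag_entities(sentence)
    relations = []
    for (a, _, a_pos), (b, _, b_pos) in zip(entities, entities[1:]):
        between = sentence[a_pos + len(a):b_pos]
        match = RELATION_PATTERN.search(between)
        if match:
            relations.append((a, match.group(1), b))
    return relations


sentence = "MDM2 binds p53 and inhibits its transcriptional activity."
print(tag_entities(sentence))
print(extract_interactions(sentence))
```

The third level, knowledge discovery, would then combine the extracted triples with external resources such as ontologies, which is beyond what a pattern matcher like this can do.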
  • 129.
    Question Answering
    Going beyond the document retrieval paradigm
    Provide specific answers to specific questions
    Introduced as a TREC track in 1999
  • 130.
    Question Answering
    Yes/no questions: “Is George W. Bush the current president of the USA?” “Is the Sea of Tranquility deep?”
    “Wh” questions: “Who was the British Prime Minister before Margaret Thatcher?” “When was the Battle of Hastings?”
    List questions: “Which football teams have won the Champions League this decade?” “Which roads lead to Rome?”
    Instruction-based questions: “How do I cook lasagne?” “What is the best way to build a bridge?”
    Explanation questions: “Why did World War I start?” “How does a computer process floating point numbers?”
    Commands: “Tell me the height of the Eiffel Tower.” “Name all the Kings of England”
  • 131.
    Question Answering
    Common use cases handled by major search engines
    Google, Yahoo!, Bing, Baidu, ...
    Specialist QA systems
    Wolfram Alpha: http://www.wolframalpha.com
    True Knowledge: http://www.trueknowledge.com
    Powerset: http://www.powerset.com/
    IBM Watson
  • 132.
    Question Answering
    Basic approach:
    1. question analysis
       Predict type of answer expected
       Create query
    2. document retrieval
    3. answer extraction
       Based on output of (1)
       Varies from regex to deep linguistic analysis
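The three steps of the basic approach can be sketched end to end at the "regex" end of the spectrum. Everything here is an illustrative assumption: the two-sentence collection, the stop list, the answer-type rules and the function names are hypothetical, and a real system would use a proper index and far richer question analysis.

```python
import re

# Hypothetical mini-collection standing in for a document index.
DOCS = [
    "The Battle of Hastings was fought in 1066 between Norman and English armies.",
    "The Eiffel Tower in Paris was completed in 1889.",
]

STOPWORDS = {"when", "who", "what", "was", "the", "of", "is", "did", "in"}


def analyse_question(question):
    """Step 1: predict the expected answer type and create a keyword query."""
    q = question.lower()
    if q.startswith("when"):
        answer_type = "DATE"
    elif q.startswith("who"):
        answer_type = "PERSON"
    else:
        answer_type = "OTHER"
    query = [w for w in re.findall(r"[a-z0-9]+", q) if w not in STOPWORDS]
    return answer_type, query


def retrieve(query, docs):
    """Step 2: rank documents by how many query terms they contain."""
    def score(doc):
        words = set(re.findall(r"[a-z0-9]+", doc.lower()))
        return sum(term in words for term in query)
    ranked = sorted(docs, key=score, reverse=True)
    return [doc for doc in ranked if score(doc) > 0]


def extract_answer(answer_type, docs):
    """Step 3: pull out a candidate matching the type predicted in step 1."""
    if answer_type == "DATE":
        for doc in docs:
            match = re.search(r"\b(1[0-9]{3}|20[0-9]{2})\b", doc)
            if match:
                return match.group(1)
    return None  # other answer types need deeper linguistic analysis


def answer(question, docs=DOCS):
    answer_type, query = analyse_question(question)
    return extract_answer(answer_type, retrieve(query, docs))


print(answer("When was the Battle of Hastings?"))
```

The "who" question from the earlier slide returns `None` here, because extracting a person reliably needs named entity recognition rather than a regular expression.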
  • 133.
    Evaluation
    How do we measure NLP accuracy?
    Compare against human annotation
    However:
    Humans often disagree
    Particularly for complex tasks
    Comparison can be measured in many ways
    “Bill Gates is chairman of Microsoft” -> “Gates” = person?
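The "Gates" example shows why the choice of comparison matters: a system that tags only "Gates" as a person is wrong under exact matching but right under partial matching. A minimal sketch, assuming character-offset spans and hypothetical gold and system annotations (the function names are also assumptions):

```python
def spans_overlap(a, b):
    """True if two (start, end, label) spans share a label and some text."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def precision_recall(gold, predicted, match):
    """Score predicted spans against gold spans with a given match rule."""
    matched_pred = sum(any(match(p, g) for g in gold) for p in predicted)
    matched_gold = sum(any(match(g, p) for p in predicted) for g in gold)
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    return precision, recall


text = "Bill Gates is chairman of Microsoft"

# Hypothetical annotations as (start, end, label) character offsets.
gold = [(0, 10, "PERSON"), (26, 35, "ORG")]       # "Bill Gates", "Microsoft"
predicted = [(5, 10, "PERSON"), (26, 35, "ORG")]  # "Gates", "Microsoft"

exact = precision_recall(gold, predicted, lambda a, b: a == b)
partial = precision_recall(gold, predicted, spans_overlap)

print("exact match:  ", exact)    # (0.5, 0.5)
print("partial match:", partial)  # (1.0, 1.0)
```

The same system output scores 50% or 100% depending on the rule, so reported figures are only comparable when the matching criterion is stated. Disagreement between human annotators adds a further ceiling on what any score can mean.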
  • 134.
    Conclusions
    The Role of Natural Language Processing in Information Retrieval
  • 135.
    Conclusions
    NLP’s contribution to IR not always apparent
    Overemphasis on precision & recall?
    NLP provides benefits that go beyond quantitative metrics to support broader tasks & richer experiences
    Insight, analysis, visualisation, etc.
    NLP required for text mining, question answering & cross-language IR
    Bag of words model is insufficient
    Web becoming increasingly multi-lingual
  • 136.
    State of the Art
    Commodity technologies: Tokenisation, normalization, regular expression search
    Mainstream, but room for improvement: Lemmatization, spelling correction, speech synthesis
    Mainstream but needs improvement: PoS tagging, term extraction, summarization, speech recognition, text categorization, spoken dialogue systems
    Almost there: Information extraction, NL generation, simple speech translation
    Don’t hold your breath: Full machine translation, advanced dialogue systems
  • 137.
    Platforms & Toolkits
    Many commercial vendors
    Open source/access:
    GATE (Sheffield University)
    RapidMiner
    Stanford NLP
    OpenNLP
    OpenCalais
    LingPipe
    nltk
  • 138.
    Further Reading
    A. Göker and J. Davies, Information Retrieval: Searching in the 21st Century. Wiley-Blackwell, 2009.
    D. Jurafsky and J. Martin, Speech and Language Processing. Prentice-Hall, 2nd edition, 2008.
    C. Manning and H. Schütze, Foundations of Statistical Natural Language Processing. MIT Press, 1999.
  • 139.
    Thank you
    Tony Russell-Rose, PhD
    Vice-chair, BCS IRSG
    Chair, IEHF HCI Group
    Email: tgr@uxlabs.co.uk
    Blog: http://isquared.wordpress.com
    LinkedIn: http://www.linkedin.com/tonyrussellrose/
    Twitter: @tonygrr

Editor's Notes

  • #17 Exercise: experiment with the 3 major search engines (Google, Bing, Yahoo) to see how they handle stop words. Try queries that are dominated by stop words, with and without quotes. How do they differ?
  • #18 Exercise: experiment with the 3 major search engines (Google, Bing, Yahoo) to see how they handle stemming. Try queries that are dominated by inflected forms, with and without quotes. How do they differ?
  • #19 Do as class exercise
  • #22 Demo: ELIZA http://nlp-addiction.com/eliza/
  • #23 Demo Google suggest
  • #24 Demo Auto-correct & DYM
  • #33 Demo: try Google Translate (or Babelfish) with sentences containing ambiguous terms or NEs, e.g. “I live in a new house in new york”. Use the speech synthesizer. http://translate.google.com/#en|de|I%20live%20in%20a%20new%20house%20in%20new%20york
  • #40 Demo Yahoo search assist + “explore concepts”
  • #44 Paired exercise: what is the most polysemous English word you can think of? How many senses do you think it has in an average dictionary? Think of some polysemous words (examples include “ball”, “bank”, “port” but there are many others). Use these as queries in a search engine and comment on what you notice about the results.
  • #55 Paired exercise: try each of the above queries in Google, Yahoo, Bing. Now try to answer the same questions with a normal web search (by entering a set of query terms instead of a natural language question). Are the results of this search any better or worse?
  • #56 Paired exercise: try previous queries on these sites