Text Mining: Information Extraction
Goals of information extraction
"Processing of natural language texts for the extraction of relevant content pieces" (Martí and Castellón, 2000)
– Raw texts => structured databases
– Template filling
– Improving search engines
– Auxiliary tool for other language applications
Named Entity Recognition
Named entities are proper names in texts, i.e. the names of persons, organizations, locations, times and quantities. NER is the task of processing a text and identifying named entities.
Why is Named Entity Recognition difficult?
– Names are too numerous to include in dictionaries
– Variation: e.g. John Smith, Mr Smith, John
– Constant change: new names introduce unknown words
– Ambiguity: for some proper nouns it is hard to determine the category
Example
Delimit the named entities in a text and tag them with NE categories:
– entity names – ENAMEX
– temporal expressions – TIMEX
– number expressions – NUMEX
Subcategories of tags are captured by an SGML tag attribute called TYPE.
Example
Original text:
The U.K. satellite television broadcaster said its subscriber base grew 17.5 percent during the past year to 5.35 million
Tagged text:
The <ENAMEX TYPE="LOCATION">U.K.</ENAMEX> satellite television broadcaster said its subscriber base grew <NUMEX TYPE="PERCENT">17.5 percent</NUMEX> during <TIMEX TYPE="DATE">the past year</TIMEX> to 5.35 million
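Annotations in this SGML style are easy to read back out of a tagged text. A minimal sketch with a regular expression (the helper name `extract_entities` is illustrative, not from the slides):

```python
import re

# Matches MUC-style annotations: an opening tag (ENAMEX, TIMEX or NUMEX)
# with a TYPE attribute, the tagged span, and the matching closing tag.
TAG_PATTERN = re.compile(r'<(ENAMEX|TIMEX|NUMEX) TYPE="([^"]+)">(.*?)</\1>')

def extract_entities(tagged_text):
    """Return (tag, type, surface form) triples from a MUC-tagged string."""
    return TAG_PATTERN.findall(tagged_text)

tagged = ('The <ENAMEX TYPE="LOCATION">U.K.</ENAMEX> satellite television '
          'broadcaster said its subscriber base grew '
          '<NUMEX TYPE="PERCENT">17.5 percent</NUMEX> during '
          '<TIMEX TYPE="DATE">the past year</TIMEX> to 5.35 million')

print(extract_entities(tagged))
# -> [('ENAMEX', 'LOCATION', 'U.K.'),
#     ('NUMEX', 'PERCENT', '17.5 percent'),
#     ('TIMEX', 'DATE', 'the past year')]
```

The backreference `\1` ensures the closing tag matches the opening one, so a stray `</TIMEX>` cannot close an `<ENAMEX>` span.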
Maximum Entropy for NER
Use the probability distribution that has maximum entropy, i.e. that is maximally uncertain, among those consistent with the observed evidence:
– P = {models consistent with the evidence}
– H(p) = entropy of p
– P_ME = argmax_{p ∈ P} H(p)
Maximum Entropy for NER
– Given a set of answer candidates
– Model the probability of each candidate
– Define feature functions
– Apply a decision rule to pick the best candidate
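The argmax above can be illustrated numerically: among candidate models consistent with the evidence, pick the one with the highest entropy. A minimal sketch with hand-picked candidate distributions (the distributions are illustrative, not from the slides; with no constraints, the uniform distribution wins):

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum_x p(x) log2 p(x)."""
    return -sum(px * math.log2(px) for px in p if px > 0)

# Candidate models over four outcomes, all consistent with the
# (here trivial) evidence that probabilities sum to 1:
uniform = [0.25, 0.25, 0.25, 0.25]
skewed  = [0.70, 0.10, 0.10, 0.10]
peaked  = [1.00, 0.00, 0.00, 0.00]

# P_ME = argmax_{p in P} H(p)
p_me = max([uniform, skewed, peaked], key=entropy)
print(p_me, entropy(p_me))   # the uniform model, with H = 2.0 bits
```

Real maximum-entropy NER taggers solve this optimization over distributions constrained to match observed feature expectations, which yields a log-linear model rather than a search over a finite candidate list.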
Template Filling
A template is a frame (a record structure) consisting of slots and fillers. A template denotes an event or a semantic concept. After extracting named entities, relations and events, IE fills an appropriate template.
Template filling techniques
Two common approaches to template filling:
– Statistical approach
– Finite-state cascade approach
Statistical Approach
Again, by using a sequence labeling method:
– Label sequences of tokens as potential fillers for a particular slot
– Train a separate sequence classifier for each slot
– Slots are filled with the text segments identified by each slot's corresponding classifier
Statistical Approach
– Resolve multiple labels assigned to the same or overlapping text segments by adding weights (heuristic confidence) to the slots
– State-of-the-art performance: F1-measures of 75 to 98
However, these methods have been shown to be effective only for small, homogeneous data.
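The per-slot classifiers and confidence weights above can be sketched as follows. This is a toy illustration: the pattern-based "classifiers", slot names and weights are stand-ins for trained sequence labelers, chosen only to show how overlapping claims are resolved by confidence:

```python
import re

# One "classifier" per slot (here a regex stand-in for a trained
# sequence labeler), paired with a heuristic confidence weight.
SLOT_MODELS = {
    "growth": (re.compile(r"\d+\.\d+ percent"), 0.9),
    "period": (re.compile(r"the past year"),    0.8),
}

def fill_template(text):
    """Fill slots, letting higher-confidence slots claim spans first."""
    template = {}
    claimed = []                                   # spans already assigned
    for slot, (pattern, weight) in sorted(
            SLOT_MODELS.items(), key=lambda kv: -kv[1][1]):
        m = pattern.search(text)
        # Skip a match whose span overlaps one claimed by a stronger slot.
        if m and not any(s < m.end() and m.start() < e for s, e in claimed):
            template[slot] = m.group()
            claimed.append((m.start(), m.end()))
    return template

text = "its subscriber base grew 17.5 percent during the past year"
print(fill_template(text))
# -> {'growth': '17.5 percent', 'period': 'the past year'}
```

In a real system each slot's classifier would emit labels over token sequences with learned confidences; only the conflict-resolution idea carries over from this sketch.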
Finite-State Template-Filling Systems
The Message Understanding Conferences (MUC) were the genesis of IE. DARPA funded significant efforts in IE in the early to mid 1990s. MUC was an annual event/competition where results were presented.
Finite-State Template-Filling Systems
– Focused on extracting information from news articles:
  • Terrorist events (MUC-4, 1992)
  • Industrial joint ventures (MUC-5, 1993)
  • Company management changes
– Information extraction was of particular interest to the intelligence community (CIA, NSA) in the early 1990s.
Applications
Information extraction has a wide range of applications:
– Search engines
– The biomedical field
– Customer profile analysis
– Trend analysis
– Information filtering and routing
– Event tracking
– Classification of news stories
Conclusion
In this presentation we covered:
– Goals of information extraction
– Entity extraction: the Maximum Entropy method
– Template filling
– Applications