Textmining Information Extraction

Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

    Textmining Information Extraction Presentation Transcript

    • Text Mining: Information Extraction
    • Goals of information extraction
      “Processing of natural language texts for the extraction of relevant content pieces” (MARTÍ AND CASTELLÓN, 2000)
      Raw texts => structured databases
      Templates filling
      Improving search engines
      Auxiliary tool for other language applications
    • Named Entity Recognition
      Named Entities are proper names in texts, i.e. the
      names of persons, organizations, locations, times and
      quantities.
      NER is the task of processing a text and identifying
      named entities.
    • Why is Named Entity Recognition difficult?
      - Names are too numerous to include in dictionaries
      - Names vary in form, e.g. John Smith, Mr Smith, John
      - Names change constantly: new names introduce unknown words
      For some proper nouns it is hard to determine the
    • Example
      Delimit the named entities in a text and mark them with NE tags:
      – entity names - ENAMEX
      – temporal expressions - TIMEX
      – number expressions - NUMEX
      Subcategories of tags
      – captured by an SGML tag attribute called TYPE
    • Example
      Original text:
      The U.K. satellite television broadcaster said its subscriber base grew 17.5 percent during the past year to 5.35 million
      Tagged text:
      The <ENAMEX TYPE="LOCATION">U.K.</ENAMEX> satellite television broadcaster said its subscriber base grew
      <NUMEX TYPE="PERCENT">17.5 percent</NUMEX>
      during <TIMEX TYPE="DATE">the past year</TIMEX> to
      5.35 million
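
As a rough illustration of how such markup can be consumed downstream, the sketch below pulls the tagged entities back out of a fragment of the sentence above. The regular expression and the helper name `extract_entities` are assumptions made for this example; they are not part of any MUC tooling.

```python
import re

# Fragment of the tagged sentence shown on the slide.
tagged = (
    'its subscriber base grew '
    '<NUMEX TYPE="PERCENT">17.5 percent</NUMEX> '
    'during <TIMEX TYPE="DATE">the past year</TIMEX>'
)

# One pattern for the three tag families; \1 forces the closing tag
# to match the opening tag name.
NE_PATTERN = re.compile(r'<(ENAMEX|TIMEX|NUMEX) TYPE="([A-Z]+)">(.*?)</\1>')

def extract_entities(text):
    """Return (tag, type, surface string) triples in order of appearance."""
    return [m.groups() for m in NE_PATTERN.finditer(text)]

entities = extract_entities(tagged)
for tag, ne_type, surface in entities:
    print(tag, ne_type, surface)
```

Each triple corresponds to one delimited entity: the tag family, its TYPE attribute, and the text it covers.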
    • Maximum Entropy for NER
      Use the probability distribution that has maximum entropy (that is, the most uncertain one) among those consistent with the observed evidence
      • $P$ = {models consistent with the evidence}
      • $H(p)$ = entropy of $p$
      • $p_{ME} = \arg\max_{p \in P} H(p)$
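
A small worked example may make the principle concrete. It assumes Shannon entropy in bits, H(p) = -Σ p(x) log2 p(x), and an invented piece of evidence: a token is a PERSON with probability 0.5, with nothing known about how the rest is split between two other labels. The candidate distributions and their names are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(px * math.log2(px) for px in p if px > 0)

# Every candidate satisfies the assumed evidence p(PERSON) = 0.5.
# The other two entries are p(ORGANIZATION) and p(LOCATION).
candidates = {
    "skewed": [0.5, 0.4, 0.1],
    "very_skewed": [0.5, 0.49, 0.01],
    "even": [0.5, 0.25, 0.25],
}

# Pick the candidate with the highest entropy.
best = max(candidates, key=lambda name: entropy(candidates[name]))
print(best, entropy(candidates[best]))
```

Among these candidates, the one that splits the unconstrained mass evenly has the highest entropy, which is the sense in which the maximum entropy choice assumes nothing beyond the evidence.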
    • Maximum Entropy for NER
      Given a set of answer candidates
      Model the probability
      Define feature functions
      Decision Rule
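
The formulas for these four steps did not survive in the transcript. The sketch below shows one common way to put them together, assuming the standard log-linear form p(y|x) = exp(Σ λ_i f_i(x, y)) / Z(x) and an argmax decision rule. The labels, feature functions, and weights are hypothetical and hand-set here; in practice the weights would be learned from annotated data.

```python
import math

# Answer candidates: the labels a token may receive (assumed set).
LABELS = ["PERSON", "LOCATION", "OTHER"]

# Hypothetical binary feature functions f_i(x, y) over a token context x and label y.
def f_title_person(x, y):
    return 1.0 if x["prev"] in {"Mr", "Mrs", "Dr"} and y == "PERSON" else 0.0

def f_in_location(x, y):
    return 1.0 if x["prev"] == "in" and y == "LOCATION" else 0.0

def f_lower_other(x, y):
    return 1.0 if x["word"].islower() and y == "OTHER" else 0.0

FEATURES = [f_title_person, f_in_location, f_lower_other]
WEIGHTS = [2.0, 1.5, 1.0]  # assumed values, not trained

def p_label(x):
    """Model the probability p(y | x) as a normalized exponential of weighted features."""
    scores = {
        y: math.exp(sum(w * f(x, y) for w, f in zip(WEIGHTS, FEATURES)))
        for y in LABELS
    }
    z = sum(scores.values())  # normalizer Z(x)
    return {y: s / z for y, s in scores.items()}

def decide(x):
    """Decision rule: choose the label with the highest probability."""
    probs = p_label(x)
    return max(probs, key=probs.get)

print(decide({"word": "Smith", "prev": "Mr"}))
```

With these assumed weights, a capitalized word after "Mr" is labeled PERSON because only the first feature fires, and it fires only for that label.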
    • Template Filling
      A template is a frame (of a record structure), consisting
      of slots and fillers. A template denotes an event or a
      semantic concept.
      After extracting NEs, relations and events, IE fills an
      appropriate template
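
To illustrate the slot-and-filler idea, here is a minimal sketch of a template as a record with named slots. The `JointVentureTemplate` class, its slot names, and the extracted values are all invented for this example; a real system would define its slots from the task specification.

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class JointVentureTemplate:
    """A frame for one event; each field is a slot, None means unfilled."""
    partner_1: Optional[str] = None
    partner_2: Optional[str] = None
    product: Optional[str] = None
    location: Optional[str] = None

def fill_template(extractions):
    """Fill slots from (slot, filler) pairs; the first filler for a slot wins."""
    template = JointVentureTemplate()
    for slot, filler in extractions:
        if hasattr(template, slot) and getattr(template, slot) is None:
            setattr(template, slot, filler)
    return template

# Hypothetical output of earlier entity and relation extraction.
extractions = [
    ("partner_1", "Acme Corp"),
    ("partner_2", "Globex Ltd"),
    ("location", "Taiwan"),
    ("partner_1", "Initech"),  # ignored: slot already filled
]
filled = fill_template(extractions)
print(asdict(filled))
```

Slots with no extracted filler simply stay empty, which mirrors how a partially filled template looks in practice.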
    • Template filling techniques
      Two common approaches for template filling:
      – Statistical approach
      – Finite-state cascade approach
    • Statistical Approach
      Again, by using a sequence labeling method:
      Label sequences of tokens as potential fillers for a particular slot
      Train separate sequence classifiers for each slot
      Slots are filled with the text segments identified by each slot’s
      corresponding classifier
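
The per-slot labeling idea can be sketched as follows, assuming each slot's classifier emits one B/I/O label per token (B begins a filler, I continues it, O is outside). The classifier outputs are hard-coded here and the slot names are hypothetical; no actual model is trained.

```python
def spans_from_labels(tokens, labels):
    """Collect each B, I, I, ... run of tokens as one candidate filler."""
    spans, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B":
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif label == "I" and current:
            current.append(token)
        else:
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

tokens = "The seminar by John Smith starts at 3 pm".split()

# Assumed outputs of two separate per-slot sequence classifiers.
slot_labels = {
    "speaker":    ["O", "O", "O", "B", "I", "O", "O", "O", "O"],
    "start_time": ["O", "O", "O", "O", "O", "O", "O", "B", "I"],
}

# Each slot is filled with the segments found by its own classifier.
filled_slots = {
    slot: spans_from_labels(tokens, labels)
    for slot, labels in slot_labels.items()
}
print(filled_slots)
```

Because every slot has its own classifier, two classifiers can claim the same tokens, which is why a conflict-resolution step is needed.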
    • Statistical Approach
      – Resolve multiple labels assigned to the same/overlapping text
      segment by adding weights (heuristic confidence) to the slots
      – State-of-the-art performance: F1-measure of 75 to 98
      However, these methods have been shown to be effective only
      for small, homogeneous data
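
One simple way to resolve overlapping labels with confidence weights is a greedy pass: keep the highest-confidence candidate first and discard any later candidate that overlaps a kept span. This is only one possible heuristic, not necessarily the one behind the reported results; the candidate spans and confidence values below are invented.

```python
def resolve_overlaps(candidates):
    """Greedy resolution: prefer higher confidence, drop overlapping spans.

    Spans use token offsets with an exclusive end.
    """
    kept = []
    for cand in sorted(candidates, key=lambda c: c["conf"], reverse=True):
        no_overlap = all(
            cand["end"] <= k["start"] or cand["start"] >= k["end"] for k in kept
        )
        if no_overlap:
            kept.append(cand)
    return sorted(kept, key=lambda c: c["start"])

# Hypothetical candidates; "speaker" and "location" compete for token 4.
candidates = [
    {"slot": "speaker", "start": 3, "end": 5, "conf": 0.9},
    {"slot": "location", "start": 4, "end": 5, "conf": 0.4},
    {"slot": "start_time", "start": 7, "end": 9, "conf": 0.8},
]
resolved = resolve_overlaps(candidates)
print([c["slot"] for c in resolved])
```

The lower-confidence "location" candidate is dropped because it overlaps the "speaker" span that was kept first.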
    • Finite-State Template-Filling Systems
      Message Understanding Conferences (MUC) – the
      genesis of IE
      DARPA funded significant efforts in IE in the early to mid 1990s.
      MUC was an annual event/competition where results were
    • Finite-State Template-Filling Systems
      – Focused on extracting information from news articles:
      • Terrorist events (MUC-4, 1992)
      • Industrial joint ventures (MUC-5, 1993)
      • Company management changes (MUC-6, 1995)
      – Information extraction of particular interest to the intelligence
      community (CIA, NSA). (Note: early ’90s)
    • Applications
      Information extraction has a wide range of applications in:
      Search engines
      Biomedical field
      Customer profile analysis
      Trend analysis
      Information filtering and routing
      Event tracking
      News story classification
    • Conclusion
      In this presentation we covered:
      Goals of information extraction
      Entity Extraction: The Maximum Entropy method
      Template filling
    • Visit more self help tutorials
      Pick a tutorial of your choice and browse through it at your own pace.
      The tutorials section is free, self-guiding and will not involve any additional support.
      Visit us at www.dataminingtools.net