Textmining Information Extraction
 

Usage Rights

© All Rights Reserved

    Presentation Transcript

    • Text Mining: Information Extraction
    • Goals of information extraction
      “Processing of natural language texts for the extraction of relevant content pieces” (MARTÍ AND CASTELLÓN, 2000)
      Raw texts => structured databases
      Templates filling
      Improving search engines
      Auxiliary tool for other language applications
    • Named Entity Recognition
      Named entities are proper names in texts, i.e. the
      names of persons, organizations and locations, together
      with expressions of time and quantity.
      Named Entity Recognition (NER) is the task of processing
      a text and identifying its named entities.
    • Why is Named Entity Recognition difficult?
      - Names are too numerous to include in dictionaries
      - Variations
        e.g. John Smith, Mr Smith, John
      - Names change constantly
        new names introduce unknown words
      - Ambiguity
        for some proper nouns it is hard to determine the
        category
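
The first two difficulties can be illustrated with a minimal sketch of exact-match dictionary lookup. The `GAZETTEER` contents and the `lookup` helper are hypothetical and not from the slides; the point is only that a fixed dictionary misses variations of a name it already contains.

```python
# Tiny hypothetical dictionary (gazetteer) of known names and their categories.
GAZETTEER = {"John Smith": "PERSON", "Washington": "LOCATION"}

def lookup(mention):
    """Exact-match dictionary lookup; returns None for unknown mentions."""
    return GAZETTEER.get(mention)

# "Mr Smith" and "John" refer to the same person but are not in the dictionary.
for mention in ["John Smith", "Mr Smith", "John", "Washington"]:
    print(mention, "->", lookup(mention))
```

The entry for "Washington" also hints at the ambiguity problem: the dictionary can only return the one category it stores, although the name may refer to a person or a location depending on context.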
    • Example
      Delimit the named entities in a text and tag them with NE
      categories:
      – entity names - ENAMEX
      – temporal expressions - TIMEX
      – number expressions - NUMEX
      Subcategories of tags
      – captured by an SGML tag attribute called TYPE
    • Example
      Original text:
      The U.K. satellite television broadcaster said its subscriber base grew 17.5 percent during the past year to 5.35 million
      Tagged text:
      The <ENAMEX TYPE="LOCATION">U.K.</ENAMEX>
      satellite television broadcaster said its subscriber base grew
      <NUMEX TYPE="PERCENT">17.5 percent</NUMEX>
      during <TIMEX TYPE="DATE">the past year</TIMEX> to
      5.35 million
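
As a rough sketch of how such tagged output could be consumed downstream, the snippet below pulls the entities back out of the tagged sentence with a regular expression. The names `TAG_PATTERN`, `extract_entities` and `strip_tags` are illustrative assumptions, and a regex is only adequate here because the markup in this example is flat (no nested tags).

```python
import re

# Matches one element such as <ENAMEX TYPE="LOCATION">U.K.</ENAMEX>.
# The backreference \1 requires the closing tag to match the opening tag.
TAG_PATTERN = re.compile(
    r'<(ENAMEX|TIMEX|NUMEX) TYPE="([A-Z]+)">(.*?)</\1>', re.DOTALL
)

def extract_entities(tagged_text):
    """Return (tag, type, surface text) triples in document order."""
    return [(m.group(1), m.group(2), m.group(3))
            for m in TAG_PATTERN.finditer(tagged_text)]

def strip_tags(tagged_text):
    """Recover the untagged text by replacing each element with its content."""
    return TAG_PATTERN.sub(lambda m: m.group(3), tagged_text)

tagged = (
    'The <ENAMEX TYPE="LOCATION">U.K.</ENAMEX> satellite television '
    'broadcaster said its subscriber base grew '
    '<NUMEX TYPE="PERCENT">17.5 percent</NUMEX> during '
    '<TIMEX TYPE="DATE">the past year</TIMEX> to 5.35 million'
)

for tag, entity_type, text in extract_entities(tagged):
    print(f"{tag:7} {entity_type:9} {text}")
```

Calling `strip_tags(tagged)` gives back the original sentence, which is a quick consistency check on the markup.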
    • Maximum Entropy for NER
      Among the probability distributions that are consistent with the observed evidence, use the one that has maximum entropy, i.e. the one that is maximally uncertain
      • P = {models consistent with evidence}
      • H(p) = entropy of p
      • p_ME = argmax_{p ∈ P} H(p)
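
The principle can be tried out numerically. The sketch below assumes a made-up piece of evidence (the first of three categories has probability 0.5) and searches a coarse grid of distributions that satisfy it; the constraint, the grid and the function names are assumptions for illustration, not part of the slides.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in bits; zero-probability outcomes contribute nothing."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def consistent_models(n=50):
    """Enumerate distributions (0.5, b, 0.5 - b) on a grid of n + 1 points.

    This plays the role of P: every candidate agrees with the assumed
    evidence that the first category has probability 0.5.
    """
    for i in range(n + 1):
        b = 0.5 * i / n
        yield (0.5, b, 0.5 - b)

# p_ME = argmax over P of H(p)
p_me = max(consistent_models(), key=entropy)
print(p_me, entropy(p_me))
```

The search settles on the candidate that splits the remaining probability mass evenly between the two unconstrained categories, which is the "maximally uncertain" choice given the evidence.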
    • Maximum Entropy for NER
      Given a set of answer candidates
      Model the probability
      Define feature functions
      Decision Rule
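
The four steps above can be sketched as a small log-linear classifier, the usual form of a maximum entropy model: p(y|x) is proportional to exp(sum of w_i * f_i(x, y)), and the decision rule picks the most probable candidate. The label set, the feature functions and the weights below are all hypothetical and hand-picked; a real system would estimate the weights from annotated training data.

```python
import math

LABELS = ["PERSON", "LOCATION", "OTHER"]  # answer candidates (assumed label set)

# Each entry is (weight, feature function f(x, y)). x is a dict describing a
# token in context, y is a candidate label. Weights are hand-picked for
# illustration only, not trained.
WEIGHTED_FEATURES = [
    (1.0, lambda x, y: x["token"][0].isupper() and y == "PERSON"),
    (1.0, lambda x, y: x["token"][0].isupper() and y == "LOCATION"),
    (2.0, lambda x, y: x["prev"] in {"Mr", "Mrs", "Dr"} and y == "PERSON"),
    (2.0, lambda x, y: x["prev"] == "in" and y == "LOCATION"),
    (2.0, lambda x, y: x["token"][0].islower() and y == "OTHER"),
]

def score(x, y):
    """Sum of the weights of all features that fire for the pair (x, y)."""
    return sum(w for w, f in WEIGHTED_FEATURES if f(x, y))

def probabilities(x):
    """Model the probability: p(y|x) = exp(score(x, y)) / Z(x)."""
    unnormalized = {y: math.exp(score(x, y)) for y in LABELS}
    z = sum(unnormalized.values())
    return {y: value / z for y, value in unnormalized.items()}

def classify(x):
    """Decision rule: choose the candidate label with the highest probability."""
    p = probabilities(x)
    return max(p, key=p.get)

print(classify({"token": "Smith", "prev": "Mr"}))
print(classify({"token": "London", "prev": "in"}))
print(classify({"token": "grew", "prev": "base"}))
```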
    • Template Filling
      A template is a frame (of a record structure), consisting
      of slots and fillers. A template denotes an event or a
      semantic concept.
      After extracting NEs, relations and events, IE fills an
      appropriate template
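
A template can be pictured as a record type whose fields are the slots. The sketch below uses a hypothetical "management change" event; the slot names, the `fill_template` helper and the example fillers are invented for illustration and do not come from any MUC specification.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class ManagementChangeTemplate:
    """Hypothetical template: one slot per attribute, None means unfilled."""
    organization: Optional[str] = None
    position: Optional[str] = None
    person_in: Optional[str] = None
    person_out: Optional[str] = None

    def unfilled_slots(self):
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

def fill_template(extractions):
    """Fill slots from (slot, filler) pairs.

    The first filler seen for a slot wins; pairs naming unknown slots are ignored.
    """
    template = ManagementChangeTemplate()
    slot_names = {f.name for f in fields(template)}
    for slot, filler in extractions:
        if slot in slot_names and getattr(template, slot) is None:
            setattr(template, slot, filler)
    return template

# Invented extraction results standing in for the output of earlier IE stages.
template = fill_template([
    ("organization", "Acme Corp"),
    ("position", "chief executive"),
    ("person_in", "Jane Doe"),
])
print(template)
print("Unfilled:", template.unfilled_slots())
```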
    • Template filling techniques
      Two common approaches for template filling:
      – Statistical approach
      – Finite-state cascade approach
    • Statistical Approach
      Again, by using a sequence labeling method:
      – Label sequences of tokens as potential fillers for a particular slot
      – Train separate sequence classifiers for each slot
      – Slots are filled with the text segments identified by each slot’s
      corresponding classifier
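
A minimal sketch of the last step, turning per-slot label sequences into slot fillers, is given below. It assumes a B/I/O labeling scheme (begin, inside, outside), which the slides do not specify, and it uses hard-coded label sequences in place of the output of trained classifiers.

```python
def decode_segments(tokens, labels):
    """Collect token spans labeled B (begin) followed by I (inside); O is outside."""
    segments, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B":
            if current:
                segments.append(" ".join(current))
            current = [token]
        elif label == "I" and current:
            current.append(token)
        else:
            # "O", or a stray "I" with no open segment, closes any open segment.
            if current:
                segments.append(" ".join(current))
            current = []
    if current:
        segments.append(" ".join(current))
    return segments

def fill_slots(tokens, labels_by_slot):
    """Apply each slot's label sequence to the same tokens."""
    return {slot: decode_segments(tokens, labels)
            for slot, labels in labels_by_slot.items()}

tokens = "Jane Doe was named chief executive of Acme Corp".split()
# Stand-ins for the predictions of three separate per-slot classifiers.
labels_by_slot = {
    "person":       ["B", "I", "O", "O", "O", "O", "O", "O", "O"],
    "position":     ["O", "O", "O", "O", "B", "I", "O", "O", "O"],
    "organization": ["O", "O", "O", "O", "O", "O", "O", "B", "I"],
}
print(fill_slots(tokens, labels_by_slot))
```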
    • Statistical Approach
      – Resolve multiple labels assigned to the same/overlapping text
      segment by adding weights (heuristic confidence) to the slots
      – State-of-the-art performance: F1-measure of 75 to 98 percent
      However, these methods have been shown to be effective only
      on small, homogeneous data sets
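
One possible reading of the overlap resolution step is a greedy pass over the candidates in order of confidence. This is an assumed strategy for illustration, not necessarily the one used by the systems the slides refer to, and the candidate spans and confidence values are invented.

```python
def resolve_overlaps(candidates):
    """Keep the most confident candidates and drop any that overlap a kept one.

    candidates: list of (start, end, slot, confidence) tuples, where start and
    end are token offsets and end is exclusive.
    """
    kept = []
    for cand in sorted(candidates, key=lambda c: c[3], reverse=True):
        start, end = cand[0], cand[1]
        # Two spans are disjoint if one ends before the other begins.
        if all(end <= k[0] or start >= k[1] for k in kept):
            kept.append(cand)
    return sorted(kept)

candidates = [
    (0, 2, "person_in", 0.9),
    (0, 2, "person_out", 0.4),     # same segment, lower confidence
    (4, 6, "position", 0.8),
    (5, 8, "organization", 0.6),   # overlaps the position span
    (7, 9, "organization", 0.7),
]
for span in resolve_overlaps(candidates):
    print(span)
```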
    • Finite-State Template-Filling Systems
      Message Understanding Conferences (MUC): the
      genesis of IE
      DARPA funded significant efforts in IE in the early to mid 1990s.
      MUC was a recurring evaluation/competition where results were
      presented.
    • Finite-State Template-Filling Systems
      – Focused on extracting information from news articles:
      • Terrorist events (MUC-4, 1992)
      • Industrial joint ventures (MUC-5, 1993)
      • Company management changes (MUC-6, 1995)
      – Information extraction was of particular interest to the intelligence
      community (CIA, NSA). (Note: early 1990s)
    • Applications
      Information extraction has a wide range of applications:
      Search engines
      Biomedical field
      Customer profile analysis
      Trend analysis
      Information filtering and routing
      Event tracking
      News story classification
    • Conclusion
      In this presentation we covered:
      Goals of information extraction
      Entity Extraction: The Maximum Entropy method
      Template filling
      Applications
    • Visit more self-help tutorials
      Pick a tutorial of your choice and browse through it at your own pace.
      The tutorials section is free, self-guiding and will not involve any additional support.
      Visit us at www.dataminingtools.net