Text Mining: Information Extraction

Transcript

  • 1. Text Mining: Information Extraction
  • 2. Goals of information extraction
    “Processing of natural language texts for the extraction of relevant content pieces” (Martí and Castellón, 2000)
    Raw texts => structured databases
    Template filling
    Improving search engines
    Auxiliary tool for other language applications
  • 3. Named Entity Recognition
    Named entities are proper names in texts, i.e. the names of persons, organizations, locations, times and quantities.
    NER is the task of processing a text and identifying the named entities it contains (see the sketch below).
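
As a minimal illustration (not part of the original slides), the sketch below runs an off-the-shelf NER component over a sentence. The spaCy library and the en_core_web_sm model are assumptions about the local setup, not tools named in the presentation.

```python
# Minimal NER sketch using spaCy (an assumed toolkit; the slides do not name one).
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline with a statistical NER component
doc = nlp("John Smith joined the U.K. broadcaster in 1998 for 5.35 million pounds.")

for ent in doc.ents:
    # ent.label_ is the predicted category, e.g. PERSON, GPE (location), DATE, MONEY
    print(ent.text, ent.label_)
```
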
  • 4. Why is Named Entity Recognition difficult?
    - Names are too numerous to include in dictionaries
    - Variation, e.g. John Smith, Mr Smith, John
    - Constant change: new names introduce previously unseen words
    - Ambiguity: for some proper nouns it is hard to determine the category
  • 5. Example
    Delimit the named entities in a text and tag them with NE categories:
    – entity names - ENAMEX
    – temporal expressions - TIMEX
    – number expressions - NUMEX
    Subcategories of tags
    – captured by an SGML tag attribute called TYPE
  • 6. Example
    Original text:
    The U.K. satellite television broadcaster said its subscriber base grew 17.5 percent during the past year to 5.35 million
    • Tagged text:
    The <ENAMEX TYPE="LOCATION">U.K.</ENAMEX>
    satellite television broadcaster said its subscriber base grew
    <NUMEX TYPE="PERCENT">17.5 percent</NUMEX>
    during <TIMEX TYPE="DATE">the past year</TIMEX> to
    5.35 million
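
As an aside that is not part of the original slides, the sketch below reproduces this kind of MUC-style markup by wrapping known entity spans in ENAMEX/TIMEX/NUMEX tags. The span list is hard-coded to match the slide's example; in practice it would come from an NER system.

```python
# Sketch: wrap known entity spans in MUC-style SGML tags (ENAMEX / TIMEX / NUMEX).
# The annotations below are hard-coded assumptions mirroring the slide's example.
text = ("The U.K. satellite television broadcaster said its subscriber base "
        "grew 17.5 percent during the past year to 5.35 million")

# (surface string, tag, TYPE attribute)
entities = [
    ("U.K.", "ENAMEX", "LOCATION"),
    ("17.5 percent", "NUMEX", "PERCENT"),
    ("the past year", "TIMEX", "DATE"),
]

tagged = text
for surface, tag, etype in entities:
    # Replace only the first occurrence of each span with its tagged form.
    tagged = tagged.replace(surface, f'<{tag} TYPE="{etype}">{surface}</{tag}>', 1)

print(tagged)
```
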
  • 7. Maximum Entropy for NER
    Use the probability distribution that has maximum entropy, i.e. is maximally uncertain, among those consistent with the observed evidence:
    • P = {models consistent with the evidence}
    • H(p) = entropy of p
    • p_ME = argmax_{p ∈ P} H(p)
  • 8. Maximum Entropy for NER
    Given a set of candidate answers:
    Model the conditional probability of each candidate
    Define feature functions over the observed evidence
    Apply a decision rule to choose the most probable label (see the sketch below)
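
A hedged sketch of this recipe, assuming scikit-learn: multinomial logistic regression is a maximum-entropy classifier, so a per-token maxent NER tagger can be approximated with LogisticRegression over hand-written feature functions. The toy training data, feature set and labels below are illustrative assumptions, not material from the slides.

```python
# Sketch of a maximum-entropy (multinomial logistic regression) tagger for NER.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def features(tokens, i):
    """Simple per-token feature functions: word shape and immediate context."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_title": tok.istitle(),
        "is_upper": tok.isupper(),
        "prev": tokens[i - 1].lower() if i > 0 else "<S>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</S>",
    }

# Tiny toy training set: tokens labelled with NE categories (O = not an entity).
train = [
    (["John", "Smith", "visited", "London", "yesterday"],
     ["PERSON", "PERSON", "O", "LOCATION", "O"]),
    (["Mr", "Smith", "works", "for", "IBM"],
     ["O", "PERSON", "O", "O", "ORGANIZATION"]),
]

X, y = [], []
for tokens, labels in train:
    for i, label in enumerate(labels):
        X.append(features(tokens, i))
        y.append(label)

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000)  # multinomial logistic regression == maxent classifier
clf.fit(vec.fit_transform(X), y)

# Decision rule: take the most probable label for each token.
test = ["Mary", "Jones", "flew", "to", "Paris"]
pred = clf.predict(vec.transform([features(test, i) for i in range(len(test))]))
print(list(zip(test, pred)))
```

In practice the features and training data would be far richer; the point is only the shape of the pipeline: feature functions, a probability model, and an argmax decision rule.
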
  • 9. Template Filling
    A template is a frame (a record-like structure) consisting of slots and fillers; it denotes an event or a semantic concept.
    After extracting named entities, relations and events, IE fills the appropriate template.
  • 10. Template filling techniques
    Two common approaches to template filling:
    – Statistical approach
    – Finite-state cascade approach
  • 11. Statistical Approach
    Again, using a sequence labeling method:
    Label sequences of tokens as potential fillers for a particular slot
    Train a separate sequence classifier for each slot
    Fill each slot with the text segments identified by its corresponding classifier (see the sketch below)
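
A hedged sketch of the sequence-labeling idea, assuming the sklearn_crfsuite package and BIO-style labels for a single hypothetical "company" slot; neither the library nor the toy data come from the slides. A real system would train one such classifier per template slot.

```python
# Sketch: slot filling as sequence labeling with a CRF (sklearn_crfsuite is an assumption).
import sklearn_crfsuite

def token_features(tokens, i):
    """Per-token features for the sequence labeler."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_title": tok.istitle(),
        "prev": tokens[i - 1].lower() if i > 0 else "<S>",
    }

# Toy training sentences with BIO labels for one hypothetical slot ("company").
train_sents = [
    (["Acme", "Corp", "appointed", "a", "new", "CEO"],
     ["B-company", "I-company", "O", "O", "O", "O"]),
    (["The", "deal", "involves", "Globex", "Inc"],
     ["O", "O", "O", "B-company", "I-company"]),
]

X_train = [[token_features(toks, i) for i in range(len(toks))] for toks, _ in train_sents]
y_train = [labels for _, labels in train_sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)

# Label an unseen sentence; tokens tagged B-/I-company would fill the slot.
test = ["Initech", "Ltd", "announced", "a", "merger"]
print(list(zip(test, crf.predict([[token_features(test, i) for i in range(len(test))]])[0])))
```
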
  • 12. Statistical Approach
    – Multiple labels assigned to the same or overlapping text segments are resolved by adding weights (heuristic confidence) to the slots
    – State-of-the-art performance: F1 scores in the range of 75 to 98
    However, these methods have been shown to be effective only for small, homogeneous data sets
  • 13. Finite-State Template-Filling Systems
    The Message Understanding Conferences (MUC) were the genesis of IE.
    DARPA funded significant IE efforts in the early to mid 1990s, and MUC was an annual event/competition where results were presented.
  • 14. Finite-State Template-Filling Systems
    – Focused on extracting information from news articles:
    • Terrorist events (MUC-4, 1992)
    • Industrial joint ventures (MUC-5, 1993)
    • Company management changes
    – Information extraction was of particular interest to the intelligence community (CIA, NSA) in the early 1990s.
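
To complement the statistical sketch above, here is a miniature illustration of the finite-state, pattern-based style of template filling: a hand-written regular expression maps matched text into the slots of a management-change template. The pattern and slot names are illustrative assumptions, not the design of any actual MUC-era system.

```python
# Miniature finite-state-style template filler: a regular-expression pattern
# maps matched text into the slots of a hypothetical "management change" template.
import re

PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+) (?:was named|became) "
    r"(?P<position>[A-Za-z ]+?) of (?P<company>[A-Z][A-Za-z ]+)"
)

def fill_template(sentence):
    """Return a filled template dict, or None if the pattern does not match."""
    m = PATTERN.search(sentence)
    if not m:
        return None
    # Each named group fills one slot of the template.
    return {"PERSON": m.group("person"),
            "POSITION": m.group("position"),
            "COMPANY": m.group("company")}

print(fill_template("Jane Doe was named chief executive of Initech Ltd"))
```

Real finite-state cascades chain many such pattern layers (tokenization, phrase chunking, event patterns), but the slot-filling step has this basic shape.
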
  • 15. Applications
    Information extraction has a wide range of applications, including:
    Search engines
    The biomedical field
    Customer profile analysis
    Trend analysis
    Information filtering and routing
    Event tracking
    News story classification
  • 16. Conclusion
    In this presentation we covered:
    Goals of information extraction
    Entity extraction: the Maximum Entropy method
    Template filling
    Applications
  • 17. Visit more self-help tutorials
    Pick a tutorial of your choice and browse through it at your own pace.
    The tutorials section is free, self-guiding and will not involve any additional support.
    Visit us at www.dataminingtools.net