Textmining Information Extraction

Text Mining: Information Extraction

Goals of information extraction
“Processing of natural language texts for the extraction of relevant content pieces” (MARTÍ AND CASTELLÓN, 2000)
– Raw texts => structured databases
– Template filling
– Improving search engines
– Auxiliary tool for other language applications

Named Entity Recognition
Named entities are proper names in texts, i.e. the names of persons, organizations and locations, together with expressions of time and quantity. Named entity recognition (NER) is the task of processing a text and identifying the named entities in it.

Why is Named Entity Recognition difficult?
– Names are too numerous to include in dictionaries
– Variations, e.g. John Smith, Mr Smith, John
– Names change constantly: newly invented names appear as unknown words
– Ambiguity: for some proper nouns it is hard to determine the category

Example
Delimit the named entities in a text and tag them with NE categories:
– entity names: ENAMEX
– temporal expressions: TIMEX
– number expressions: NUMEX
Subcategories of tags are captured by an SGML tag attribute called TYPE.

Example
Original text: The U.K. satellite television broadcaster said its subscriber base grew 17.5 percent during the past year to 5.35 million
Tagged text: The <ENAMEX TYPE="LOCATION">U.K.</ENAMEX> satellite television broadcaster said its subscriber base grew <NUMEX TYPE="PERCENT">17.5 percent</NUMEX> during <TIMEX TYPE="DATE">the past year</TIMEX> to 5.35 million
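
To make the tagging scheme concrete, here is a minimal Python sketch that reads the tagged sentence from this slide and lists each entity with its category and TYPE attribute. The names `TAG_PATTERN` and `extract_entities` are illustrative assumptions, not part of any MUC tool, and the regular expression only handles simple, non-nested tags like the ones shown here.

```python
import re

# Matches MUC-style tags such as <ENAMEX TYPE="LOCATION">U.K.</ENAMEX>.
# The backreference \1 requires the closing tag to match the opening one.
TAG_PATTERN = re.compile(r'<(ENAMEX|TIMEX|NUMEX) TYPE="([^"]+)">(.*?)</\1>')


def extract_entities(tagged_text):
    """Return (category, type, text) triples in order of appearance."""
    return [
        (m.group(1), m.group(2), m.group(3))
        for m in TAG_PATTERN.finditer(tagged_text)
    ]


tagged = (
    'The <ENAMEX TYPE="LOCATION">U.K.</ENAMEX> satellite television '
    'broadcaster said its subscriber base grew '
    '<NUMEX TYPE="PERCENT">17.5 percent</NUMEX> during '
    '<TIMEX TYPE="DATE">the past year</TIMEX> to 5.35 million'
)

entities = extract_entities(tagged)
for category, subtype, text in entities:
    print(category, subtype, text)
```

Note that "5.35 million" carries no tag in the slide's example, so the sketch does not report it.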

Maximum Entropy for NER
Use the probability distribution that has maximum entropy, i.e. that is maximally uncertain, among those that are consistent with the observed evidence.
• P = {models consistent with evidence}
• H(p) = entropy of p
• p_ME = argmax_{p ∈ P} H(p)
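
The selection rule can be illustrated with a small Python sketch. It assumes a made-up piece of evidence (the first label has probability 0.5) and a short, hand-picked list of candidate distributions standing in for the set P; in practice P is infinite and the maximum is found by optimization, not by enumeration.

```python
import math


def entropy(p):
    """Shannon entropy H(p) in bits; zero-probability outcomes contribute nothing."""
    return -sum(q * math.log2(q) for q in p if q > 0)


# Hypothetical evidence: the first label (say PERSON) has probability 0.5.
# Each candidate is a distribution over (PERSON, LOCATION, ORGANIZATION)
# that satisfies this constraint.
candidates = [
    (0.5, 0.5, 0.0),
    (0.5, 0.4, 0.1),
    (0.5, 0.25, 0.25),
    (0.5, 0.1, 0.4),
]

# p_ME = argmax over the candidate set of H(p)
p_me = max(candidates, key=entropy)
print(p_me, entropy(p_me))
```

Among these candidates, the one that spreads the remaining probability mass evenly over the unconstrained labels is selected, which is the sense in which the chosen model is "maximally uncertain".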

Maximum Entropy for NER
Given a set of answer candidates:
– Model the probability
– Define feature functions
– Apply a decision rule
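
The three steps can be sketched in Python as follows. The slide does not give a formula, so this sketch assumes the standard log-linear form of a maximum entropy classifier, p(y | x) = exp(Σ w_i f_i(x, y)) / Z(x). The labels, feature functions and weights are all hypothetical; real weights would be learned from annotated data.

```python
import math

LABELS = ["PERSON", "LOCATION", "OTHER"]


# Hypothetical binary feature functions f_i(x, y) over a token context x
# and a candidate label y.
def f_title_before(x, y):
    return 1.0 if x["prev"] in {"Mr", "Mrs", "Dr"} and y == "PERSON" else 0.0


def f_in_before(x, y):
    return 1.0 if x["prev"] == "in" and y == "LOCATION" else 0.0


def f_lowercase(x, y):
    return 1.0 if x["word"].islower() and y == "OTHER" else 0.0


FEATURES = [f_title_before, f_in_before, f_lowercase]
WEIGHTS = [2.0, 1.5, 1.0]  # made-up values; normally learned from training data


def score(x, y):
    """Weighted sum of the feature functions for one (context, label) pair."""
    return sum(w * f(x, y) for w, f in zip(WEIGHTS, FEATURES))


def probabilities(x):
    """Model the probability: p(y | x) = exp(score(x, y)) / Z(x)."""
    exp_scores = {y: math.exp(score(x, y)) for y in LABELS}
    z = sum(exp_scores.values())
    return {y: s / z for y, s in exp_scores.items()}


def decide(x):
    """Decision rule: pick the label with the highest probability."""
    probs = probabilities(x)
    return max(probs, key=probs.get)


context = {"word": "Smith", "prev": "Mr"}
print(decide(context), probabilities(context))
```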

Template Filling
A template is a frame (a record structure) consisting of slots and fillers. A template denotes an event or a semantic concept. After extracting NEs, relations and events, IE fills an appropriate template.
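
A template can be pictured as a record whose fields are the slots. The Python sketch below uses a hypothetical `ManagementChange` template with invented slot names and example fillers; it only illustrates the slot/filler idea and is not taken from any particular IE system.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ManagementChange:
    """Hypothetical template: each field is a slot, each value a filler."""
    person: Optional[str] = None
    organization: Optional[str] = None
    position: Optional[str] = None
    date: Optional[str] = None


def fill_template(entities):
    """Fill slots from (slot_name, text) pairs; the first filler per slot wins."""
    template = ManagementChange()
    for slot, text in entities:
        if hasattr(template, slot) and getattr(template, slot) is None:
            setattr(template, slot, text)
    return template


# Invented extraction output for illustration.
extracted = [
    ("person", "Jane Doe"),
    ("organization", "Acme Corp"),
    ("position", "chief executive"),
    ("person", "Ms Doe"),
]

record = fill_template(extracted)
print(asdict(record))
```

Slots with no extracted filler, such as `date` here, simply stay empty. Keeping only the first filler per slot is a simplifying assumption.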

Template filling techniques
Two common approaches for template filling:
– Statistical approach
– Finite-state cascade approach

Statistical Approach
Again, by using a sequence labeling method:
– Label sequences of tokens as potential fillers for a particular slot
– Train separate sequence classifiers for each slot
– Slots are filled with the text segments identified by each slot’s corresponding classifier
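
The last step, turning per-slot label sequences into fillers, can be sketched in Python. The classifier outputs below are hard-coded, hypothetical B/I/O labels (begin, inside, outside), since training the classifiers is beyond a short example. The slot names and the sentence are invented.

```python
def spans_from_labels(tokens, labels):
    """Collect maximal runs of tokens labeled B (begin) / I (inside) as text segments."""
    spans, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B":
            # A new segment starts; close any segment still open.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif label == "I" and current:
            current.append(token)
        else:
            # "O", or a stray "I" with no open segment.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


tokens = ["Acme", "Corp", "named", "Jane", "Doe", "chief", "executive"]

# Hypothetical outputs of three per-slot sequence classifiers (one label per token).
slot_labels = {
    "organization": ["B", "I", "O", "O", "O", "O", "O"],
    "person":       ["O", "O", "O", "B", "I", "O", "O"],
    "position":     ["O", "O", "O", "O", "O", "B", "I"],
}

filled = {
    slot: spans_from_labels(tokens, labels)
    for slot, labels in slot_labels.items()
}
print(filled)
```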

Statistical Approach
– Resolve multiple labels assigned to the same or overlapping text segments by adding weights (heuristic confidence) to the slots
– State-of-the-art performance: F1-measure of 75 to 98
However, these methods have been shown to be effective only for small, homogeneous data.
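
One possible reading of the weighting step is a greedy resolution by confidence, sketched below in Python. The slide does not specify the procedure, so the greedy strategy, the confidence values and the half-open token spans are all assumptions made for illustration.

```python
def overlaps(a, b):
    """True if half-open token spans (start, end) share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def resolve(candidates):
    """Greedy: keep higher-confidence candidates, drop any that overlap a kept one."""
    kept = []
    for cand in sorted(candidates, key=lambda c: c["confidence"], reverse=True):
        if not any(overlaps(cand["span"], k["span"]) for k in kept):
            kept.append(cand)
    # Return the survivors in text order.
    return sorted(kept, key=lambda c: c["span"])


# Invented candidates: two slots claim overlapping tokens 4..5.
candidates = [
    {"slot": "person", "span": (3, 5), "confidence": 0.9},
    {"slot": "organization", "span": (4, 6), "confidence": 0.4},
    {"slot": "organization", "span": (0, 2), "confidence": 0.8},
]

resolved = resolve(candidates)
for cand in resolved:
    print(cand["slot"], cand["span"], cand["confidence"])
```

Here the lower-confidence organization candidate is dropped because it overlaps the person candidate, while the non-overlapping organization candidate is kept.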

Finite-State Template-Filling Systems
Message Understanding Conferences (MUC): the genesis of IE
– DARPA funded significant efforts in IE in the early to mid 1990s.
– MUC was an annual event/competition where results were presented.

Finite-State Template-Filling Systems
– Focused on extracting information from news articles:
  • Terrorist events (MUC-4, 1992)
  • Industrial joint ventures (MUC-5, 1993)
  • Company management changes
– Information extraction was of particular interest to the intelligence community (CIA, NSA). (Note: early ’90s)

Applications
Information extraction has a wide range of applications:
– Search engines
– Biomedical field
– Customer profile analysis
– Trend analysis
– Information filtering and routing
– Event tracking
– News story classification

Conclusion
In this presentation we studied:
– Goals of information extraction
– Entity extraction: the maximum entropy method
– Template filling
– Applications

Visit more self help tutorials Pick a tutorial of your choice and browse through it at your own pace. The tutorials section is free, self-guiding and will not involve any additional support. Visit us at www.dataminingtools.net
