Text Mining: Information Extraction
Goals of information extraction
"Processing of natural language texts for the extraction of relevant content pieces" (Martí and Castellón, 2000)
– Raw texts => structured databases
– Template filling
– Improving search engines
– Auxiliary tool for other language applications
Named Entity Recognition
Named entities are proper names in texts, i.e. the names of persons, organizations, locations, times and quantities. NER is the task of processing a text and identifying named entities.
Why is Named Entity Recognition difficult?
– Names are too numerous to include in dictionaries
– Variation: e.g. John Smith, Mr Smith, John
– Constant change: new names introduce unknown words
– Ambiguity: for some proper nouns it is hard to determine the category
Example
Delimit the named entities in a text and tag them with NE categories:
– entity names – ENAMEX
– temporal expressions – TIMEX
– number expressions – NUMEX
Subcategories of tags are captured by an SGML tag attribute called TYPE.
Example
Original text:
The U.K. satellite television broadcaster said its subscriber base grew 17.5 percent during the past year to 5.35 million
Tagged text:
The <ENAMEX TYPE="LOCATION">U.K.</ENAMEX> satellite television broadcaster said its subscriber base grew <NUMEX TYPE="PERCENT">17.5 percent</NUMEX> during <TIMEX TYPE="DATE">the past year</TIMEX> to 5.35 million
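Annotations in this SGML style are easy to read back out of a tagged text. A minimal sketch with a regular expression (the helper name `extract_entities` is illustrative, not from the slides):

```python
import re

# Matches MUC-style annotations: an opening tag (ENAMEX, TIMEX or NUMEX)
# with a TYPE attribute, the tagged span, and the matching closing tag.
TAG_PATTERN = re.compile(r'<(ENAMEX|TIMEX|NUMEX) TYPE="([^"]+)">(.*?)</\1>')

def extract_entities(tagged_text):
    """Return (tag, type, surface form) triples from a MUC-tagged string."""
    return TAG_PATTERN.findall(tagged_text)

tagged = ('The <ENAMEX TYPE="LOCATION">U.K.</ENAMEX> satellite television '
          'broadcaster said its subscriber base grew '
          '<NUMEX TYPE="PERCENT">17.5 percent</NUMEX> during '
          '<TIMEX TYPE="DATE">the past year</TIMEX> to 5.35 million')

print(extract_entities(tagged))
# -> [('ENAMEX', 'LOCATION', 'U.K.'),
#     ('NUMEX', 'PERCENT', '17.5 percent'),
#     ('TIMEX', 'DATE', 'the past year')]
```

The backreference `\1` ensures the closing tag matches the opening one, so a stray `</TIMEX>` cannot close an `<ENAMEX>` span.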
Maximum Entropy for NER
Use the probability distribution that has maximum entropy, i.e. that is maximally uncertain, among those consistent with the observed evidence:
– P = {models consistent with the evidence}
– H(p) = entropy of p
– P_ME = argmax_{p ∈ P} H(p)
Maximum Entropy for NER
– Given a set of answer candidates
– Model the probability of each candidate
– Define feature functions
– Apply a decision rule to pick the best candidate
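The argmax above can be illustrated numerically: among candidate models consistent with the evidence, pick the one with the highest entropy. A minimal sketch with hand-picked candidate distributions (the distributions are illustrative, not from the slides; with no constraints, the uniform distribution wins):

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum_x p(x) log2 p(x)."""
    return -sum(px * math.log2(px) for px in p if px > 0)

# Candidate models over four outcomes, all consistent with the
# (here trivial) evidence that probabilities sum to 1:
uniform = [0.25, 0.25, 0.25, 0.25]
skewed  = [0.70, 0.10, 0.10, 0.10]
peaked  = [1.00, 0.00, 0.00, 0.00]

# P_ME = argmax_{p in P} H(p)
p_me = max([uniform, skewed, peaked], key=entropy)
print(p_me, entropy(p_me))   # the uniform model, with H = 2.0 bits
```

Real maximum-entropy NER taggers solve this optimization over distributions constrained to match observed feature expectations, which yields a log-linear model rather than a search over a finite candidate list.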
Template Filling
A template is a frame (a record structure) consisting of slots and fillers. A template denotes an event or a semantic concept. After extracting named entities, relations and events, IE fills an appropriate template.
Template filling techniques
Two common approaches to template filling:
– Statistical approach
– Finite-state cascade approach
Statistical Approach
Again, by using a sequence labeling method:
– Label sequences of tokens as potential fillers for a particular slot
– Train a separate sequence classifier for each slot
– Slots are filled with the text segments identified by each slot's corresponding classifier
Statistical Approach
– Resolve multiple labels assigned to the same or overlapping text segments by adding weights (heuristic confidence) to the slots
– State-of-the-art performance: F1-measures of 75 to 98
However, these methods have been shown to be effective only for small, homogeneous data.
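The per-slot classifiers and confidence weights above can be sketched as follows. This is a toy illustration: the pattern-based "classifiers", slot names and weights are stand-ins for trained sequence labelers, chosen only to show how overlapping claims are resolved by confidence:

```python
import re

# One "classifier" per slot (here a regex stand-in for a trained
# sequence labeler), paired with a heuristic confidence weight.
SLOT_MODELS = {
    "growth": (re.compile(r"\d+\.\d+ percent"), 0.9),
    "period": (re.compile(r"the past year"),    0.8),
}

def fill_template(text):
    """Fill slots, letting higher-confidence slots claim spans first."""
    template = {}
    claimed = []                                   # spans already assigned
    for slot, (pattern, weight) in sorted(
            SLOT_MODELS.items(), key=lambda kv: -kv[1][1]):
        m = pattern.search(text)
        # Skip a match whose span overlaps one claimed by a stronger slot.
        if m and not any(s < m.end() and m.start() < e for s, e in claimed):
            template[slot] = m.group()
            claimed.append((m.start(), m.end()))
    return template

text = "its subscriber base grew 17.5 percent during the past year"
print(fill_template(text))
# -> {'growth': '17.5 percent', 'period': 'the past year'}
```

In a real system each slot's classifier would emit labels over token sequences with learned confidences; only the conflict-resolution idea carries over from this sketch.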
Finite-State Template-Filling Systems
The Message Understanding Conferences (MUC) were the genesis of IE. DARPA funded significant efforts in IE in the early to mid 1990s. MUC was an annual event/competition where results were presented.
Finite-State Template-Filling Systems
– Focused on extracting information from news articles:
  • Terrorist events (MUC-4, 1992)
  • Industrial joint ventures (MUC-5, 1993)
  • Company management changes
– Information extraction was of particular interest to the intelligence community (CIA, NSA) in the early 1990s.
Applications
Information extraction has a wide range of applications:
– Search engines
– The biomedical field
– Customer profile analysis
– Trend analysis
– Information filtering and routing
– Event tracking
– Classification of news stories
Conclusion
In this presentation we covered:
– Goals of information extraction
– Entity extraction: the Maximum Entropy method
– Template filling
– Applications