Textmining Information Extraction


Introduction to Text Mining: Information Extraction

Published in: Technology, Education

  1. Text Mining: Information Extraction
  2. Goals of information extraction
     “Processing of natural language texts for the extraction of relevant content pieces” (MARTÍ AND CASTELLÓN, 2000)
     – Raw texts => structured databases
     – Template filling
     – Improving search engines
     – Auxiliary tool for other language applications
  3. Named Entity Recognition
     Named entities are proper names in texts, i.e. the names of persons, organizations, locations, times and quantities.
     Named Entity Recognition (NER) is the task of processing a text and identifying the named entities in it.
  4. Why is Named Entity Recognition difficult?
     – Names are too numerous to include in dictionaries
     – Variations: the same entity appears in several forms, e.g. John Smith, Mr Smith, John
     – Constant change: new names introduce previously unknown words
     – Ambiguity: for some proper nouns it is hard to determine the category
  5. Example
     Delimit the named entities in a text and tag them with NE categories:
     – entity names: ENAMEX
     – temporal expressions: TIMEX
     – number expressions: NUMEX
     Subcategories of tags
     – captured by an SGML tag attribute called TYPE
  6. Example
     Original text:
     The U.K. satellite television broadcaster said its subscriber base grew 17.5 percent during the past year to 5.35 million
     Tagged text:
     The <ENAMEX TYPE="LOCATION">U.K.</ENAMEX> satellite television broadcaster said its subscriber base grew <NUMEX TYPE="PERCENT">17.5 percent</NUMEX> during <TIMEX TYPE="DATE">the past year</TIMEX> to 5.35 million
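To make the tagging scheme on this slide concrete, here is a minimal sketch that reads the inline annotations back out of the tagged sentence. The names `NE_PATTERN`, `extract_entities` and `strip_tags` are illustrative and not part of any MUC tooling, and the regular expression assumes flat, non-nested tags with exactly one TYPE attribute, as in the example.

```python
import re

# Matches inline annotations such as <ENAMEX TYPE="LOCATION">U.K.</ENAMEX>.
# Assumption: tags are not nested and carry exactly one TYPE attribute.
NE_PATTERN = re.compile(r'<(ENAMEX|TIMEX|NUMEX) TYPE="([A-Z]+)">(.*?)</\1>')


def extract_entities(tagged_text):
    """Return (tag, type, surface text) triples in order of appearance."""
    return [m.groups() for m in NE_PATTERN.finditer(tagged_text)]


def strip_tags(tagged_text):
    """Recover the original text by replacing each annotation with its content."""
    return NE_PATTERN.sub(lambda m: m.group(3), tagged_text)


tagged = (
    'The <ENAMEX TYPE="LOCATION">U.K.</ENAMEX> satellite television '
    'broadcaster said its subscriber base grew '
    '<NUMEX TYPE="PERCENT">17.5 percent</NUMEX> during '
    '<TIMEX TYPE="DATE">the past year</TIMEX> to 5.35 million'
)

for tag, ne_type, text in extract_entities(tagged):
    print(tag, ne_type, text)
```

Reading annotations this way is only practical for flat markup like the example; nested or malformed tags would call for a proper SGML/XML parser.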
  7. Maximum Entropy for NER
     Among all probability distributions that are consistent with the observed evidence, use the one with maximum entropy, i.e. the one that is maximally uncertain.
     • $P$ = {models consistent with the evidence}
     • $H(p)$ = entropy of $p$
     • $p_{ME} = \arg\max_{p \in P} H(p)$
  8. Maximum Entropy for NER
     – Start from a set of answer candidates
     – Model the probability of each candidate
     – Define feature functions
     – Apply a decision rule
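The four steps on this slide can be sketched as a tiny log-linear classifier, which is the standard form a maximum entropy model takes when the evidence is expressed as feature expectations. Everything specific here is an assumption made for illustration: the label set, the feature names, and above all the hand-picked weights, which a real system would estimate from annotated training data.

```python
import math

# Step 1: the set of answer candidates (hypothetical label set).
LABELS = ["PERSON", "LOCATION", "O"]


def features(token, prev_token, label):
    """Step 3: binary feature functions f_i(x, y), keyed by an illustrative name."""
    return {
        "cap&PERSON": token[0].isupper() and label == "PERSON",
        "cap&LOCATION": token[0].isupper() and label == "LOCATION",
        "prev=Mr&PERSON": prev_token == "Mr" and label == "PERSON",
        "prev=in&LOCATION": prev_token == "in" and label == "LOCATION",
        "lower&O": token[0].islower() and label == "O",
    }


# Hand-picked weights for illustration only; not learned from data.
WEIGHTS = {
    "cap&PERSON": 1.0,
    "cap&LOCATION": 1.0,
    "prev=Mr&PERSON": 2.0,
    "prev=in&LOCATION": 2.0,
    "lower&O": 3.0,
}


def label_distribution(token, prev_token):
    """Step 2: p(y | x) proportional to exp(sum_i w_i * f_i(x, y))."""
    scores = {}
    for label in LABELS:
        active = features(token, prev_token, label)
        total = sum(WEIGHTS[name] for name, on in active.items() if on)
        scores[label] = math.exp(total)
    z = sum(scores.values())  # normalizing constant
    return {label: score / z for label, score in scores.items()}


def classify(token, prev_token):
    """Step 4: decision rule, pick the most probable label."""
    dist = label_distribution(token, prev_token)
    return max(dist, key=dist.get)


print(classify("Smith", "Mr"))
print(classify("London", "in"))
```

The sketch classifies each token independently from its own features; practical NER systems usually add richer context features and a sequence-level decoding step.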
  9. Template Filling
     A template is a frame (of a record structure), consisting of slots and fillers. A template denotes an event or a semantic concept.
     After extracting NEs, relations and events, IE fills an appropriate template.
  10. Template filling techniques
      Two common approaches to template filling:
      – Statistical approach
      – Finite-state cascade approach
  11. Statistical Approach
      Again, by using a sequence labeling method:
      – Label sequences of tokens as potential fillers for a particular slot
      – Train separate sequence classifiers for each slot
      – Slots are filled with the text segments identified by each slot’s corresponding classifier
  12. Statistical Approach
      – Resolve multiple labels assigned to the same or overlapping text segments by adding weights (heuristic confidence) to the slots
      – State-of-the-art performance: F1-measure of 75 to 98
      However, these methods have been shown to be effective only for small, homogeneous data.
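The conflict-resolution step above can be sketched as follows, assuming each slot's classifier has already proposed token spans with a confidence score. The `Candidate` type, the slot names, the example sentence and the confidence values are all invented for illustration, and the greedy highest-confidence-first rule is just one plausible heuristic, not necessarily the one used by the systems the slides refer to.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    slot: str          # slot proposed by that slot's classifier
    start: int         # token offset, inclusive
    end: int           # token offset, exclusive
    text: str
    confidence: float  # heuristic weight attached to the proposal


def overlaps(a, b):
    """True if the two token spans share at least one token."""
    return a.start < b.end and b.start < a.end


def fill_template(candidates):
    """Greedy resolution: take candidates by descending confidence and skip
    any that overlap a kept span or target a slot that is already filled."""
    kept = []
    for cand in sorted(candidates, key=lambda c: c.confidence, reverse=True):
        if any(overlaps(cand, k) or cand.slot == k.slot for k in kept):
            continue
        kept.append(cand)
    return {c.slot: c.text for c in kept}


# Invented sentence: "John Smith became chairman of Acme Corp"
candidates = [
    Candidate("person", 0, 2, "John Smith", 0.9),
    Candidate("organization", 1, 2, "Smith", 0.4),  # conflicts with "person"
    Candidate("organization", 5, 7, "Acme Corp", 0.8),
    Candidate("position", 3, 4, "chairman", 0.7),
]

print(fill_template(candidates))
```

Filling each slot at most once keeps the sketch short; templates with multi-valued slots would need a list per slot instead.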
  13. Finite-State Template-Filling Systems
      Message Understanding Conferences (MUC): the genesis of IE
      DARPA funded significant efforts in IE in the early to mid 1990s. MUC was an annual event/competition where results were presented.
  14. Finite-State Template-Filling Systems
      – Focused on extracting information from news articles:
        • Terrorist events (MUC-4, 1992)
        • Industrial joint ventures (MUC-5, 1993)
        • Company management changes
      – Information extraction was of particular interest to the intelligence community (CIA, NSA). (Note: early 1990s)
  15. Applications
      Information extraction has a wide range of applications:
      – Search engines
      – Biomedical field
      – Customer profile analysis
      – Trend analysis
      – Information filtering and routing
      – Event tracking
      – News story classification
  16. Conclusion
      In this presentation we covered:
      – Goals of information extraction
      – Entity extraction: the maximum entropy method
      – Template filling
      – Applications
  17. Visit more self-help tutorials
      Pick a tutorial of your choice and browse through it at your own pace.
      The tutorials section is free and self-guiding, and does not include any additional support.
      Visit us at www.dataminingtools.net