Text Mining: Information Extraction

Transcript

  • 1. Text Mining: Information Extraction
  • 2. Goals of information extraction
    “Processing of natural language texts for the extraction of relevant content pieces” (Martí and Castellón, 2000)
    Raw texts => structured databases
    Template filling
    Improving search engines
    Auxiliary tool for other language applications
  • 3. Named Entity Recognition
    Named entities are proper names in texts, i.e. the names of persons, organizations, locations, times and quantities.
    NER is the task of processing a text and identifying named entities.
  • 4. Why is Named Entity Recognition difficult?
    - Names are too numerous to include in dictionaries
    - Variations: e.g. John Smith, Mr Smith, John
    - Changing constantly: new names are invented all the time, producing unknown words
    - Ambiguity: for some proper nouns it is hard to determine the category
  • 5. Example
    Delimit the named entities in a text and tag them with NE categories:
    - entity names - ENAMEX
    - temporal expressions - TIMEX
    - number expressions - NUMEX
    Subcategories of tags are captured by an SGML tag attribute called TYPE
  • 6. Example
    Original text:
    The U.K. satellite television broadcaster said its subscriber base grew 17.5 percent during the past year to 5.35 million
    Tagged text:
    The <ENAMEX TYPE="LOCATION">U.K.</ENAMEX> satellite television broadcaster said its subscriber base grew <NUMEX TYPE="PERCENT">17.5 percent</NUMEX> during <TIMEX TYPE="DATE">the past year</TIMEX> to 5.35 million
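As an aside (not part of the original slides), tagged output like the one above can be approximated with an off-the-shelf NER tool. The sketch below assumes spaCy and its en_core_web_sm model are installed; the mapping from spaCy's entity labels to the MUC-style ENAMEX/TIMEX/NUMEX tags is an illustrative choice of ours, and the entities actually recognized depend on the model.

```python
# Minimal sketch: tag named entities in MUC style (ENAMEX/TIMEX/NUMEX) using spaCy.
# Assumes: `pip install spacy` and `python -m spacy download en_core_web_sm`.
import spacy

# Illustrative mapping from spaCy entity labels to MUC-style tags and TYPE values.
MUC_TAGS = {
    "GPE": ("ENAMEX", "LOCATION"),
    "PERSON": ("ENAMEX", "PERSON"),
    "ORG": ("ENAMEX", "ORGANIZATION"),
    "PERCENT": ("NUMEX", "PERCENT"),
    "MONEY": ("NUMEX", "MONEY"),
    "DATE": ("TIMEX", "DATE"),
    "TIME": ("TIMEX", "TIME"),
}

def tag_muc_style(text: str, nlp) -> str:
    """Wrap recognized entities in SGML-style tags, e.g. <ENAMEX TYPE="LOCATION">U.K.</ENAMEX>."""
    doc = nlp(text)
    out, last = [], 0
    for ent in doc.ents:
        if ent.label_ not in MUC_TAGS:
            continue
        tag, type_ = MUC_TAGS[ent.label_]
        out.append(text[last:ent.start_char])
        out.append(f'<{tag} TYPE="{type_}">{ent.text}</{tag}>')
        last = ent.end_char
    out.append(text[last:])
    return "".join(out)

if __name__ == "__main__":
    nlp = spacy.load("en_core_web_sm")
    sentence = ("The U.K. satellite television broadcaster said its subscriber base "
                "grew 17.5 percent during the past year to 5.35 million")
    print(tag_muc_style(sentence, nlp))
```

Run on the example sentence, this should wrap spans such as U.K., 17.5 percent and the past year much like the slide, though the exact spans and labels depend on the model used.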
  • 7. Maximum Entropy for NER
    Use the probability distribution that has maximum entropy, i.e. that is maximally uncertain, among those consistent with the observed evidence:
    - P = {models consistent with evidence}
    - H(p) = entropy of p
    - p_ME = argmax_{p ∈ P} H(p)
  • 8. Maximum Entropy for NER
    Given a set of answer candidates:
    - Model the probability
    - Define feature functions
    - Decision rule
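To make slide 8 concrete, here is a small, self-contained sketch of the log-linear form the maximum-entropy solution takes: binary feature functions f_i(x, c) over a (context, candidate tag) pair, weights lambda_i, a softmax-normalized conditional probability p(c | x) = exp(sum_i lambda_i f_i(x, c)) / Z(x), and an argmax decision rule. The feature functions and weights below are toy values chosen for illustration; in a real NER system the weights are estimated from annotated data (e.g. with GIS/IIS or L-BFGS).

```python
# Toy sketch of a maximum-entropy (log-linear) tagging decision for one token.
# p(c | x) = exp(sum_i lambda_i * f_i(x, c)) / Z(x);  decision: argmax_c p(c | x)
import math

CLASSES = ["PERSON", "LOCATION", "OTHER"]

# Illustrative binary feature functions over (context x, candidate class c).
def f_capitalized_person(x, c):
    return 1.0 if x["token"][0].isupper() and c == "PERSON" else 0.0

def f_prev_is_title_person(x, c):
    return 1.0 if x["prev"].lower() in {"mr", "mr.", "mrs", "dr"} and c == "PERSON" else 0.0

def f_in_gazetteer_location(x, c):
    return 1.0 if x["token"] in {"U.K.", "London", "Paris"} and c == "LOCATION" else 0.0

FEATURES = [f_capitalized_person, f_prev_is_title_person, f_in_gazetteer_location]
LAMBDAS = [0.8, 1.5, 2.0]  # toy weights; in practice these are learned from training data

def maxent_probs(x):
    """Conditional class distribution p(c | x) under the log-linear model."""
    scores = {c: math.exp(sum(l * f(x, c) for l, f in zip(LAMBDAS, FEATURES))) for c in CLASSES}
    z = sum(scores.values())  # partition function Z(x)
    return {c: s / z for c, s in scores.items()}

def decide(x):
    """Decision rule: pick the class with the highest conditional probability."""
    probs = maxent_probs(x)
    return max(probs, key=probs.get), probs

if __name__ == "__main__":
    context = {"token": "Smith", "prev": "Mr."}
    label, probs = decide(context)
    print(label, {c: round(p, 3) for c, p in probs.items()})
```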
  • 9. Template Filling
    A template is a frame (a record structure) consisting of slots and fillers. A template denotes an event or a semantic concept.
    After extracting NEs, relations and events, IE fills an appropriate template.
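As an illustration (not from the slides), a filled template can be represented as a simple record whose fields are the slots; the slot names below are hypothetical, loosely modeled on the MUC-style event templates mentioned later in the deck.

```python
# Illustrative template (frame): slots as fields, fillers as values.
# Slot names are hypothetical, loosely modeled on a MUC-style event template.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class EventTemplate:
    event_type: Optional[str] = None
    actor: Optional[str] = None            # e.g. an ENAMEX PERSON/ORGANIZATION filler
    location: Optional[str] = None         # e.g. an ENAMEX LOCATION filler
    date: Optional[str] = None             # e.g. a TIMEX DATE filler
    instruments: List[str] = field(default_factory=list)

# After NER, relation and event extraction, IE fills the slots:
t = EventTemplate(event_type="joint venture announcement", actor="ACME Corp.", location="U.K.")
print(t)
```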
  • 10. Template filling techniques
    Two common approaches for template filling:
    - Statistical approach
    - Finite-state cascade approach
  • 11. Statistical Approach
    Again, by using a sequence labeling method:
    - Label sequences of tokens as potential fillers for a particular slot
    - Train separate sequence classifiers for each slot
    - Slots are filled with the text segments identified by each slot's corresponding classifier
  • 12. Statistical Approach
    - Resolve multiple labels assigned to the same or overlapping text segments by adding weights (heuristic confidence) to the slots
    - State-of-the-art performance: F1-measure of 75 to 98
    However, these methods have been shown to be effective only for small, homogeneous data
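A minimal sketch of the "separate classifier per slot" idea from the two slides above, using scikit-learn (an assumption; any learner would do). It trains a single token-level classifier for a hypothetical location slot on tiny invented BIO-labelled examples; a production system would use a proper sequence model (e.g. HMM/CRF), one such classifier per slot, and then resolve overlapping labels by confidence as the slide describes.

```python
# Sketch: one token-level classifier per slot (here just a "location" slot),
# trained on BIO labels. Assumes scikit-learn; features and data are toy placeholders.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def token_features(tokens, i):
    """Simple per-token features: surface form, shape, and immediate context."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_title": tok.istitle(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
    }

# Toy training data: (sentence tokens, BIO labels for the "location" slot).
train = [
    (["The", "attack", "occurred", "in", "San", "Salvador", "."],
     ["O", "O", "O", "O", "B-LOC", "I-LOC", "O"]),
    (["Fighting", "continued", "in", "Kuwait", "yesterday", "."],
     ["O", "O", "O", "B-LOC", "O", "O"]),
]

X = [token_features(toks, i) for toks, labs in train for i in range(len(toks))]
y = [lab for _, labs in train for lab in labs]

# One such classifier is trained for each slot in the template.
location_clf = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
location_clf.fit(X, y)

test = ["Explosions", "were", "reported", "in", "Bogota", "."]
pred = location_clf.predict([token_features(test, i) for i in range(len(test))])
print(list(zip(test, pred)))
```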
  • 13. Finite-State Template-Filling Systems
    Message Understanding Conferences (MUC) - the genesis of IE
    DARPA funded significant efforts in IE in the early to mid 1990s. MUC was an annual event/competition where results were presented.
  • 14. Finite-State Template-Filling Systems
    - Focused on extracting information from news articles:
      • Terrorist events (MUC-4, 1992)
      • Industrial joint ventures (MUC-5, 1993)
      • Company management changes
    - Information extraction was of particular interest to the intelligence community (CIA, NSA) (note: early 1990s)
  • 15. Applications
    IE has a wide range of applications in:
    - Search engines
    - The biomedical field
    - Customer profile analysis
    - Trend analysis
    - Information filtering and routing
    - Event tracking
    - News story classification
  • 16. Conclusion
    In this presentation we covered:
    - Goals of information extraction
    - Entity extraction: the maximum entropy method
    - Template filling
    - Applications
  • 17. Visit more self help tutorials
    Pick a tutorial of your choice and browse through it at your own pace.
    The tutorials section is free, self-guiding and will not involve any additional support.
    Visit us at www.dataminingtools.net
