"Cross Document Coreference" - Slides at the Seminar on EXtreme Information Extraction. 25. March 2009. -- University of Trento. Italy.

Transcript of "Cross Document Coreference"

  1. Cross document coreference. Kepa Joseba Rodríguez. Seminar on EXtreme Information Extraction, Rovereto, 25 March 2009.
  2. Outline
     - Background: intra-document vs. cross-document coreference tasks; overview of a system.
     - Unsupervised personal name disambiguation: generation of extraction patterns; the algorithm of (Ravichandran & Hovy, 2002); generation of vectors and clustering; evaluation.
     - Optional: disambiguation of geographic names.
     - Optional: clustering of news.
  3. The task of CDC
     Cross-document coreference occurs when the same person, place, event or concept is discussed in more than one text source (Bagga & Baldwin, 1998).
  4. Intra-document vs. cross-document coreference
     - There are substantial differences between intra-document and cross-document coreference resolution: within a document there is a consistency that we cannot expect across documents, and most underlying principles of linguistics and discourse context cannot be applied across documents.
     - There are also links between the two tasks: resolving intra-document coreference helps the resolution of cross-document coreference, and resolving cross-document coreference can help the resolution of intra-document coreference (Haghighi & Klein, 2007).
  5. (Slide contains only a figure; no text content.)
  6. (Slide contains only a figure; no text content.)
  7. Unsupervised personal name disambiguation (1)
     A personal name can refer to thousands of different entities in the real world. For the name Jim Clark, Google returns 76,000 different web sites (Mann & Yarowsky, 2003):
       1  Jim Clark   Race car driver from Scotland
       2  Jim Clark   Clock-maker from Colorado
       3  Jim Clark   Film editor
       4  Jim Clark   Netscape founder
       5  Jim Clark   Disaster survivor
       6  Jim Clark   Car salesman in Kansas
       ...
     Each entry has features that may help to disambiguate the entity.
  8. Unsupervised personal name disambiguation (2)
     - Earlier approaches to personal name disambiguation represent the context with vectors, distinguishing instances of an identical name by potentially indicative words: Jim Clark - car; Jim Clark - film; Jim Clark - Netscape; Jim Clark - Colorado.
     - For personal names there is more precise information available than for other kinds of entities.
  9. Unsupervised personal name disambiguation (3)
     Information extraction techniques can add categorial information such as age/date of birth, nationality, profession, and a space of associated names. This information can be used:
     - as a vector-based bag-of-words model, or
     - as extracted specific types of association, such as familial relationships (son, wife, married to, ...) and employment relationships (manager of, ...).
  10. Generation of extraction patterns
     - Patterns are automatically generated from data.
     - Good performance is possible without a parser or other language-specific resources.
     - Automatic generation is more flexible when applied to new languages.
     - Potentially higher precision and recall than hand-written patterns.
  11. (R & H) algorithm for pattern extraction (1)
     - Select terms for the query (e.g. +Mozart, +1756).
     - Search a document collection for documents that contain both terms.
     - Extract the sentences in which both terms occur.
     - Search for the longest matches between sentences. For the sentences:
         The great composer Mozart (1756-1791) achieved fame at a young age.
         Mozart (1756-1791) was a genius.
         The whole world would always be indebted to the great music of Mozart (1756-1791).
       the longest matching substring is "Mozart (1756-1791)".
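The longest-match step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the `generalise()` helper and the anchor substitution are assumptions about how the matched substring becomes a pattern.

```python
# Sketch of the longest-match step of (Ravichandran & Hovy, 2002):
# fold the pairwise longest common substring over all sentences that
# contain both query terms, then generalise the anchors into slots.
from difflib import SequenceMatcher

def longest_common_substring(a: str, b: str) -> str:
    """Longest contiguous substring shared by a and b."""
    m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]

def longest_match(sentences):
    """Fold the pairwise longest common substring over all sentences."""
    result = sentences[0]
    for s in sentences[1:]:
        result = longest_common_substring(result, s)
    return result

def generalise(substring: str, name: str, answer: str) -> str:
    """Replace the concrete anchors by <NAME>/<ANSWER> slots (illustrative)."""
    return substring.replace(name, "<NAME>").replace(answer, "<ANSWER>")

sentences = [
    "The great composer Mozart (1756-1791) achieved fame at a young age.",
    "Mozart (1756-1791) was a genius.",
    "The whole world would always be indebted to the great music of Mozart (1756-1791).",
]
common = longest_match(sentences)               # "Mozart (1756-1791)"
pattern = generalise(common, "Mozart", "1756")  # "<NAME> (<ANSWER>-1791)"
```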
  12. (R & H) algorithm for pattern extraction (2)
     Repeat the same procedure for other term pairs such as +Newton +1642, +Gandhi +1869, ... For BIRTHDATE the algorithm produces output such as:
       born in <ANSWER>, <NAME>
       <NAME> was born in <ANSWER>
       <NAME> (<ANSWER> -
       <NAME> (<ANSWER> -)
       ...
  13. (R & H) algorithm to calculate precision (1)
     - Build a collection of documents that contain the question term (the name): query a search engine using only the question term and download the top 1000 web documents.
     - Extract the sentences that contain the question term.
     - For each extracted pattern, check in the extracted sentences:
       - the presence of the pattern with the <ANSWER> tag matched by any word (Ca), e.g. Mozart was born in <WORD>;
       - the presence of the pattern with the <ANSWER> tag matched by the correct term (Co), e.g. Mozart was born in 1756.
     - Precision: P = Co / Ca
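The precision computation P = Co / Ca can be sketched as follows. The word-level regex fill-in for `<ANSWER>` and the toy sentences are illustrative assumptions, not from the paper.

```python
import re

def pattern_precision(pattern, name, sentences, correct_answer):
    """Precision of one surface pattern: P = Co / Ca, where Ca counts
    sentences in which <ANSWER> is matched by any word and Co those in
    which it is matched by the correct term."""
    rx = re.escape(pattern)
    rx = rx.replace(re.escape("<NAME>"), re.escape(name))
    rx = rx.replace(re.escape("<ANSWER>"), r"(\w+)")  # any single word
    compiled = re.compile(rx)
    ca = co = 0
    for s in sentences:
        m = compiled.search(s)
        if m:
            ca += 1
            if m.group(1) == correct_answer:
                co += 1
    return co / ca if ca else 0.0

sentences = [
    "Mozart was born in 1756 in Salzburg.",
    "Mozart was born in Salzburg.",
    "Mozart was born in 1756.",
    "Mozart composed his first works as a child.",
]
p = pattern_precision("<NAME> was born in <ANSWER>", "Mozart", sentences, "1756")
# Three sentences match the pattern, two with the correct year: p = 2/3.
```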
  14. (R & H) algorithm to calculate precision (2)
     Example: precision of the extracted patterns for BIRTHDATE.
       1.00  <NAME> (<ANSWER> -)
       0.85  <NAME> was born on <ANSWER>
       0.60  <NAME> was born in <ANSWER>
       0.59  <NAME> was born <ANSWER>
       0.53  <ANSWER> <NAME> was born
  15. Unsupervised clustering (Mann & Yarowsky, 2003)
     - Clustering method: bottom-up centroid agglomerative clustering.
     - Each document is represented by a vector of automatically extracted features.
     - The two most similar vectors are merged to produce a new cluster.
     - The new cluster is represented by a vector equal to the centroid of the clustered vectors.
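The clustering loop above can be sketched as follows: repeatedly merge the pair of clusters with the most similar centroids (by cosine) until no pair exceeds a similarity threshold. The stopping threshold and the toy vectors are illustrative assumptions.

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def agglomerative_centroid(vectors, threshold=0.5):
    """Bottom-up centroid agglomerative clustering: merge the two most
    similar clusters (cosine of their centroids) until no pair is more
    similar than the threshold. Returns lists of document indices."""
    clusters = [[i] for i in range(len(vectors))]
    cents = [list(v) for v in vectors]
    while len(clusters) > 1:
        best, bi, bj = -1.0, 0, 0
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(cents[i], cents[j])
                if sim > best:
                    best, bi, bj = sim, i, j
        if best < threshold:
            break
        clusters[bi].extend(clusters.pop(bj))
        cents.pop(bj)
        cents[bi] = centroid([vectors[k] for k in clusters[bi]])
    return clusters

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clusters = agglomerative_centroid(docs, threshold=0.5)
# With this toy data the result is two clusters: {0, 1} and {2, 3}.
```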
  16. Cluster refactoring
     Unsupervised agglomerative clustering can lead to problems:
     - the most similar pages are clustered at the beginning of the process;
     - the less similar pages are added as stragglers at the top levels of the cluster tree;
     - the top-level clusters are less discriminative than the clusters at the bottom of the tree.
     Refactoring: clustering is stopped when a percentage of the documents has been classified and the clusters have reached a given size; the remaining documents are assigned to the clusters with the closest distance measure.
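The final assignment step of the refactoring can be sketched as below: each leftover document goes to the cluster whose centroid is most similar. Cosine is used as the distance measure here; the slide does not specify which measure the refactoring uses, so this is an assumption.

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def assign_stragglers(centroids, stragglers):
    """Refactoring step: each document left unclustered when clustering
    stopped is assigned to the cluster whose centroid is closest
    (here: highest cosine similarity)."""
    return [max(range(len(centroids)), key=lambda i: cosine(centroids[i], s))
            for s in stragglers]

centroids = [[1.0, 0.0], [0.0, 1.0]]
stragglers = [[0.9, 0.2], [0.1, 0.8]]
assignments = assign_stragglers(centroids, stragglers)
# assignments == [0, 1]: each straggler joins its nearest cluster.
```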
  17. Methods for vector generation
     - Baseline.
     - Techniques of selective term weighting: term frequency / inverse document frequency (tf-idf) and mutual information (mi).
     - Biographical features (feat).
     - Extended biographical features (extfeat).
     - Cluster refactoring.
  18. Baseline
     The term vectors are composed of proper nouns only. The similarity between vectors is computed using standard cosine similarity:
       cos(a, b) = (a · b) / (||a|| × ||b||)
  19. TF-IDF
     TF-IDF weight (term frequency - inverse document frequency): a measure used to evaluate how important a word is to a document in a collection. The importance increases proportionally to the number of times the word appears in the document, but is offset by the frequency of the word in the collection:
       tf_ij = n_ij / Σ_k n_kj
       idf_i = log(|D| / |{d : t_i ∈ d}|)
       tfidf_ij = tf_ij × idf_i
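The three definitions above can be sketched directly. Note the `log` in the idf term follows the standard formulation; the slide's garbled formula shows only the plain ratio, so treat that detail as an assumption.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """tf-idf per document: tf_ij = n_ij / sum_k n_kj and
    idf_i = log(|D| / |{d : t_i in d}|); the log is the standard
    formulation and an assumption here."""
    df = Counter()                    # document frequency per term
    for doc in docs:
        df.update(set(doc))
    n_docs = len(docs)
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())
        vectors.append({t: (c / total) * math.log(n_docs / df[t])
                        for t, c in counts.items()})
    return vectors

docs = [["clark", "race", "driver"],
        ["clark", "film", "editor"],
        ["clark", "netscape"]]
vecs = tfidf_vectors(docs)
# "clark" occurs in every document, so idf = log(3/3) = 0 and its weight is 0.
```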
  20. Mutual Information
     Mutual information: a measure of the mutual dependence between random variables. Given a document collection c, for each word w we compute
       I(w; c) = p(w|c) / p(w)
     Words that appear more than 20 times in the collection and have I(w; c) > 10 are added to the document's feature vector with a weight equal to log(I(w; c)).
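The selection rule above can be sketched as follows. Estimating p(w) from a separate background corpus of counts is an assumption; the slide does not say where p(w) comes from.

```python
import math

def mi_weights(collection_counts, background_counts,
               min_count=20, min_ratio=10.0):
    """Mutual-information weighting as on the slide: keep words that occur
    more than min_count times in the collection and have
    I(w; c) = p(w|c) / p(w) > min_ratio; weight them by log(I(w; c))."""
    coll_total = sum(collection_counts.values())
    bg_total = sum(background_counts.values())
    weights = {}
    for word, n in collection_counts.items():
        if n <= min_count:
            continue
        ratio = (n / coll_total) / (background_counts.get(word, 1) / bg_total)
        if ratio > min_ratio:
            weights[word] = math.log(ratio)
    return weights

collection = {"trumpeter": 30, "the": 500}
background = {"trumpeter": 40, "the": 100000}
weights = mi_weights(collection, background)
# "trumpeter" is far more frequent in the collection than overall,
# so it is kept; "the" is not, so it is dropped.
```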
  21. Extracted biographical features (feat)
     - Biographical features are extracted with the algorithm of (Ravichandran & Hovy, 2002).
     - Biographical information is used to link documents: documents that contain similar extracted features have the same referent.
     - The extracted biographical features also help to improve disambiguation: documents with different extracted features belong to different clusters.
  22. Extracted biographical features (feat)
       Type         Extracted feature
       birth place  Midland (4), Texas (3), Alton (1), Illinois (1)
       birth year   1926 (9), 1967 (3), 1973 (2), 1947 (1), 1958 (1), 1969 (1)
       occupation   actor (11), trumpeter (9), heavyweight (2), ...
       spouse       Demi Moore (1)
     Table: feat features extracted for the Davis/Harrelson pseudoname.
  23. Extended biographical features (extfeat)
     In this method the system gives higher weight to words that appear as pattern fillers. Example: the system recognises 1756 as a birth year using surface patterns; when 1756 is then found in a context outside an extraction pattern, it is given a higher weight and added to the document vector as a potential biographical feature. In the experiment this was applied to words that appear more than a threshold of 4 times; the weight is the log of the number of times the word was found as an extracted feature.
  24.  word        w(mi)  w(extfeat)
       adderley    3.50   0
       snipes      5.16   0
       coltrane    5.06   0
       bitches     4.99   0
       danson      4.97   0
       hemp        4.97   0
       mullally    4.95   0
       porgy       4.94   0
       remastered  4.92   0
       actor       3.50   2.40
       1926        0      2.20
       trumpeter   0      2.20
       midland     0      1.39
     Table: the 10 words with the highest mutual information with the document collection, and all extfeat words, for the Davis/Harrelson pseudoname.
  25. Experiments: the data set
     The data set consisted of web pages collected using Google for a set of target personal names:
     - not more than 1000 pages for each target name;
     - no requirement that the web page be focused on the name;
     - no minimum number of occurrences of the name in the page.
  26. Evaluation on pseudonames
     Pseudonames are created as follows: take the retrieval results for two different people and replace all references to each name by a unique shared pseudoname. The resulting collection consists of documents that are ambiguous as to whom they are talking about; the aim of the clustering is to separate the introduced pseudoname again.
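The pseudoname construction is simple enough to sketch directly; the pseudoname string and the toy documents are illustrative.

```python
def make_pseudoname_set(docs_a, name_a, docs_b, name_b, pseudo="AB Pseudo"):
    """Build a pseudoname evaluation set: take retrieval results for two
    different people, replace every occurrence of each real name by one
    shared pseudoname, and keep the real name as the gold label."""
    labelled = [(doc.replace(name_a, pseudo), name_a) for doc in docs_a]
    labelled += [(doc.replace(name_b, pseudo), name_b) for doc in docs_b]
    return labelled

data = make_pseudoname_set(
    ["Miles Davis played the trumpet."], "Miles Davis",
    ["Woody Harrelson starred in the film."], "Woody Harrelson")
# Both documents now mention only "AB Pseudo"; the labels record who it was.
```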
  27. Evaluation on pseudonames
     - Selected a set of 8 different people: historical figures, figures from media and pop culture, and non-famous people with similar backgrounds (birth date, profession, etc.).
     - Submitted Google queries and retrieved up to 1000 pages about each person; selected a maximum of 100 pages per person.
     - Evaluated two granularities of feature extraction: high-precision rules extracting occupation, birthday, spouse, birth location and school, and high-recall rules extracting the same terms plus parent/child relationships.
  28. Evaluation on pseudonames
       Method       Accuracy
       nnp          79.7
       nnp + tfidf  79.7
       nnp + mi     82.9
     Table: disambiguation accuracy of different clustering methods.
  29. Evaluation on pseudonames
                               feature set size
       extracted features      small   large
       nnp + feat              82.5    85.1
       nnp + feat + extfeat    82.0    84.6
       nnp + feat + mi         85.6    85.3
       nnp + feat + tfidf      82.9    86.4
     Table: disambiguation accuracy of different clustering methods and different sizes of feature sets.
  30. Evaluation on naturally ambiguous names
     - Start with a selection of 4 polysemous names with an average of 60 different instances each, manually annotated with name-ID numbers.
     - The occurrences of each name are classified into 3 clusters: the 2 automatically derived first-pass majority seed sets, and the residual set for "other uses".
       Weighting method     Precision  Recall
       TF-IDF               .81        .70
       Mutual Information   .88        .73
  31. Conclusions
     - The results of the clustering are improved by learning and using automatically extracted biographic information, and by the use of weighting techniques.
     - The produced clusters can be used as seeds for disambiguating further entities.
  32. Disambiguating geographic names in a digital library
  33. Outline
     - Task of the Perseus project.
     - Problems of the task domain.
     - External knowledge sources.
     - Identification and classification of proper names.
     - First disambiguation of geographical names.
     - Simple characterisation of the document context.
     - Final disambiguation.
  34. Task of the Perseus project
     The Perseus Project (Smith & Crane, 2002) is a library of historical data in the humanities, from ancient Greece to 19th-century America, with over a million toponym references. The task consists of:
     - identifying geographic names;
     - linking the names to information about location, type, dates of occupation, relation to other places, inhabitants, etc.;
     - linking the names to a position on a map.
  35. Problems of the domain
     - The introduction of an entity by an unambiguous mention is less common than in newspaper articles.
     - There are great differences between the documents: different document sizes, lack of standard structures, different registers and dialects, historical variations (borders, names associated with different political systems, etc.), and long-distance anaphora.
     - The resolution process is more similar to cross-document coreference resolution on the web than on corpora.
  36. Knowledge sources
     The system uses external knowledge sources. The most important are:
     - the Getty Thesaurus of Geographic Names;
     - Cruchley's gazetteer of London, which was built for geocoding;
     - lists of authors of the entries in the Dictionary of National Biography, which help to add additional information to the documents.
  37. Identification and classification of proper names
     The task of identifying proper names and classifying them for the first time is done using simple heuristics:
     - capitalisation and punctuation conventions;
     - markup added by the editor of the document;
     - language-specific honorifics (Mr., Dr., etc.).
     Generic topographic labels are taken as "moderate" evidence that the name may be geographic (Rocky Mountains, Charles River). Stand-alone names are preferably classified as personal names: John (personal name vs. village in Louisiana or Virginia).
  38. Disambiguation (1)
     Based on local context:
     - Explicit disambiguating tags placed after the names, e.g. "Lancaster, PA", "Vienna, Austria", post codes, etc.
     - If an ambiguous place name is mentioned together with other place names, the most likely interpretation is one that is geographically near the others; e.g. if "Philadelphia" and "Harrisburg" appear in the same paragraph, the preferred interpretation of "Lancaster" will be the town in Pennsylvania, not the town in England or Arizona.
  39. Disambiguation (2)
     Based on document context:
     - Preponderance of geographic references in the entire document.
     - For short documents, such as newspaper articles, document context and local context are considered the same.
     Based on world knowledge:
     - Captured from gazetteers and other reference works.
     - Facts about a place such as political coordinates, size, etc.
  40. Simple characterisation of the document context
     - Aggregate all possible locations for all the toponyms in the document onto a one-by-one degree grid.
     - Assign weights for the number of mentions of each toponym.
     - Prune the grid based on general world knowledge.
     - Compute the centroid of this weighted map, and the standard deviation of the distance of the points from this centroid.
     - Discard points more than two times the standard deviation away from the centroid, and calculate a new centroid.
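The centroid-and-pruning steps above can be sketched as follows. Euclidean distance in degrees is a simplification of real geographic distance, and the sample coordinates are illustrative.

```python
from math import hypot, sqrt

def weighted_centroid(points):
    """points: (lat, lon, weight) triples; weight = number of mentions."""
    total = sum(w for _, _, w in points)
    return (sum(lat * w for lat, _, w in points) / total,
            sum(lon * w for _, lon, w in points) / total)

def document_centroid(points):
    """Compute the weighted centroid of all candidate locations, discard
    points more than two standard deviations of distance away from it,
    and recompute the centroid on the remaining points."""
    c = weighted_centroid(points)
    dists = [hypot(lat - c[0], lon - c[1]) for lat, lon, _ in points]
    mean = sum(dists) / len(dists)
    std = sqrt(sum((d - mean) ** 2 for d in dists) / len(dists))
    kept = [p for p, d in zip(points, dists) if d <= 2 * std]
    return weighted_centroid(kept)

# Toponym candidates around Pennsylvania plus one outlier in England:
points = [(40.0, -76.3, 3), (40.3, -76.9, 2), (39.9, -75.2, 1), (52.2, 0.1, 1)]
center = document_centroid(points)
# The outlier is pruned; the centroid lands near (40.1, -76.3).
```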
  41. Final disambiguation
     - The local context of a toponym is represented by a moving window of the four previous and four following toponyms in the text; only unambiguous or already disambiguated toponyms are considered.
     - Each possible interpretation of the ambiguous toponym is scored using geographical proximity to the toponyms around it, proximity to the centroid for the document, and relative importance.
     - The interpretation with the highest score is selected.
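The scoring of interpretations can be sketched like this. The closeness function, the weights, and the candidate data are illustrative assumptions; the paper does not publish the exact scoring formula.

```python
from math import hypot

def score(candidate, window, centroid,
          w_window=1.0, w_centroid=0.5, w_importance=0.1):
    """Score one interpretation (lat, lon, importance) of an ambiguous
    toponym by proximity to the unambiguous toponyms in the window,
    proximity to the document centroid, and relative importance."""
    lat, lon, importance = candidate
    def closeness(p):
        return 1.0 / (1.0 + hypot(lat - p[0], lon - p[1]))
    window_term = sum(closeness(t) for t in window) / max(len(window), 1)
    return (w_window * window_term
            + w_centroid * closeness(centroid)
            + w_importance * importance)

def disambiguate(candidates, window, centroid):
    """Pick the interpretation with the highest score."""
    return max(candidates, key=lambda c: score(c, window, centroid))

# "Lancaster" with Philadelphia and Harrisburg in the window:
lancaster_pa = (40.04, -76.31, 1.0)
lancaster_uk = (54.05, -2.80, 2.0)
window = [(39.95, -75.17), (40.27, -76.88)]
best = disambiguate([lancaster_pa, lancaster_uk], window, (40.1, -76.0))
# The Pennsylvania reading wins despite the lower importance prior.
```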
  42. Evaluation (1)
     The system has been evaluated using 5 hand-annotated corpora.
       Corpus         PCat   Prec   Rec    F1
       Greek          0.98   0.93   0.99   0.96
       Roman          0.99   0.91   1.00   0.95
       London         0.92   0.86   0.96   0.91
       California     0.92   0.83   0.96   0.89
       Upper Midwest  0.89   0.74   0.89   0.81
  43. Evaluation (2)
     - Categorisation performed on the Greek and Roman history texts is better than on texts about more recent topics.
     - In densely populated places we find more toponyms that are ambiguous with other names.
     - Mistakes occur where ethnonyms are used as geo-political entities (like "the Germans" in Cæsar's Gallic War).
     - Proper names are usually not inflected in English. Rules could be added by hand to correct this, but the precision of the system could decrease.
  44. Conclusions
     - Simple heuristic categorisation seems to work properly for entities that appear in certain kinds of texts.
     - The evaluation procedure is not very clear.
     - Some cases are not covered properly by the gazetteers, but the use of huge fine-grained gazetteers leads to higher recall at the cost of lower precision.
     - An alternative is the use of linguistic processing and machine-learning techniques for restricted cases and document collections.
  45. NewsExplorer: multilingual coreference resolution
  46. NewsExplorer
     NewsExplorer (Steinberger & Pouliquen, 2008) is an application that gathers and aggregates extracted information for 19 languages. Each entity is displayed on a dedicated web page, where the user gets:
     - a list of the latest news clusters in which the entity has been mentioned;
     - a list of other entities found in the same clusters;
     - titles and other phrases describing the entity;
     - quotations by the entity or about it;
     - a photograph, if available;
     - the Wikipedia page about the entity, if available.
  47. Text analysis components of the system (1)
     - Monolingual document clustering.
     - Named entity recognition: persons, organisations, geographical locations.
     - Named entity disambiguation.
     - Quotation recognition and reference resolution for name parts.
     - Identification and mapping of name variants for the same person.
     - Topic detection and tracking.
  48. Text analysis components of the system (2)
     - Categorisation of documents according to a multilingual thesaurus.
     - Cluster similarity calculation: monolingual and across languages.
  49. Language-independent rules for geo-tagging
     Use of document context:
     - If a name can be a personal name or the name of a place, and it has been mentioned as a person earlier, the preferred reading is that it is a person.
     - If a country has been mentioned in the text and a polysemous item then appears, resolve the ambiguity in favour of a place in the mentioned country.
     - Prefer locations that are physically closer to other, unambiguous locations that have been mentioned in the context.
  50. Language-independent rules for geo-tagging
     - In case of polysemy, the most important places are preferred.
     - Ignore places that cannot be disambiguated.
     - Combine the rules, giving them different weights.
  51. Inflection and regular variations (1)
     - Hyphen/space alternations (Jean-Marie / Jean Marie).
     - Diacritic variations (Schröder / Schroder).
     - Name inversion: change of position between first and last name.
     - Typos: relatively frequent in names like Condoleezza Rice, often written as Condoleza, Condolezza, etc.
     - Simplification: Condoleezza Rice and George W. Bush are frequently simplified to Ms. Rice and President Bush.
  52. Inflection and regular variations (2)
     - Morphological declension: use of prefixes and suffixes in several languages.
     - Transliteration from other alphabets: there is no one-to-one mapping between letters, and there are different conventions.
     - Vowel variations, especially in transliterations from and into Arabic.
  53. Identification of name variants
     Some of these variants can be predicted and generated using sets of regular expressions, e.g. declension of personal names in Slovene:
       s/[aeo]?/(e|a|o|u|om|em|m|ju|jem|ja)?/
     For every frequent name in the database a pattern is generated, such as:
       Pierr(e|a|o|u|om|em|m|ju|jem|ja)?
       Gemayel(e|a|o|u|om|em|m|ju|jem|ja)?
     For cases that cannot be resolved by the regular expressions: normalise the names, translating them into a language-independent representation, and compute the edit distance between the name variant and the normalised names.
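The two steps above, pattern generation and an edit-distance fallback, can be sketched as follows. Using difflib's ratio as the normalised similarity (and lower-casing as a stand-in for full normalisation) is an assumption, not NewsExplorer's actual implementation.

```python
import re
from difflib import SequenceMatcher

SUFFIXES = "(e|a|o|u|om|em|m|ju|jem|ja)?"

def declension_pattern(name):
    """Generate a matcher for declined forms of a name, following the
    slide's substitution: drop a final a/e/o and allow the Slovene
    suffix alternatives."""
    stem = re.sub(r"[aeo]$", "", name)
    return re.compile(re.escape(stem) + SUFFIXES + "$")

def name_similarity(a, b):
    """Fallback for variants the regular expressions miss: a normalised
    similarity (difflib's ratio, related to edit distance) between
    lower-cased forms. Real normalisation would also strip diacritics."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

pattern = declension_pattern("Pierre")
# pattern matches "Pierre", "Pierrom", "Pierru", ... but not "Gemayela".
```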
  54. Document categorisation with a multilingual thesaurus (1)
     - The Eurovoc thesaurus is a hierarchically organised controlled vocabulary developed by European institutions and the national parliaments of several countries.
     - It is used in public administrations for cataloguing, search and retrieval in large multilingual collections.
     - The thesaurus consists of 6000 descriptors organised into 21 fields and, at the second level, into 127 micro-thesauri.
  55. Document categorisation with a multilingual thesaurus (2)
     - NewsExplorer produces a ranked set of words statistically related to each descriptor.
     - These sets of words were produced from a large amount of hand-annotated documents, by comparing the word frequencies of the subset of texts indexed with each descriptor against the word frequencies of the whole training corpus.
     - The model is completed with a list of stop words, to avoid irrelevant words having an impact on the categorisation task.
  56. Thanks
  57. References (1)
     - Bagga, A. and Baldwin, B. (1998). Entity-based cross-document coreferencing using the vector space model. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics.
     - Haghighi, A. and Klein, D. (2007). Unsupervised coreference resolution in a nonparametric Bayesian model. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics.
     - Mann, G.S. and Yarowsky, D. (2003). Unsupervised personal name disambiguation. In Proceedings of CoNLL.
  58. References (2)
     - Ravichandran, D. and Hovy, E. (2002). Learning surface text patterns for a question answering system. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.
     - Smith, D.A. and Crane, G. (2002). Disambiguating geographic names in a historical digital library. In Proceedings of ECDL.
     - Steinberger, R. and Pouliquen, B. (2008). NewsExplorer - combining various text analysis tools to allow multilingual news linking and exploration. Lecture notes for the lecture held at the SORIA Summer School "Cursos de Tecnologías Lingüísticas".