Unsupervised Learning of a Social Network from a Multiple-Source News Corpus
  • 1. Unsupervised Learning of Social Networks from a Multiple-Source News Corpus. Hristo Tanev, European Commission Joint Research Centre
  • 2. Introduction: Social networks provide an intuitive picture of inferred relationships between entities, such as people and organizations. Social network analysis uses social networks to identify underlying groups, communication patterns, and other information. Manual construction of a social network is a very laborious task, so algorithms for automatic detection of relations can save time and human effort.
  • 3. Introduction: We present an unsupervised methodology for automatically learning social networks. We use a multiple-source, syntactically parsed news corpus. To overcome the efficiency problems that arise when using syntactic information on real-world data, we put forward an efficient graph matching algorithm.
  • 4. Related work: Learning social networks from Friend-Of-A-Friend links (Mika 2005) or from statistical co-occurrences. Disadvantage: these approaches cannot detect the type of the relation.
  • 5. Related work: Support Vector Machines (SVMs) provide a more accurate means of relation extraction (Zelenko 2003). Disadvantages: • they require a sufficient amount of annotated data • each pair of named entities must be evaluated separately, which slows down relation extraction
  • 6. Related work: (Romano 2006) propose a generic unsupervised method for learning syntactic patterns for relation extraction. Disadvantages: • they use the Web as a training corpus, which makes learning very slow • they match each pattern against each sentence, which is not efficient when matching many templates against a big corpus
  • 7. Unsupervised learning of social networks: Our algorithm is unsupervised — it accepts as input a small number (typically one or two) of two-slot seed syntactic templates which express a certain semantic relation. The algorithm uses news clusters to learn new syntactic patterns expressing the same semantic relation. Once the patterns are learned, we apply a novel, efficient pattern-matching methodology to extract related person names from the text. The extracted relations are aggregated into a social network.
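The pipeline above (seed templates, anchor pairs, learned patterns, extraction, aggregation) can be sketched in miniature. This toy uses surface regexes in place of the paper's dependency-tree templates; the sentences, names, and the `match_templates` helper are purely illustrative, not the authors' implementation.

```python
import re
from collections import Counter

# Toy stand-in: a "template" here is a regex with two slots, X and Y.
# The real system matches two-slot dependency-parse templates instead.
SEED_TEMPLATES = [r"(?P<X>[A-Z]\w+) praised (?P<Y>[A-Z]\w+)"]

def match_templates(templates, sentences):
    """Yield (X, Y) name pairs that fill a template's slots."""
    for template in templates:
        for sentence in sentences:
            m = re.search(template, sentence)
            if m:
                yield m.group("X"), m.group("Y")

cluster = ["Bush praised Karzai in Kabul.", "Karzai met reporters."]
# Aggregate extracted pairs into a weighted edge list (the social network).
network = Counter(match_templates(SEED_TEMPLATES, cluster))
print(network)
```

In the full system the same matching step is run twice: once with the seed templates to collect anchor pairs, and again with the learned patterns to build the final network.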
  • 8. EMM news clusters: European Media Monitor downloads news from different sources around the clock. Every day 4000-5000 English-language news articles are downloaded. The news articles are grouped into topic clusters.
  • 9. Parsing the corpus: The training and test corpora consist of English-language news articles from 200 sources. Articles are parsed with MiniPar, a full dependency parser. (Example dependency tree: meet -subj-> Bush, meet -obj-> Blair, meet -in-> March)
  • 10. Learning patterns: Manually provide a very small number of seed syntactic templates which express the target relation. For example, for the relation “X supports Y” we use the syntactic patterns: X -subj-> support -obj-> Y and X -subj-> praise -obj-> Y
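One way to represent such a two-slot template is as a dependency path around a verb lemma. The tuple encoding and the `matches` helper below are an illustrative sketch under that assumption, not the paper's actual data structures:

```python
# Hypothetical encoding of a two-slot syntactic template as a
# dependency path: (slot, dep-relation, verb lemma, dep-relation, slot).
SUPPORT_TEMPLATES = [
    ("X", "subj", "support", "obj", "Y"),
    ("X", "subj", "praise", "obj", "Y"),
]

def matches(template, clause):
    """Return the (X, Y) slot fillers if the parsed clause fits, else None."""
    _, rel_x, lemma, rel_y, _ = template
    if clause["lemma"] == lemma and rel_x in clause and rel_y in clause:
        return clause[rel_x], clause[rel_y]
    return None

# A flattened stand-in for a MiniPar-style parse of
# "Bush praised the Prime Minister Hamid Karzai".
clause = {"lemma": "praise", "subj": "Bush", "obj": "Hamid Karzai"}
print(matches(SUPPORT_TEMPLATES[1], clause))
```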
  • 11. Learning patterns: Match these templates against the news clusters in the corpus. Each pair of person names filling the slots X and Y is called an anchor pair. From “Bush praised the Prime Minister Hamid Karzai”, the algorithm extracts the anchor pair (X: Bush; Y: Hamid Karzai)
  • 12. Learning patterns Normalize the anchor pairs using the information in the EMM database. After this step, the example anchor pair will become (X:George W. Bush; Y:Hamid Karzai).
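Normalization can be sketched as a lookup in a canonical-name table; the `CANONICAL` dictionary below is a toy stand-in for the EMM name database, not its actual contents or interface:

```python
# Hypothetical canonical-name table standing in for the EMM name database.
CANONICAL = {
    "Bush": "George W. Bush",
    "Karzai": "Hamid Karzai",
}

def normalize(anchor_pair):
    """Map each name in the pair to its canonical form, if one is known."""
    return tuple(CANONICAL.get(name, name) for name in anchor_pair)

print(normalize(("Bush", "Hamid Karzai")))
```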
  • 13. Learning patterns: For each extracted anchor pair, search in the same cluster for all the sentences where both names of the anchor pair occur. The assumption is that the same relation holds between the same pair of names throughout the news cluster, since all articles in it cover the same topic.
  • 14. Learning patterns: From all the sentences in which at least one anchor pair appears, learn syntactic patterns using our pattern-learning algorithm, which is similar to the General Structure Learning (GSL) algorithm described in (Szpektor 2006). Example: X subj-agree-with Y. Each pattern receives as a score the number of distinct anchor pairs which support it
  • 15. Learning patterns: Pattern selection and filtering • Filter out all templates supported by fewer than 2 anchor pairs • Take out generic patterns like “X say Y”, “X have Y”, “X is Y”, etc. using a predefined template list
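The scoring and filtering of slides 14-15 (score = number of distinct supporting anchor pairs; drop low-support and generic patterns) can be sketched as follows. The `GENERIC` set and the sample patterns are invented stand-ins for the predefined generic-template list and the learned output:

```python
# Toy stand-in for the predefined list of generic templates to discard.
GENERIC = {"X say Y", "X have Y", "X is Y"}

def select_patterns(support, min_pairs=2):
    """support: pattern -> iterable of anchor pairs that matched it.
    Score each pattern by its number of DISTINCT anchor pairs, then keep
    only patterns with enough support that are not generic."""
    scored = {p: len(set(pairs)) for p, pairs in support.items()}
    return {p: s for p, s in scored.items()
            if s >= min_pairs and p not in GENERIC}

support = {
    "X subj-agree-with Y": [("Bush", "Blair"), ("Prodi", "Chirac")],
    "X say Y": [("Bush", "Blair"), ("Putin", "Rice")],   # generic
    "X subj-back obj Y": [("Bush", "Karzai")],           # too little support
}
print(select_patterns(support))
```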
  • 16. Syntactic Network model: “Prodi met President Bush in September”; “Berlusconi met President Chirac”
  • 17. Syntactic Network model
  • 18. Adding syntactic templates
  • 19. Efficiency: The worst-case time complexity of building SyntNet is O(|w| log |w|), where |w| is the number of words in the parsed corpus. The worst-case time complexity of the syntactic matching algorithm is bounded by O((|s| + |t|) log MaxArcO), where |s| is the number of sentences in the corpus, |t| is the number of templates, and MaxArcO is the maximum number of occurrences of a SyntNet arc, i.e. the size of the largest index set of a SyntNet arc
  • 20. Evaluation schema: To evaluate our algorithm we learned syntactic patterns for “meeting” and “support” relationships between people. We evaluate how well the algorithm captures relationships between the top 33 VIPs from our database; we do not evaluate how well it captures individual relation mentions. If a specific relation (e.g. “meeting”) holds between a pair of people X and Y, it is sufficient that the algorithm finds at least one mention of this relation between X and Y
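Under this scheme, precision, recall, and F1 are computed over distinct relation pairs rather than over mentions. A minimal sketch of that computation follows; the gold and extracted pairs are invented for illustration:

```python
def pr_f1(gold_pairs, extracted_pairs):
    """Pair-level precision/recall/F1: a gold relation counts as found
    if at least one mention of it was extracted (duplicates collapse)."""
    gold, extracted = set(gold_pairs), set(extracted_pairs)
    tp = len(gold & extracted)
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("Bush", "Blair"), ("Putin", "Rice")}
extracted = [("Bush", "Blair"), ("Bush", "Blair"), ("Bush", "Olmert")]
print(pr_f1(gold, extracted))
```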
  • 21. Experiments: For paraphrase learning we used a training corpus of 98'000 English-language news articles clustered into 22'000 EMM topic clusters, published in the period 01/May/2006 – 03/Oct/2006. For testing the method, we used 125'000 English-language news articles published in the period 03/Oct/2006 – 31/Oct/2006. Reading the test corpus and the templates into memory and building SyntNet+ took 9 min 3 sec. Matching the 101 syntactic templates against the test corpus of about 1'080'000 parsed sentences took only 45 seconds. We normalized the extracted names using the EMM database
  • 22. Relationship extraction evaluation on the top 33 VIPs from the EMM DB:
        Relation   Precision   Recall   F1
        meeting    0.61        0.56     0.58
        support    0.57        0.10     0.17
        overall    0.60        0.32     0.42
  • 23. Using the social network view: We ran the PageRank algorithm on the automatically extracted “meeting” network and found the top 5 ranked people. We compared this ranking with a simple frequency-based ranking of people
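A minimal PageRank by power iteration on a toy directed "meeting" graph illustrates this ranking step; the edge list below is invented for illustration and is not the network extracted in the paper:

```python
def pagerank(edges, d=0.85, iters=50):
    """Power-iteration PageRank over a directed edge list."""
    nodes = {n for edge in edges for n in edge}
    out = {n: [y for x, y in edges if x == n] for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        nxt = {n: (1 - d) / len(nodes) for n in nodes}  # teleport mass
        for n in nodes:
            targets = out[n] or list(nodes)  # dangling node: spread evenly
            for t in targets:
                nxt[t] += d * rank[n] / len(targets)
        rank = nxt
    return rank

edges = [("Rice", "Bush"), ("Putin", "Rice"), ("Blair", "Rice"),
         ("Olmert", "Bush"), ("Bush", "Rice")]
ranks = pagerank(edges)
print(max(ranks, key=ranks.get))
```

Note how a well-connected node can outrank one with more raw mentions, which is exactly the contrast the next slide draws between PageRank and frequency ranking.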
  • 24. Comparing two people-ranking schemes:
        PageRank     Frequency
        C. Rice      G.W. Bush
        G.W. Bush    T. Blair
        V. Putin     C. Rice
        E. Olmert    N. al-Maliki
        T. Blair     S. Hussein
  • 25. Conclusions and future work: We presented an unsupervised method for learning social networks from news clusters. We presented a very efficient syntactic pattern matching algorithm. Automatically learned social networks can be used for analyst tasks. In future work we will consider more types of relations, and we will consider learning and using more abstract patterns
  • 26. THANK YOU!