Transcript

  • 1. G. Futia, F. Cairo, F. Morando, L. Leschiutta: Exploiting Linked Open Data and Natural Language Processing for Classification of Political Speech. Krems, 22nd May 2014
  • 2. 22nd May 2014 Giuseppe Futia – Politecnico di Torino 2 Introduction
    ● Our goal:
      ● assist anyone interested in the automatic categorization of political speeches in identifying unambiguously the main political trends addressed by the White House
    ● What we have to achieve our goal:
      ● TellMeFirst (http://tellmefirst.polito.it/), a topic extraction tool:
        – it leverages the DBpedia knowledge base and the English Wikipedia linguistic corpus
        – it exploits Linked Open Data (LOD) and Natural Language Processing (NLP) techniques
  • 3. DBpedia
    ● A crowd-sourced community effort to extract structured information from Wikipedia and a central interlinking hub for the Linking Open Data project.
    ● It is a suitable knowledge base for text classification (Mendes et al., 2012; Hellmann et al., 2013; Steinmetz et al., 2013)
  • 4. Why DBpedia for US political speeches?
    Comparison between the coverage of US politics and the coverage of politics of other countries.
    The coverage of politics in Wikipedia is “often very good for recent or prominent topics but is lacking on older or more obscure topics” (Brown, 2011).
  • 5. Text Categorization Approach
    ● An instance-based approach: TellMeFirst assigns target documents to classes based on a local comparison between a set of pre-classified documents and the target document itself
    ● This training set consists of all the Wikipedia paragraphs where a wikilink occurs. These paragraphs are stored in a Lucene index, where each document represents a DBpedia resource
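The grouping described on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not the TellMeFirst implementation: a plain Python dictionary stands in for the Lucene index, the function name `build_training_index` and the sample paragraphs are hypothetical, and wikilinks are assumed to use the standard `[[Target|label]]` syntax.

```python
import re
from collections import defaultdict

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]];
# group 1 captures the target page title only.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")

DBPEDIA_RESOURCE_PREFIX = "http://dbpedia.org/resource/"


def build_training_index(paragraphs):
    """Map each linked DBpedia resource to the paragraphs that link to it.

    Hypothetical stand-in for the Lucene index: one entry per resource,
    holding every paragraph in which a wikilink to that resource occurs.
    """
    index = defaultdict(list)
    for paragraph in paragraphs:
        # A set avoids storing a paragraph twice when it links the same page twice.
        targets = {m.group(1).strip().replace(" ", "_")
                   for m in WIKILINK.finditer(paragraph)}
        for target in targets:
            index[DBPEDIA_RESOURCE_PREFIX + target].append(paragraph)
    return dict(index)


# Invented sample paragraphs, for illustration only.
sample_paragraphs = [
    "[[Barack Obama|The President]] signed the "
    "[[Patient Protection and Affordable Care Act]].",
    "[[Barack Obama]] commented on the situation in [[Libya]].",
]
training_index = build_training_index(sample_paragraphs)
```

Each key plays the role of one indexed document, so a resource that is linked from many paragraphs accumulates a larger body of training text.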
  • 6. Success rate (%) of the TellMeFirst classification process on US Presidents profiles
    | | 1st topic | Within the first 2 topics | Within the first 7 topics |
    | Full text of the Presidents profiles | 95.4% | 100% | 100% |
    | President profiles without name and surname | 45.4% | 61.3% | 90.9% |
    TellMeFirst provides as output the seven most relevant topics (in the form of DBpedia URIs) of the document, sorted by relevance
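A success rate "within the first k topics" can be computed as sketched below. The function name and the toy data are hypothetical and do not reproduce the evaluation behind the figures on this slide; the sketch only assumes that each document has one expected topic and a ranked list of returned topics.

```python
def success_rate_at_k(ranked_topics, expected_topics, k):
    """Percentage of documents whose expected topic is among the first k results.

    ranked_topics: one ranked list of topic identifiers per document
    expected_topics: the expected topic identifier for each document
    """
    if len(ranked_topics) != len(expected_topics):
        raise ValueError("one expected topic is required per document")
    hits = sum(1 for topics, expected in zip(ranked_topics, expected_topics)
               if expected in topics[:k])
    return 100.0 * hits / len(expected_topics)


# Toy data (invented): three documents that should all be classified as "A".
toy_rankings = [["A", "B", "C"], ["B", "A", "C"], ["C", "B", "A"]]
toy_expected = ["A", "A", "A"]
rate_top1 = success_rate_at_k(toy_rankings, toy_expected, 1)
rate_top3 = success_rate_at_k(toy_rankings, toy_expected, 3)
```

Widening k can only keep the rate equal or raise it, which matches the left-to-right growth of the columns above.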
  • 7. whitehouse.gov
    ● 3173 videos in English were available on the White House website on the 24th of November 2013
    ● These videos are categorized according to a taxonomy not related to the subject of the speeches
    ● They need a semantic layer that points out the content of the speeches, so that questions such as “what is the First Lady talking about?” can be answered automatically
  • 8. Not just a bag-of-words tool
    Results obtained with TellMeFirst (on the left) and with TagCrowd (on the right) for «President Obama Speaks on the Affordable Care Act» http://1.usa.gov/1jR4Ky2
  • 9. Results (i)
    | Topic | Occ. | % overall | % 2013 | % 2012 | % 2011 | % 2010 | % 2009 |
    | Barack Obama | 607 | 4.88% | 5.68% | 4.52% | 5.51% | 4.45% | 3.88% |
    | Patient Protection and Affordable Care Act | 286 | 2.30% | 3.06% | 1.35% | 1.91% | 2.47% | 2.71% |
    | American Recovery and Reinvestment Act of 2009 | 278 | 2.23% | 1.09% | 1.82% | 2.88% | 2.84% | 1.88% |
    | Social Security | 272 | 2.19% | 2.58% | 1.77% | 3.54% | 1.61% | 0.78% |
    Amount and percentage of topic occurrences extracted with TellMeFirst
  • 10. Results (ii)
    ● “New Deal” (141 occurrences) is probably used as a metaphor within the political speeches of President Obama
    ● “Libya” has a value of 1.00% in 2011. This result can be related to the full-scale revolt that began on 17 February 2011 in Libya
    ● “Deepwater Horizon oil spill” reaches 1.05% in 2010. This result is related to the marine oil spill in the Gulf of Mexico that began on 20 April 2010
  • 11. Correlation among topics
  • 12. A focus on the First Lady (i)
    ● According to Michelle Obama’s page on the White House website, the First Lady “looks forward to continuing her work on the issues close to her heart”:
      ● supporting military families
      ● helping working women balance career and family
      ● encouraging national service
      ● promoting the arts and arts education
      ● fostering healthy eating and healthy living for children and families across the country
  • 13. A focus on the First Lady (ii)
    ● We tested whether TellMeFirst confirms these impressions and claims by manually selecting nine Wikipedia categories that seemed to be related to these issues
    ● We then queried the SPARQL endpoint of DBpedia to collect all the topics of these categories
    ● We then associated each topic with one or more of the nine high-level categories: these categories encompassed almost 75% of the topics
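A query of this kind might look like the sketch below. The slides do not show the query that was actually run, so everything here is an assumption: the helper name `category_topics_query` is hypothetical, the query relies on DBpedia's usual `dct:subject` and `skos:broader` properties, it descends only one level into subcategories, and the category names are taken from the table on the following slide without checking that DBpedia uses exactly those titles. The code builds the query text only; sending it to the endpoint is left out.

```python
DBPEDIA_CATEGORY_PREFIX = "http://dbpedia.org/resource/Category:"


def category_topics_query(category_name):
    """Build a SPARQL query selecting the resources filed under a Wikipedia
    category or under one of its direct subcategories (hypothetical helper)."""
    category_uri = DBPEDIA_CATEGORY_PREFIX + category_name.strip().replace(" ", "_")
    return f"""PREFIX dct: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?topic WHERE {{
  {{ ?topic dct:subject <{category_uri}> . }}
  UNION
  {{ ?topic dct:subject ?subcategory .
     ?subcategory skos:broader <{category_uri}> . }}
}}"""


# The nine category names as they appear in the slide deck's results table.
CATEGORIES = [
    "Government of the United States", "Education", "Nutrition",
    "Social issues", "Barack Obama", "Health care", "Arts",
    "Military personnel", "Gender equality",
]
queries = {name: category_topics_query(name) for name in CATEGORIES}
```

Because a resource can be filed under several categories, the result sets of different queries may overlap, which is consistent with a topic being associated with more than one high-level category.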
  • 14. A focus on the First Lady (iii)
    | Wikipedia Category | First Lady speeches (9 categories) | All speeches (9 categories) |
    | Government of the United States | 26.68% | 32.68% |
    | Education | 21.64% | 5.40% |
    | Nutrition | 19.96% | 1.61% |
    | Social issues | 14.71% | 28.38% |
    | Barack Obama | 13.66% | 14.00% |
    | Health care | 11.34% | 7.57% |
    | Arts | 8.61% | 1.11% |
    | Military personnel | 3.99% | 3.16% |
    | Gender equality | 2.73% | 0.84% |
    | Others (unclassified topics) | 25.63% | 38.34% |
  • 15. Conclusions (i)
    ● The ability for citizens to easily retrieve the content of political speeches and decisions is a crucial factor in e-participation
    ● This is not guaranteed by a traditional keyword search, as found on most public administration websites (the White House website included)
    ● Example: in a keyword-based system, users who type the word "education" get as a result only the videos that have the word "education" in their title
    ● All terms that belong to the semantic area of education are omitted
  • 16. Conclusions (ii)
    ● When documents are semantically classified through DBpedia URIs, all synonyms, hypernyms and hyponyms of lemmas are traced to the same concept, making user search more effective
    ● Leveraging Wikipedia categories would allow going a step further, taking advantage of the links between concepts as designed by the Wikipedia community
  • 17. Next steps
    ● Building a content search/navigation layer around the scraping/classification module
    ● Integration with other Linked Open Data repositories on the Web, combining the extracted topics with other information (President Obama's federal budget proposal?)
  • 18. Thank you! Giuseppe Futia (giuseppe.futia@polito.it) This paper was drafted in the context of the Network of Excellence in Internet Science EINS (GA n°288021), and, in particular, in relation with the activities concerning Evidence and Experimentation (JRA3).
  • 19. Appendix - The algorithm
    ● The classifier needs to hold in memory all the instances of the training set and calculate, during the classification stage, the vector distance between training documents and target documents.
    ● Specifically, the algorithm used by TellMeFirst (TMF) is k-Nearest Neighbor (kNN), a type of memory-based approach which selects the categories for a target document on the basis of the k most similar documents within the vector space.
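A minimal kNN sketch follows. It is not TellMeFirst's code: it assumes sparse term-weight vectors stored as dictionaries, uses cosine similarity as the closeness measure, and picks the label by majority vote among the k nearest training documents. The function names and the toy vectors are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors given as {term: weight}."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(target, training, k=3):
    """Return the most frequent label among the k training documents
    that are most similar to the target vector.

    training: list of (label, vector) pairs kept in memory.
    """
    nearest = sorted(training,
                     key=lambda item: cosine_similarity(target, item[1]),
                     reverse=True)[:k]
    votes = Counter(label for label, _ in nearest)
    return votes.most_common(1)[0][0]


# Invented toy training set, for illustration only.
toy_training = [
    ("Health_care", {"health": 2.0, "insurance": 1.0}),
    ("Health_care", {"health": 1.0, "doctor": 1.0}),
    ("Libya", {"tripoli": 1.0, "revolt": 1.0}),
]
toy_target = {"health": 1.0, "insurance": 1.0}
predicted = knn_classify(toy_target, toy_training, k=2)
```

Every training vector is compared with the target at classification time, which is why the slide stresses that the whole training set must be held in memory.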
  • 20. Appendix - Scoring formula
    ● In a Lucene query, both the target document and the training set become weighted term vectors, where terms are weighted by means of the TF-IDF algorithm. The query returns a list of documents in the form of DBpedia URIs, ordered by similarity score. The scoring formula is: [formula image not included in the transcript]
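The formula on the slide did not survive in this transcript. As an assumption about what it showed, the classic TF-IDF practical scoring function documented for Lucene's `TFIDFSimilarity` has the form below, where q is the query built from the target document, d is an indexed document representing a DBpedia resource, and t ranges over the query terms. Whether TellMeFirst used this default similarity unchanged is not stated in the slides.

```latex
\[
\mathrm{score}(q,d) \;=\; \mathrm{coord}(q,d)\cdot\mathrm{queryNorm}(q)\cdot
\sum_{t \in q}\Bigl(\mathrm{tf}(t,d)\cdot\mathrm{idf}(t)^{2}\cdot\mathrm{boost}(t)\cdot\mathrm{norm}(t,d)\Bigr)
\]
\[
\mathrm{tf}(t,d)=\sqrt{\mathrm{freq}(t,d)},\qquad
\mathrm{idf}(t)=1+\ln\!\left(\frac{\mathrm{numDocs}}{\mathrm{docFreq}(t)+1}\right)
\]
```

Here coord rewards documents that match more of the query terms, queryNorm makes scores from different queries comparable without changing the ranking, and norm folds in index-time boosts and length normalization.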
  • 21. Appendix - Basic concepts
    ● Natural Language Processing - A field of computer science, concerned with the interactions between computers and human (natural) languages.
    ● Linked Data - A recommended best practice for exposing, sharing, and connecting pieces of data, information, and knowledge on the Semantic Web using URIs and RDF.
    ● DBpedia - A crowd-sourced community effort to extract structured information from Wikipedia and a central interlinking hub for the Linking Open Data project.
