Populating DBpedia FR and using it for Extracting Information

Talk for the 3rd DBpedia community meeting.

Published in: Data & Analytics

  1. Julien Plu (julien.plu@eurecom.fr, @julienplu): Populating DBpedia FR and using it for Extracting Information
  2. Agenda (09/02/2015 - 3rd DBpedia Community Meeting – Dublin, Ireland)
     • Mapping the French infoboxes
     • How is DBpedia FR used at Orange?
     • Presentation of the Orange challenge
     • Project: ExtSem
       • Module 1: ParseText
       • Module 2: BuildDepGraph
       • Module 3: ExtractRDF
       • Module 4: SelectRDF
     • Experiments
  3. Mapping the French infoboxes
     • The set of mappings has grown significantly over the last three years (2012-2015)
       • 208 infoboxes have mappings
       • I have contributed to 100 mappings
       • This amounts to 50% of the French Wikipedia articles that have an infobox
     • Examples:
       • Infobox Communes de France (mapping): 36765 occurrences
       • Infobox Musique (œuvre) (mapping): 29429 occurrences
  4. How is DBpedia FR used at Orange?
     • Used as a knowledge graph for the in-house Web search engine
     • Used to interlink background knowledge with internal data about films (AlloCine) and music (Deezer)
     • Used as a knowledge provider for public tools in IPTV
     • Used for the recommendation system in the VOD service
  5. Presentation of the Orange challenge
     • Team members: Guillaume Viland, Jonathan Marchand, Julien Plu
     • Internal challenge to get new research projects
     • Only two weeks to produce something to present
  6. Project: ExtSem
     • Goal: extracting relations among named entities in raw text
     • Example: L'excentrique Lady Gaga est au coeur de l'actu depuis qu'elle a dévoilé son single "Applause" issu de son quatrième album à découvrir à partir du 11 novembre. (English: The eccentric Lady Gaga has been at the heart of the news since she unveiled her single "Applause", taken from her fourth album, available from 11 November.)
     • Results (subject, predicate, object):
       • Lady Gaga, etre, aucoeurdeactu
       • Lady Gaga, devoiler, Applause (chanson)
  7. Module 1: ParseText
     • Pipeline: .txt → tokenizer and PoS tagger (Melt) → .inmalt → parser (MaltParser) → .conll06
     • The part-of-speech tagger and the parser are stochastic and trained on the French Dependency Treebank
     • Deep syntactic analysis with dependencies
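The `.conll06` extension suggests the CoNLL-X (2006) shared-task layout, which has ten tab-separated columns per token (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) and a blank line between sentences. Assuming that layout, a minimal Python sketch for reading the parser output could look like this. The sample sentence, its tags, and the names `Token` and `parse_conll` are illustrative and not taken from the project.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    idx: int      # 1-based position in the sentence (ID column)
    form: str     # surface form (FORM)
    lemma: str    # lemma (LEMMA)
    pos: str      # fine-grained tag (POSTAG)
    head: int     # index of the governor, 0 for the root (HEAD)
    deprel: str   # dependency label (DEPREL)


def parse_conll(text: str) -> List[List[Token]]:
    """Split CoNLL-X text into sentences, each a list of Token objects."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        current.append(
            Token(int(cols[0]), cols[1], cols[2], cols[4], int(cols[6]), cols[7])
        )
    if current:
        sentences.append(current)
    return sentences


# Hand-written sample (assumed tags and labels, for illustration only).
SAMPLE = "\n".join([
    "1\tLady_Gaga\tLady_Gaga\tN\tNPP\t_\t2\tsuj\t_\t_",
    "2\tdévoile\tdévoiler\tV\tV\t_\t0\troot\t_\t_",
    "3\tApplause\tApplause\tN\tNPP\t_\t2\tobj\t_\t_",
])

sentences = parse_conll(SAMPLE)
for tok in sentences[0]:
    print(tok.idx, tok.form, tok.head, tok.deprel)
```

Keeping only six of the ten columns is a simplification; the dropped columns are often filled with `_` placeholders.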
  8. Module 2: BuildDepGraph
     • Pipeline: .conll06 + .nerd → buildDepGraph → .depnt
     • This module merges the output of the NERD framework with the syntactic analysis
     • The output is RDF, modeled with a vocabulary mapped onto French POS tags
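The slide does not describe how the merge is done. One plausible approach, sketched below in Python, is to align named-entity annotations with tokens by character offsets. The entity record layout (`start`, `end`, `uri`), the function name `merge_entities`, and the example URIs are assumptions made for illustration; they are not the actual `.nerd` format.

```python
from typing import Dict, List, Optional, Tuple

# A token is (start offset, end offset, surface form).
Tok = Tuple[int, int, str]


def merge_entities(tokens: List[Tok], entities: List[Dict]) -> List[Dict]:
    """Attach an entity URI to every token lying inside an entity span."""
    merged = []
    for start, end, form in tokens:
        uri: Optional[str] = None
        for ent in entities:
            if ent["start"] <= start and end <= ent["end"]:
                uri = ent["uri"]
                break
        merged.append({"form": form, "uri": uri})
    return merged


# Text being annotated: "Lady Gaga dévoile Applause"
tokens = [(0, 4, "Lady"), (5, 9, "Gaga"), (10, 17, "dévoile"), (18, 26, "Applause")]

# Hypothetical entity annotations with character offsets.
entities = [
    {"start": 0, "end": 9, "uri": "http://fr.dbpedia.org/resource/Lady_Gaga"},
    {"start": 18, "end": 26, "uri": "http://fr.dbpedia.org/resource/Applause_(chanson)"},
]

merged = merge_entities(tokens, entities)
for item in merged:
    print(item["form"], item["uri"])
```

Tokens that share a URI ("Lady" and "Gaga" here) can then be collapsed into a single node of the dependency graph.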
  9. Module 3: ExtractRDF
     • Pipeline: .depnt → extractRdf → .fullnt
     • .depnt example (figure not included in this transcript)
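The transcript gives no detail on the extraction rules themselves. As a rough illustration of what a single grammar rule might look like, the Python sketch below pairs each subject dependent of a word with each of its object dependents and uses the governor's lemma as the predicate. The labels `suj` and `obj`, the `Token` fields, and the function `extract_svo` are assumptions, not the project's actual rules.

```python
from typing import Dict, List, NamedTuple, Tuple


class Token(NamedTuple):
    idx: int      # 1-based position in the sentence
    lemma: str    # lemma, or entity label when the token is an entity
    head: int     # index of the governor, 0 for the root
    deprel: str   # dependency label (assumed label set)


def extract_svo(tokens: List[Token]) -> List[Tuple[str, str, str]]:
    """Return (subject, predicate, object) triples for one sentence."""
    # Group dependents under their governor.
    by_head: Dict[int, List[Token]] = {}
    for tok in tokens:
        by_head.setdefault(tok.head, []).append(tok)

    triples = []
    for governor in tokens:
        deps = by_head.get(governor.idx, [])
        subjects = [d for d in deps if d.deprel == "suj"]
        objects = [d for d in deps if d.deprel == "obj"]
        for s in subjects:
            for o in objects:
                triples.append((s.lemma, governor.lemma, o.lemma))
    return triples


tokens = [
    Token(1, "Lady_Gaga", 2, "suj"),
    Token(2, "dévoiler", 0, "root"),
    Token(3, "Applause", 2, "obj"),
]

triples = extract_svo(tokens)
print(triples)
```

A real rule set would need many more patterns (copulas, prepositional objects, passives), which matches the remark in the conclusion about adding grammar rules.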
  10. Module 4: SelectRDF
      • Pipeline: .fullnt → selectRdf → .nt
      • This module selects the triples that have a URI as subject
      • The module can also be customized for a given topic, to map the predicates to properties from well-known vocabularies
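Assuming the `.fullnt` and `.nt` files are N-Triples, where a URI subject is written between angle brackets and a blank node starts with `_:`, the selection step can be sketched in a few lines of Python. The `http://example.org/extsem/` predicate namespace, the sample triples, and the choice of `foaf:made` as mapping target are purely illustrative.

```python
from typing import Dict, Iterable, List, Optional


def select_triples(
    lines: Iterable[str],
    predicate_map: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Keep triples whose subject is a URI, optionally remapping predicates."""
    predicate_map = predicate_map or {}
    selected = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        subject, predicate, rest = line.split(None, 2)
        if not subject.startswith("<"):
            continue  # blank node subject: drop the triple
        predicate = predicate_map.get(predicate, predicate)
        selected.append(f"{subject} {predicate} {rest}")
    return selected


# Hypothetical input triples.
FULL = [
    "<http://fr.dbpedia.org/resource/Lady_Gaga> "
    "<http://example.org/extsem/devoiler> "
    "<http://fr.dbpedia.org/resource/Applause_(chanson)> .",
    '_:b0 <http://example.org/extsem/decouvrir> "album" .',
]

# Illustrative topic-specific mapping to a well-known vocabulary (FOAF).
MAPPING = {
    "<http://example.org/extsem/devoiler>": "<http://xmlns.com/foaf/0.1/made>",
}

result = select_triples(FULL, MAPPING)
for triple in result:
    print(triple)
```

Splitting on whitespace is enough here because only the subject and predicate are inspected; a full N-Triples parser would be the safer choice for production data.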
  11. Experiments
      • For one month, we processed the daily articles (480) from “Closer” magazine.
      • Some statistics:
        • 2800 triples extracted
        • 971 distinct entities
        • 657 distinct predicates
        • At least 4 triples extracted per article
      • Qualitative analysis:
        • 57% of the triples are about relationships between celebrities (weddings, cheating, rumors, etc.)
        • 43% of the triples are about diverse topics such as sport, fashion or politics
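A quick sanity check of these figures, assuming that 480 is the total number of articles processed over the month and that the percentages apply to all 2800 triples:

```python
TRIPLES = 2800
ARTICLES = 480  # assumed to be the total for the month

# Average number of triples per article.
average = TRIPLES / ARTICLES

# Approximate split of triples by topic, using integer arithmetic.
celebrity = TRIPLES * 57 // 100
other = TRIPLES * 43 // 100

print(f"average triples per article: {average:.1f}")
print(f"celebrity relationships: {celebrity}, other topics: {other}")
```

Under those assumptions the average is a little under 6 triples per article, which is consistent with the stated minimum of 4.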
  12. Conclusion
      • Good results for two weeks of work (3rd place out of 7 participants in this challenge)
      • The idea behind this project has been taken up by Orange Labs for further exploitation
      • Possible evolutions:
        • Automatic mapping of the predicates
        • Add more grammar rules to get more triples
        • Improve the performance (the process is slow and long)
        • A machine learning algorithm to classify which triples are useful (interesting) or not
