Semantic Representation of Provenance in Wikipedia

presented @ISWC2010 - SWPM (Sem Web Provenance Management) workshop

Transcript of "Semantic Representation of Provenance in Wikipedia"

  1. Semantic Representation of Provenance in Wikipedia
     Fabrizio Orlandi¹, Pierre-Antoine Champin², Alexandre Passant¹
     SWPM 2010, Shanghai, 7th Nov 2010
     ¹ Digital Enterprise Research Institute (www.deri.ie), National University of Ireland, Galway
     ² LIRIS, Université de Lyon, CNRS, UMR5205, Lyon
     © Copyright 2009 Digital Enterprise Research Institute. All rights reserved.
  2. Motivation
     • Wikipedia is one of the most widely known knowledge bases available on the Web.
     • Everyone can contribute, which raises trust and quality concerns.
     • Provenance information can be used to identify trust and quality values for pages.
     Data provenance is the history, the origins and the evolution of data. It gives the ability to answer the following questions about data:
     • Who created/modified it?
     • When?
     • What is the content?
     • Where is it located?
     • How and why was it created?
     • Which tools and processes were used?
  3. Semantic provenance in Wikipedia
     • By representing Wikipedia provenance information with Semantic Web technologies we enable:
       – Transparency
       – Reusability
       – Integration with the Web of Data
     • Our contribution:
       – A semantic model to represent provenance information in wikis
       – A software architecture to extract provenance from Wikipedia
       – An application that uses and exposes provenance data to compute measures and statistics on Wikipedia articles
  4. SIOC: Semantically-Interlinked Online Communities
     Describes the content and structure of community sites.
     The SIOC Core ontology: http://rdfs.org/sioc/spec
     • Wiki and WikiArticle classes with the SIOC Types module.
     Advantages of using SIOC:
     • Widely used on the Web.
     • Integration with existing SIOC data and other popular lightweight ontologies like FOAF, DC, etc.
     • Same queries to find items on a Wiki or a Blog, Forum, etc.
  5. The SIOC Actions module
     • From a document-centric (SIOC) to an action-centric (SIOC Actions) view of online communities. [Champin, Passant – 2010]
     • It represents the dynamics of online communities, how they evolve:
       – A set of actions, performed by a user at some time, impacting one or more objects.
       – In Wikipedia, actions are edits made by users on the articles.
     • Relies on the Event Ontology [Raimond et al. - 2007]: http://motools.sourceforge.net/event/event.html
  6. The W7 Model
     • Ontological model created to describe the semantics of data provenance [Ram, Liu - 2007]
       – Based on Bunge's ontology (1977).
       – Tracks the history of the events affecting the status of things during their life cycle.
       – Extensible and generic, it can be used in different domains.
       – 7 interrogative words: What, How, When, Where, Who, Which, Why.
       – Not implemented in RDFS/OWL.
  7. Our modelling solution: 1 – What
     An event (i.e. change of state) that happens to data during its lifetime.
     In Wikipedia every type of event (creation, modification, deletion) leads to the creation of a new article revision.
     Just using SIOC Core we can model versioning and history of wiki articles.

       <http://example.com/action?title=Linked_Data#38010613>
           sioca:creates <http://en.wikipedia.org/w/index.php?title=Linked_Data&oldid=38010613>;
           sioca:modifies <http://en.wikipedia.org/wiki/Linked_Data>;
           a sioca:Action.
  8. Our modelling solution: 2 – How
     The action leading to an event.
     • In Wikipedia the actions are the edits applied to the articles.
     • By analyzing diffs between revisions we identify the type of action involved in the creation of the newer revision: ( Insertion | Update | Deletion ) of a ( Sentence | Reference ).
     • To model the differences between revisions we created a lightweight Diff ontology that aims at describing changes to plain text documents (http://vocab.deri.ie/diff#).
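     The slides do not show how the diffs are computed, so the following is only a minimal sketch of the idea in Python, assuming a naive sentence splitter and the standard-library `difflib` module. The function names `split_sentences` and `classify_changes` are hypothetical, and references are not handled.

```python
import difflib
import re


def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def classify_changes(old_text, new_text):
    """Compare two revisions sentence by sentence.

    Returns a list of (change_type, old_sentences, new_sentences) tuples,
    where change_type is "Insertion", "Update" or "Deletion".
    """
    old, new = split_sentences(old_text), split_sentences(new_text)
    labels = {"insert": "Insertion", "replace": "Update", "delete": "Deletion"}
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # unchanged runs carry no provenance information
            changes.append((labels[tag], old[i1:i2], new[j1:j2]))
    return changes
```

     Each returned tuple could then be described with the Diff ontology; how the original PHP script maps diffs to that vocabulary is not stated in the slides.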
  9. Our modelling solution: 3 – When
     The time an event occurs.
     • In Wikipedia every edit has a timestamp recorded, and edits are considered instantaneous.
     • Use of dc:created or event:time

       <http://example.com/action?title=Linked_Data#380106133>
           dc:created "2010-08-21T06:36:17Z";
           event:time [ a time:Instant;
                        time:inXSDDateTime "2010-08-21T06:36:17Z" ];
           a sioca:Action.
  10. Our modelling solution: 4 – Where
      The online space or the location associated with an event.
      In Wikipedia the information about the location of the user editing the page is not provided. This information cannot be modelled.
  11. Our modelling solution: 5 – Who
      An agent involved in an event.
      In Wikipedia it is represented by the editor of a page.
      We use the sioc:UserAccount class to identify the account of the agent.

        <http://example.com/action?title=Linked_Data#36243686>
            sioc:has_creator <http://en.wikipedia.org/wiki/User:Timbl>;
            a sioca:Action.
  12. Our modelling solution: 6 – Which
      The programs or instruments used in the event.
      • In Wikipedia it is represented by the MediaWiki software used to edit the articles.
      • Different in case the editor is a “bot”.
  13. Our modelling solution: 7 – Why
      The reasons behind the event occurrence.
      • In Wikipedia it is defined by the justifications for a change inserted by a user in the “comment” field.
      • Property diff:comment with the diff:Diff class as domain.
  14. Our modelling solution
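      As a rough illustration of how the What, When and Who dimensions above fit together, the Python sketch below renders one edit as a `sioca:Action` in Turtle. It reuses the URI patterns from the slide examples. The function name `action_to_turtle` is hypothetical, the prefixes are assumed to be declared elsewhere, and the title and user name are assumed to be already URL-safe (for example, underscores instead of spaces).

```python
def action_to_turtle(title, rev_id, user, timestamp):
    """Render one Wikipedia edit as a sioca:Action (prefix declarations omitted)."""
    # URI patterns copied from the slide examples; example.com is a placeholder there too.
    action = f"http://example.com/action?title={title}#{rev_id}"
    revision = f"http://en.wikipedia.org/w/index.php?title={title}&oldid={rev_id}"
    article = f"http://en.wikipedia.org/wiki/{title}"
    creator = f"http://en.wikipedia.org/wiki/User:{user}"
    return "\n".join([
        f"<{action}>",
        "    a sioca:Action ;",
        f"    sioca:creates <{revision}> ;",      # What: the new revision
        f"    sioca:modifies <{article}> ;",      # What: the article affected
        f"    sioc:has_creator <{creator}> ;",    # Who: the editor's account
        f'    dc:created "{timestamp}" .',        # When: the edit timestamp
    ])


print(action_to_turtle("Linked_Data", 38010613, "Timbl", "2010-08-21T06:36:17Z"))
```

      The How and Why dimensions would be attached through the Diff ontology; they are left out here because the slides do not show the property that links an action to its diff.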
  15. Application using Wikipedia provenance data
      The application consists of 3 main parts:
      • Data Collection
        – Extracts and generates provenance data from Wikipedia using our model.
      • Firefox plug-in
        – From the provenance data collected, it computes and shows statistical information directly on Wikipedia pages.
      • Exposing the data to the Web of data
        – The statistical information and the provenance data are provided as Linked Open Data.
  16. Data Collection
      A PHP script has been developed to extract all the articles belonging to a category and all its subcategories and, for each article, its entire revision history.
      The program then extracts provenance information from the articles collected in the previous step: it computes the diff between versions and retrieves other information from the Wikipedia API.
      We ran our experiment on the “Semantic Web” category and all its 166 Wikipedia articles. All the data has been loaded into an RDF store.
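      The original script is written in PHP and is not shown in the slides. The Python sketch below only illustrates which MediaWiki API requests such a collection step could issue. The helper names and the choice of `rvprop` fields are assumptions. Result continuation and recursion into subcategories are not handled.

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"  # English Wikipedia API endpoint


def category_members_url(category, limit=500):
    """URL listing the pages that belong to a category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmlimit": limit,
        "format": "json",
    }
    return f"{API}?{urlencode(params)}"


def revisions_url(title, limit=500):
    """URL fetching revision ids, timestamps, editors and edit comments of a page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|user|comment",
        "rvlimit": limit,
        "format": "json",
    }
    return f"{API}?{urlencode(params)}"


print(category_members_url("Semantic Web"))
print(revisions_url("Linked_Data"))
```

      The `user`, `timestamp` and `comment` fields correspond to the Who, When and Why dimensions of the model; the revision texts would still have to be fetched and compared to obtain the How.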
  17. Data Collection
  18. A Firefox plug-in
      • This application displays a table directly on top of Wikipedia articles, exposing information about the most active users and their edits.
      • It is composed of:
        – 1) The triplestore, exposing a SPARQL endpoint;
        – 2) A PHP script, which queries the triplestore and sends the results to the Greasemonkey script;
        – 3) A Greasemonkey script, which retrieves the URL of the loaded Wikipedia page, sends the request to the PHP script and then displays the returned HTML data on the Wikipedia page.
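      The queries sent to the triplestore are not shown in the slides. As a hedged illustration, a "most active users" query over the model described earlier might look like the one built by the Python helper below. The helper name `top_editors_query` is hypothetical, the aggregate syntax assumes a SPARQL 1.1 endpoint, and the two namespace URIs are assumptions about the SIOC Core and SIOC Actions vocabularies.

```python
def top_editors_query(article_uri, limit=10):
    """Build a SPARQL query counting edit actions per user for one article."""
    return f"""
PREFIX sioc:  <http://rdfs.org/sioc/ns#>
PREFIX sioca: <http://rdfs.org/sioc/actions#>
SELECT ?user (COUNT(?action) AS ?edits)
WHERE {{
  ?action a sioca:Action ;
          sioca:modifies <{article_uri}> ;
          sioc:has_creator ?user .
}}
GROUP BY ?user
ORDER BY DESC(?edits)
LIMIT {int(limit)}
""".strip()


print(top_editors_query("http://en.wikipedia.org/wiki/SIOC", limit=5))
```

      In the architecture above, a query of this kind would be issued by the PHP script, which then turns the result rows into the HTML table injected by the Greasemonkey script.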
  19. A Firefox plug-in
  20. To the Web of data
      • The application is currently available at http://vmuss06.deri.ie/WikiProvenance/index.php.
      • Using this web service it is possible to obtain RDF for the provenance data generated with our model.
      • It is also possible to obtain, in RDF, the statistical information displayed by the Firefox plug-in.
      • To represent the statistics we use SCOVO, the Statistical Core Vocabulary (http://vocab.deri.ie/scovo).
  21. To the Web of data
      • As an example, the following triples state that the user “KingsleyIdehen” made 11 edits on the SIOC page:

        @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
        @prefix WikiStats: <http://vmuss06.deri.ie/WikipediaStats.owl#>.
        @prefix scovo: <http://purl.org/NET/scovo#>.

        <WikiStats:title=SIOC&user=KingsleyIdehen&edits>
            a scovo:Item ;
            rdf:value 11 ;
            scovo:dimension WikiStats:Edits ;
            scovo:dimension <http://wikipedia.org/wiki/SIOC>;
            scovo:dimension <http://wikipedia.org/wiki/User:KingsleyIdehen>.
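      To make the step from raw edits to such a statistical item concrete, here is a small Python sketch, assuming the edits are available as (article title, user name) pairs. The function names are hypothetical, and the subject IRI simply mirrors the pattern used in the example above.

```python
from collections import Counter


def edit_counts(edits):
    """Count edits per (article_title, user_name) pair; one pair per edit."""
    return Counter(edits)


def scovo_item(title, user, count):
    """Render one edit count as a scovo:Item (prefix declarations omitted)."""
    return "\n".join([
        f"<WikiStats:title={title}&user={user}&edits>",
        "    a scovo:Item ;",
        f"    rdf:value {count} ;",
        "    scovo:dimension WikiStats:Edits ;",
        f"    scovo:dimension <http://wikipedia.org/wiki/{title}> ;",
        f"    scovo:dimension <http://wikipedia.org/wiki/User:{user}> .",
    ])


# Toy input, not real Wikipedia data: 11 edits by one user, 1 by another.
edits = [("SIOC", "KingsleyIdehen")] * 11 + [("SIOC", "SomeOtherUser")]
for (title, user), count in edit_counts(edits).items():
    print(scovo_item(title, user, count))
```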
  22. Conclusions and Future Work
      Our contribution:
      • A specific lightweight ontology for provenance in wikis, based on the W7 model and SIOC.
      • A framework for the extraction of provenance data from Wikipedia.
      • An application to access the generated data in a meaningful way and to expose it to the Web of data.
      Future work:
      • A refinement of the proposed model and an alignment with other general-purpose ontologies for provenance representation.
      • To improve the performance and extend the features of the application.
      • To model statistics using the SDMX vocabulary (Statistical Data and Metadata eXchange).
      Comment:
      • Very large amount of data generated for the “Semantic Web” category and its 166 articles: almost 1.5 million triples for a total of 8,656 revisions.
  23. Questions?
      Applications and source code: http://vmuss06.deri.ie/WikiProvenance/index.php
      The Diff ontology: http://vocab.deri.ie/diff#
      Contacts: fabrizio.orlandi@deri.org, @BadmotorF, http://www.slideshare.net/badmotorfinger