20110728 datalift-rpi-troy

Transcript

  • 1. The Datalift Project: Ontologies, Datasets, Tools and Methodologies to Publish and Interlink ★★★★★ Datasets. François Scharffe, University of Montpellier, LIRMM, INRIA. francois.scharffe@lirmm.fr, @lechatpito. With the help of the Datalift team and the support of the French National Research Agency. RPI, 28/07/2011
  • 2. State of government open data (September 2010…): You’re here
  • 3. State of government open data (June 2011)
  • 4. Linking Open Data cloud diagram (May 2007, April 2008, September 2008, March 2009, September 2010), by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
  • 5. Linked data: link the world
  • 6. W3C
  • 7. W3C
  • 8. Principles
      § Use the RDF format
      § Use URIs to name things
      § Use HTTP URIs (URLs) so that one can look up those names
      § Give information (HTML, RDF) when those URIs are dereferenced
      § Include in this information other URIs pointing to other data, to enable discovery
      Tim Berners-Lee, http://www.w3.org/DesignIssues/LinkedData.html
  • 9. Goal of Datalift: from raw published data to interconnected semantic data
  • 10. Phase 1: opening the data. Develop a platform that eases publication
  • 11. Welcome aboard the data lift. Floors, top to bottom: published and interlinked data on the Web; applications; interconnection; publication infrastructure; data conversion; vocabulary selection; raw data
  • 12. Example publication process. Diagram labels: environmental, weather, geological datasets; SPARQL; content negotiation; URI dereferencing; oil industry; geography; equipment
  • 13. 1st floor - Selection. SemWebPro 18/01/2011
  • 14. Vocabularies of my friends...
      Ø What is a (good) vocabulary for linked data?
        § Usability criteria: simplicity, visibility, sustainability, integration, coherence…
      Ø Different types of vocabularies
        § Metadata, reference, domain, generalist…
        § The pillars of Linked Data: Dublin Core, FOAF, SKOS
      Ø Good and less good practices
        § E.g. BBC Programmes vs. legislation.gov.uk
        § Vocabulary of a Friend: networked vocabularies
      Ø Linguistic problems
        § 99% of existing vocabularies are in English
        § Terminological approach: which vocabularies for « Event », « Organization »?
  • 15. Did you say « vocabulary »…
      And why not « ontology »?
        § « Schema » or « metadata schema »?
        § Or « model » (of the data? of the world?)
      Ø All these terms are used and justifiable. They are all « vocabularies »
        § They define types of objects (or classes) and the properties (or attributes) attached to these objects
        § Types and attributes are logically defined and named using natural language
        § A (semantic) vocabulary is an explicit formalization of concepts existing in natural language
  • 16. Vocabularies for linked data
      Ø Are meant to describe resources in RDF
      Ø Are based on one of the standard W3C languages
        § RDF Schema (RDFS)
          • For vocabularies without too much logical complexity
        § OWL
          • For more complex ontological constructs
        § These two languages are (almost) compatible
      Ø They can be composed « ad libitum »
        § One can reuse just a few elements of a vocabulary
        § The original semantics have to be respected
  • 17. What makes a good vocabulary?
      Ø A good vocabulary is a used vocabulary
        § Data published on CKAN give an idea of vocabulary usage
        § Example: list of datasets using FOAF, http://xmlns.com/foaf/0.1/
      Ø Other usability criteria
        § Simplicity and readability in natural language
        § Documentation of elements (definitions in natural language)
        § Visibility and sustainability of the publication
        § Flexibility and extensibility
        § Semantic integration (with other vocabularies)
        § Social integration (with the user community)
  • 18. A vocabulary is also a community
      Ø Bad (but common) practice
        § Building a lonely vocabulary
          – For example as a research project
          – Without basing it on any existing vocabulary
        § Publishing it (or not) and then forgetting about it
        § Not caring about its users
      Ø A good vocabulary has an organic life
        § Users and use cases
        § Revisions and extensions
        § Like a « natural » vocabulary
  • 19. Types of vocabularies
      Ø Metadata vocabularies
        § Allow annotating other vocabularies
          • Dublin Core, Vann, cc REL, Status, Void
      Ø Reference vocabularies
        § Provide « common » classes and properties
          • FOAF, Event, Time, Org Ontology
      Ø Domain vocabularies
        § Specific to a domain of knowledge
          • Geonames, Music Ontology, WildLife Ontology
      Ø « General » vocabularies
        § Describe « everything » at an arbitrary level of detail
          • DBpedia Ontology, Cyc Ontology, SUMO
  • 20. Vocabulary of a Friend
      Ø http://www.mondeca.com/foaf/voaf
      Ø A simple vocabulary...
      Ø To represent interconnections between vocabularies
      Ø A single entry point to the vocabularies and datasets of the Linked Data Cloud
      Ø Ongoing work in Datalift
  • 21. 2nd floor - Conversion
  • 22. Reference datasets, URI design
      ● Providing reference datasets for the French ecosystem: geographical, topological, statistical, political
      ● Providing URI design guidelines
        ● Opaque or transparent URIs?
        ● Usage of accents in URIs
        ● Distinction between
          Resources: http://dbpedia.org/resource/Paris
          Documents: http://dbpedia.org/page/Paris
          Data: http://dbpedia.org/data/Paris
          … all served with content negotiation
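The resource/document/data split on this slide can be illustrated with a small sketch. It only mimics the decision a server might take when redirecting a resource URI according to the `Accept` header. The function name `negotiate`, the list of RDF media types and the absence of q-value handling are assumptions made for illustration, not a description of how DBpedia or Datalift implement it.

```python
# Hypothetical sketch of content negotiation for the URI layout on the slide.
BASE = "http://dbpedia.org"
RDF_MEDIA_TYPES = ("application/rdf+xml", "text/turtle")  # assumed list


def negotiate(resource_uri: str, accept: str) -> str:
    """Return the URL a server could redirect to for a given Accept header."""
    prefix = BASE + "/resource/"
    if not resource_uri.startswith(prefix):
        raise ValueError("not a resource URI: " + resource_uri)
    name = resource_uri[len(prefix):]
    # Simplification: q-values are ignored, we only look for an RDF media type.
    wants_rdf = any(media_type in accept for media_type in RDF_MEDIA_TYPES)
    return BASE + ("/data/" if wants_rdf else "/page/") + name


if __name__ == "__main__":
    uri = "http://dbpedia.org/resource/Paris"
    print(negotiate(uri, "text/html"))    # HTML document about the resource
    print(negotiate(uri, "text/turtle"))  # RDF data about the resource
```

The resource URI itself never returns a representation in this sketch; it only names the thing and points to the document or the data.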
  • 23. Many tools exist! csv2rdf4lod
  • 24. Direct Mapping from relational database to RDF
      Define a standard transformation from a relational database to RDF. The relational schema is used:
        • Cells of a tuple produce triples with a common subject
        • Each cell produces an object
        • Different tables of the same database are thus linked together
      Standard automatic translation of any relational schema to RDF, based on the database dump. Then we can use SPARQL CONSTRUCT to adapt vocabularies and URIs.
  • 25. Example
      Credits: Ivan Herman, http://ivan-herman.name/2010/11/19/my-first-mapping-from-direct-mapping/
  • 26. Example

        @base <http://book.example/> .

        <Book/ID=0006511409X#_> a <Book> ;
            <Book#ISBN> "0006511409X" ;
            <Book#Title> "The Glass Palace" ;
            <Book#Year> "2000" ;
            <Book#Author> <Author/ID=id_xyz#_> .

        <Author/ID=id_xyz#_> a <Author> ;
            <Author#ID> "id_xyz" ;
            <Author#Name> "Ghosh, Amitav" ;
            <Author#Homepage> "http://www.amitavghosh.com" .

      A simple result, but not a satisfying one:
        ● we want to use different vocabulary terms (like a:name)
        ● the direct mapping produces literal objects most of the time, except when there is a “jump” from one table to another
        ● the resulting graph should use a blank node for the author, which is not the case in the generated graph
      Credits: Ivan Herman, http://ivan-herman.name/2010/11/19/my-first-mapping-from-direct-mapping/
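To make the mapping rule concrete, here is a minimal sketch that turns table rows into triples following the pattern of the example: one subject per row, one predicate per column, and a URI object only when a column is a foreign key. The function `direct_map`, its parameters and the dictionary-based rows are hypothetical. The sketch also emits a triple for the key column, and it is not the W3C Direct Mapping algorithm itself.

```python
# Hypothetical sketch of a direct-mapping-style conversion of table rows.
def direct_map(table, rows, key, foreign_keys=None):
    """Return (subject, predicate, object) triples for each row of a table.

    foreign_keys maps a column name to (target_table, target_key).
    """
    foreign_keys = foreign_keys or {}
    triples = []
    for row in rows:
        # All cells of a tuple share one subject, built from the primary key.
        subject = "<%s/%s=%s#_>" % (table, key, row[key])
        triples.append((subject, "a", "<%s>" % table))
        for column, value in row.items():
            predicate = "<%s#%s>" % (table, column)
            if column in foreign_keys:
                # A foreign key "jumps" to the subject of another table's row.
                target_table, target_key = foreign_keys[column]
                obj = "<%s/%s=%s#_>" % (target_table, target_key, value)
            else:
                # Every other cell becomes a plain literal.
                obj = '"%s"' % value
            triples.append((subject, predicate, obj))
    return triples


if __name__ == "__main__":
    books = [{"ID": "0006511409X", "ISBN": "0006511409X",
              "Title": "The Glass Palace", "Year": "2000", "Author": "id_xyz"}]
    for triple in direct_map("Book", books, "ID", {"Author": ("Author", "ID")}):
        print(" ".join(triple), ".")
```

Because every non-key cell becomes a literal, the homepage in the slide's example stays a string; that is one of the shortcomings listed on the slide.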
  • 27. Example
      Solution: use SPARQL 1.1 CONSTRUCT queries

        CONSTRUCT {
          ?id a:title ?title ;
              a:year ?year ;
              a:author _:x .
          _:x a:name ?name ;
              a:homepage ?hp .
        }
        WHERE {
          SELECT (IRI(fn:concat("http://...",?isbn)) AS ?id)
                 ?title ?year ?name
                 (IRI(?homepage) AS ?hp)
          {
            ?book a <Book> ;
                  <Book#ISBN> ?isbn ;
                  <Book#Title> ?title ;
                  <Book#Year> ?year ;
                  <Book#Author> ?author .
            ?author a <Author> ;
                    <Author#Name> ?name ;
                    <Author#Homepage> ?homepage .
          }
        }
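The effect of the CONSTRUCT query can be sketched outside SPARQL as a plain rewriting step: mint a book URI from the ISBN, rename the properties into the `a:` vocabulary, turn the homepage string into a URI, and attach the author through a blank node. The function `restructure` and the base `http://example.org/book/` are assumptions for illustration only; the slide elides the real base as "http://...".

```python
# Hypothetical sketch mimicking the CONSTRUCT query on plain dictionaries.
def literal(value):
    """Wrap a value as a quoted literal."""
    return '"' + value + '"'


def restructure(book, author, base="http://example.org/book/"):
    """Rewrite one direct-mapped book/author pair into the target vocabulary."""
    book_id = "<" + base + book["ISBN"] + ">"  # IRI minted from the ISBN
    bnode = "_:x"                              # the author becomes a blank node
    return [
        (book_id, "a:title", literal(book["Title"])),
        (book_id, "a:year", literal(book["Year"])),
        (book_id, "a:author", bnode),
        (bnode, "a:name", literal(author["Name"])),
        # The homepage literal is promoted to a URI, like IRI(?homepage).
        (bnode, "a:homepage", "<" + author["Homepage"] + ">"),
    ]


if __name__ == "__main__":
    book = {"ISBN": "0006511409X", "Title": "The Glass Palace", "Year": "2000"}
    author = {"Name": "Ghosh, Amitav", "Homepage": "http://www.amitavghosh.com"}
    for triple in restructure(book, author):
        print(" ".join(triple), ".")
```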
  • 28. 3rd floor - Publication
  • 29. Datalift Platform
      V1 to be released in September, with expected features:
      - Modular architecture
      - Raw conversion module: relational DB (Direct Mapping approach), CSV, XML (based on a user-specified XSLT transformation)
      - Selection module: LOV repository, automatic proposal of candidate vocabularies using ontology matching from the raw data schema, vocabulary navigation tool, vocabulary usage metrics, sample data for each vocabulary
      - Conversion (according to the schema): RDF2RDF conversion module based on SPARQL CONSTRUCT (manual editing), vocabulary mapping facility (textual)
      - Interlinking and alignment: a Silk interface, integration of the Alignment API
      - Publication: Sesame API, management of informational vs. non-informational resources
  • 30. Datalift Platform
  • 31. 4th floor - Interconnection
  • 32. Web of data and links
      - Without links there is no web, only data silos
      - Many types of links: the edges of the Web of data graph are labeled
      - Some links are built during the selection phase: reference datasets
      - We study here a particular type of link: equivalence links
  • 33. owl:sameAs
      - Points to a logical identity between two resources
      - The quality of the available links is not always optimal
      Other types of links: owl:differentFrom, rdfs:seeAlso
  • 34. How to link data?
  • 35. How to link data?
  • 36. How to link data?
  • 37. How to link data?
  • 38. How to link data?
  • 39. Example Silk link specification

        <Silk>
          <Prefix id="rdfs" namespace="http://www.w3.org/2000/01/rdf-schema#" />
          <Prefix id="dbpedia" namespace="http://dbpedia.org/ontology/" />
          <Prefix id="gn" namespace="http://www.geonames.org/ontology#" />

          <DataSource id="dbpedia">
            <EndpointURI>http://demo_sparql_server1/sparql</EndpointURI>
            <Graph>http://dbpedia.org</Graph>
          </DataSource>

          <DataSource id="geonames">
            <EndpointURI>http://demo_sparql_server2/sparql</EndpointURI>
            <Graph>http://sws.geonames.org/</Graph>
          </DataSource>

          <Interlink id="cities">
            <LinkType>owl:sameAs</LinkType>
            <SourceDataset dataSource="dbpedia" var="a">
              <RestrictTo>?a rdf:type dbpedia:City</RestrictTo>
            </SourceDataset>
            <TargetDataset dataSource="geonames" var="b">
              <RestrictTo>?b rdf:type gn:P</RestrictTo>
            </TargetDataset>
            <LinkCondition>
              <AVG>
                <Compare metric="jaroSimilarity">
                  <Param name="str1" path="?a/rdfs:label" />
                  <Param name="str2" path="?b/gn:name" />
                </Compare>
                <Compare metric="numSimilarity">
                  <Param name="num1" path="?a/dbpedia:populationTotal" />
                  <Param name="num2" path="?b/gn:population" />
                </Compare>
              </AVG>
            </LinkCondition>
            <Thresholds accept="0.9" verify="0.7" />
            <Output acceptedLinks="accepted_links.n3" verifyLinks="verify_links.n3" mode="truncate" />
          </Interlink>
        </Silk>
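The link condition in this specification averages a string similarity on labels and a numeric similarity on populations, then sorts candidate pairs with the accept (0.9) and verify (0.7) thresholds. The sketch below reproduces that logic in a few lines, with two loudly stated substitutions: `difflib.SequenceMatcher` stands in for Silk's `jaroSimilarity`, and the min/max ratio used for numbers is an assumed definition of `numSimilarity`. The record values are made up for illustration.

```python
from difflib import SequenceMatcher

# Thresholds taken from the <Thresholds> element of the specification.
ACCEPT, VERIFY = 0.9, 0.7


def string_similarity(a, b):
    """Stand-in for jaroSimilarity (assumption): difflib ratio, not Jaro."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def numeric_similarity(x, y):
    """Assumed definition of numSimilarity: smaller value over larger value."""
    if x == y:
        return 1.0
    if x < 0 or y < 0:
        raise ValueError("expected non-negative values")
    return min(x, y) / max(x, y)


def score(source, target):
    """Average both comparisons, like the <AVG> aggregation."""
    label_score = string_similarity(source["label"], target["name"])
    population_score = numeric_similarity(source["population"],
                                          target["population"])
    return (label_score + population_score) / 2


def classify(value):
    """Sort a score into accepted links, links to verify, or rejects."""
    if value >= ACCEPT:
        return "accept"
    if value >= VERIFY:
        return "verify"
    return "reject"


if __name__ == "__main__":
    # Illustrative records only; the population figures are not real data.
    source = {"label": "Paris", "population": 2200000}
    target = {"name": "Paris", "population": 2200000}
    print(classify(score(source, target)))
```

Pairs classified as "verify" correspond to the links written to `verify_links.n3` in the specification, which are meant to be checked before being trusted.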
  • 40. Where to find links?
  • 41. Towards automatic interlinking
      We have seen that some of the Silk specification fields could be avoided by:
      - Using alignments between ontologies
      - Detecting discriminating properties
      - Indicating comparison methods by attaching metadata to ontologies
      -> … ongoing work in Datalift
  • 42. 5th floor - Applications
  • 43. Phase 2: publishing datasets. Validate the platform with real data
  • 44. Research objectives
      § Methods and metrics for selecting schemas
      § Tradeoff between specific and generic vocabularies
      § Data conversion and URI design patterns
      § Automatic data interlinking
      § Provenance and rights management
      § Integration, architecture and scalability
  • 45. Who? W3C © 2010-2013
  • 46. http://labs.mondeca.com/dataset/lov/index.html
  • 47. http://labs.mondeca.com/vocab/voaf/
  • 48. The wider French landscape
      ● Regards Citoyens
      ● Direction de l’information légale et administrative
      ● Fédération des parcs naturels régionaux de France
      ● Eurostat
      ● Cities of Montpellier, Bordeaux, Rennes, …
      ● Data Publica
      ● Etalab
  • 49. LIRMM D2R Server: http://data.lirmm.fr/nosdeputes/
  • 50. DATALIFT next floor: « the web of data »
  • 51. Credits
      This presentation was made possible by the work of the Datalift team. It can be freely distributed under the Creative Commons BY-NC-SA 3.0 licence.