Linked Open Data: CNR


Published in: Technology, Education

  1. and the Semantic Scout
     CNR Semantic Technology Lab, ISTC - SI
     Aldo Gangemi, Alberto Salvati, Enrico Daga, Gianluca Troiani
     Thanks to Claudio Baldassarre (UN-FAO) and Alfio Gliozzo (IBM-Watson)
  2. [no text extracted]
  3. Enhanced SPARQL endpoint
  4. Ontologies
  5. Sample class from ontology
  6. The Semantic Scout
     • A framework for search, presentation, and analysis of entities and their associated knowledge
     • Employs SW, LOD, NLP, IR
     • Scientific work goes back to 2006, first presented at ISWC2007
     • An evolving prototype for requirements of the EU IP IKS: semantic search, hybrid IR/SW identity management, automatic document classification (against DBpedia)
     • 2009 requirements from the technology transfer office of CNR for the NetwOrK initiative
  7. The CNR
     • CNR is the largest research institution in Italy
       – about 8000 permanent researchers (+14000)
       – 7 departments focused on the main scientific research areas
       – 108 institutes spread all over Italy
         • Subdivided into research units, labs, etc.
  8. The CNR data sources [diagram; only partly available as open data]
     • Data categories: organizational, activity-related, personnel-related, financial
     • Sources: several databases and a file system
     • Items shown: administration documentation; frameworks, programmes, workpackages; departments, institutes, central admin; publications; curricula; permanent employees; other research employees; accounting, contracts, invoicing; externally funded projects
  9. The CNR tasks
     • Strategic objective: matching the research demand to the research supply
     • Requirements
       – Semantic interoperability between heterogeneous data sources
       – Expert finding based on competence
       – Monitoring funding and evolution of different research areas and units
       – Browsing and reporting capabilities
  10. Architecture
  11. [no text extracted]
  12. Methods for data conversion, extraction, inference, integration, linking, publishing, and searching
  13. Figures
      • CNR Ontology: 28 modules, 120 classes, 300 relations, 1200 axioms
      • CNR Data: >200K entities, ≈3M facts (about 2M inferred or extracted), ≈240 datasets
  14. Sources and lifting
      • Situation usually not as clean as using a unique CMS for most organizational tasks
      • DB (e.g. SQL Server) + a lot of textual records + HTML Web Site + textual corpus + linked open data
      • DB + interaction schemata (XML templates and HTML scraping, needed because of schemata degradation and user perspective evolution)
  15. Ontology design
      • Starting from XML templates as module/pattern drafts
      • Reengineering XML and scraped templates
      • Reengineering DB schemata (system engineer involved)
      • Obtained modular, pattern-based, task-based ontology
      • Textual DB records with identity: precondition for hybridizing IR and SW (see later)
      • Alignments to FOAF, SIOC, SKOS, WordNet ontologies
      • Used patterns: situation, place, transitive reduction
  16. The CNR ontology
  17. Data design
      • Triplifiers based on SQL rules (automatic scripting on JDBC drivers not enough because of legacy degradation of physical schemata)
        – Cf. also: Semion reengineering tool
      • Inferences: OWL (Pellet, HermiT), SPARQL CONSTRUCT
      • Extraction tool: Semiosearch, categorizer over Wikipedia categories
        – Next: deep parsing approach (facts, relations, entities)
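The idea of a triplifier driven by hand-written SQL rules can be illustrated with a minimal sketch. It is not the CNR implementation: the `employee` table, its columns, the `http://example.org/cnr/` base URI, and the `memberOf` property are all hypothetical, and an in-memory SQLite database stands in for the legacy databases mentioned in the slides.

```python
import sqlite3

# Hypothetical base URI and vocabulary terms, chosen only for illustration.
BASE = "http://example.org/cnr/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"


def literal(value):
    """Serialize a value as an N-Triples plain literal."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def triplify(conn):
    """Apply one hand-written SQL rule and yield N-Triples lines."""
    # The rule names the columns explicitly instead of relying on
    # automatically discovered schema metadata.
    rule = "SELECT id, full_name, institute_id FROM employee ORDER BY id"
    for emp_id, name, inst_id in conn.execute(rule):
        subject = f"<{BASE}person/{emp_id}>"
        yield f"{subject} <{RDF_TYPE}> <{FOAF_PERSON}> ."
        yield f"{subject} <{FOAF_NAME}> {literal(name)} ."
        yield f"{subject} <{BASE}ontology/memberOf> <{BASE}institute/{inst_id}> ."


# Hypothetical sample data standing in for a legacy personnel database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER, full_name TEXT, institute_id TEXT)")
conn.execute("INSERT INTO employee VALUES (1, 'Example Researcher', 'istc')")

triples = list(triplify(conn))
for line in triples:
    print(line)
```

Writing the rule by hand keeps the mapping independent of whatever the physical schema has degraded into, which is the motivation the slide gives for not relying on automatic JDBC-based scripting.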
  18. Publishing and hybridizing
      • Publishing OWL-RDF datasets
        – linked data approach (persistent URIs, triple stores for RDF dataset management, linking to common vocabularies: FOAF, DBpedia, Geonames, Bibo, ...)
        – OWL ontologies for dataset generation, querying, inference (new enriched datasets)
      • Subgraph extraction through SNA
      • Virtual semantic corpus
        – IRW to distinguish information and non-information resources
        – SPARQL rules to generate virtual texts associated with entities
      • Indexing
        – Lucene+LSA indexing of semantic corpus
        – “Semantic” Lucene extension to produce tight coupling of virtual texts with entities
        – Multilinguality
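The coupling of virtual texts with entities can be sketched in a few lines. This is an illustration under stated assumptions, not the Semantic Scout code: plain Python loops replace the SPARQL rules, a toy inverted index replaces Lucene+LSA, and the `ex:` identifiers, properties, and labels are invented sample data.

```python
from collections import defaultdict

# Invented sample triples (subject, predicate, object).
TRIPLES = [
    ("ex:person1", "rdfs:label", "Example Researcher"),
    ("ex:person1", "ex:worksOn", "ex:project1"),
    ("ex:project1", "rdfs:label", "Reputation in social networks"),
    ("ex:person2", "rdfs:label", "Another Researcher"),
    ("ex:person2", "ex:worksOn", "ex:project2"),
    ("ex:project2", "rdfs:label", "Ontology design patterns"),
]


def virtual_texts(triples):
    """Build one virtual text per entity from its own label and the
    labels of the entities it links to."""
    label_of = {s: o for s, p, o in triples if p == "rdfs:label"}
    parts = defaultdict(list)
    for s, p, o in triples:
        if p == "rdfs:label":
            parts[s].append(o)
        elif o in label_of:
            parts[s].append(label_of[o])
    return {entity: " ".join(chunks) for entity, chunks in parts.items()}


def build_index(texts):
    """Map each lower-cased term to the set of entities whose virtual
    text contains it, so a keyword hit resolves directly to entities."""
    index = defaultdict(set)
    for entity, text in texts.items():
        for term in text.lower().split():
            index[term].add(entity)
    return index


def search(index, keyword):
    """Return the entities matching a single keyword, sorted by identifier."""
    return sorted(index.get(keyword.lower(), set()))


texts = virtual_texts(TRIPLES)
index = build_index(texts)
print(search(index, "reputation"))
```

Because the index is keyed by entity rather than by document, a keyword search returns graph nodes that can then be browsed semantically, which is the point of hybridizing IR and SW.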
  19. Consuming
      • SPARQL endpoint, with interface enhancement
      • Keyword-based search
        – Semantic browsing with SPARQL-based AJAX DHTML, RDF relation browser, or XML-based relation browser
      • Category-based search
        – Keyword-based result focusing
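A client consuming a SPARQL endpoint over HTTP typically sends the query as a URL-encoded `query` parameter, as defined by the SPARQL Protocol. The sketch below only builds such a request URL and does not contact any server. The endpoint address is a placeholder, and the use of `foaf:Person` and `foaf:name` is an assumption based on the FOAF alignment mentioned earlier, not a statement about the actual dataset.

```python
from urllib.parse import urlencode

# Placeholder endpoint; the real endpoint address is not given in the slides.
ENDPOINT = "http://example.org/sparql"


def keyword_query(keyword, limit=10):
    """Build a SPARQL SELECT that finds people whose name contains a keyword."""
    # Escape backslashes and quotes so the keyword stays inside the literal.
    safe = keyword.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
        "SELECT ?person ?name WHERE {\n"
        "  ?person a foaf:Person ; foaf:name ?name .\n"
        f'  FILTER(CONTAINS(LCASE(?name), LCASE("{safe}")))\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def request_url(query):
    """Encode the query as a GET request URL for the endpoint."""
    return ENDPOINT + "?" + urlencode({"query": query})


url = request_url(keyword_query("gangemi"))
print(url)
```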
  20. [no text extracted]
  21. [no text extracted]
  22. [no text extracted]
  23. Expert finding: Task-based testing
      • It is based on the ability to materialize on demand a contextual network of relevant information.
      • It is performed with a combination of tools in the toolkit to:
        – Identify the main topics of research
        – Recursively search the CNR data cloud
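Recursively searching a data cloud from an entry point amounts to a bounded graph traversal. The following sketch uses a breadth-first search with a depth limit to materialize a small contextual network. The graph, the node names, and the depth limit are hypothetical; the slides do not describe the traversal strategy actually used.

```python
from collections import deque

# Hypothetical linked-data neighbourhood: node -> directly linked nodes.
GRAPH = {
    "programme:A": ["person:1", "person:2"],
    "person:1": ["publication:x", "person:3"],
    "person:2": ["institute:i"],
    "person:3": ["publication:y"],
}


def contextual_network(graph, entry, max_depth=2):
    """Return every node reachable from `entry` within `max_depth` hops,
    mapped to its distance from the entry point."""
    depth = {entry: 0}
    queue = deque([entry])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbour in graph.get(node, []):
            if neighbour not in depth:
                depth[neighbour] = depth[node] + 1
                queue.append(neighbour)
    return depth


network = contextual_network(GRAPH, "programme:A", max_depth=2)
people = sorted(n for n in network if n.startswith("person:"))
print(people)
```

The depth limit is what keeps the network contextual: raising it pulls in more distant people and resources at the cost of relevance.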
  24. Identifying the main topics of research: project description
      • “Reputation is a social knowledge, on which a number of social decisions are accomplished. Regulating society from the morning of mankind becomes more crucial with the pace of development of ICT technologies, dramatically enlarging the range of interaction and generating new types of aggregation. Despite its critical role, reputation generation, transmission and use are unclear. The project aims to an interdisciplinary theory of reputation and to modeling the interplay between direct evaluations and meta-evaluations in three types of decisions, epistemic (whether to form a given evaluation), strategic (whether and how interact with target), and memetic (whether and which evaluation to transmit).”
        – Project About: Social Knowledge for e-Governance.
        – Topics can be manually annotated, or automatically induced, e.g.: ethics, sociology, collaboration, social network, reputation
  25. Identifying the main topics of research: text categorization
      • Query: “ethics, sociology, collaboration, social network, reputation”
  26. Search the CNR data cloud: identify an entry point
      • “Commessa” (programme): “Il Circuito dell’Integrazione: Mente, Relazioni e Reti Sociali. Simulazione Sociale e Strumenti di Governance”
  27. Search the CNR data cloud: identify key people
      • Ing. Jordi Sabater: Cognitive Science
      • Dott. Mario Paolucci: Sociology, Psychology
      • Gennaro di Tosto: Artificial Intelligence
      • Walter Quattrociocchi: Interdisciplinary Fields
      • Giuseppe Castaldi: Ethics
      • Aldo Gangemi: Semantic Web, Knowledge representation
  28. Expert Finding: Results
      • The description of the “eRep project” was adopted as a gold standard to evaluate the results when testing the Semantic Scout.
      • 6 out of 10 CNR researchers were correctly retrieved, along with a project member affiliated with another institution.
        – Project Coordinator: Dott. Mario Paolucci
        – External Member: Jordi Sabater Mir
  29. Functional evaluation of Semantic Scout (example)
      • Expert finding accuracy
        – All 6 retrieved people ranked among the first 10 results from the search engine.
      • Benefit of integrated data cloud
        – The user judged an “activity” to be relevant to his goal and used it as an entry point to the CNR network of resources.
  30. Functional evaluation of Semantic Scout
      • Accessibility and Interaction
        – Multiple user interfaces give users a level of interaction adapted to each specific type of required information
      • Completeness of retrieval
        – 4 people were not included in our result set.
        – Antonietta Di Salvatore: scored below the first 10 people in the list (+1)
        – Giulia Andrighetto: not listed among the people relevant to the query, but belongs to the social network of Dr. Rosaria Conte (+1)
        – Marco Capenni and Stefano Picascia: have a technician profile, hence they are neither reported among the people relevant to the search query, nor belong to the network of any of the other researchers.
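The completeness figures above correspond to a simple recall computation against the gold standard of 10 CNR researchers. The sketch below works it through. Reading each "(+1)" as a person who would be recovered by a longer result list or by following social-network links is an interpretation of the slide, not a reported result.

```python
# Counts taken from the slides.
GOLD_STANDARD_SIZE = 10    # CNR researchers in the gold-standard project
RETRIEVED_IN_TOP_10 = 6    # researchers found among the first 10 results

# Assumption: the two "(+1)" annotations mark people who could be
# recovered by relaxing the search, as described on the slide.
RECOVERABLE = {
    "Antonietta Di Salvatore": "ranked below the first 10 results",
    "Giulia Andrighetto": "in the social network of a listed researcher",
}


def recall(found, total):
    """Fraction of gold-standard people that were found."""
    return found / total


baseline = recall(RETRIEVED_IN_TOP_10, GOLD_STANDARD_SIZE)
extended = recall(RETRIEVED_IN_TOP_10 + len(RECOVERABLE), GOLD_STANDARD_SIZE)
unreachable = GOLD_STANDARD_SIZE - RETRIEVED_IN_TOP_10 - len(RECOVERABLE)

print(f"baseline recall: {baseline:.1f}")
print(f"recall with the two recoverable people: {extended:.1f}")
print(f"people not reachable by either route: {unreachable}")
```

Under this reading, the two people with a technician profile are the only ones that neither ranking depth nor network expansion would reach.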
  31. Ongoing work
      • More data linking (e.g. DBLP, Georeferencing)
      • Synchronization with data sources
      • More interaction paradigms
      • Privacy issues interlaced with hierarchical and idiosyncratic practices
  32. Conclusions
      • Hybridizing several semantic and retrieval technologies provides added value to a research organization
      • Scalability works for CNR figures
      • Interaction is a core selling point
      • Try it at
      • @data_cnr_it, @semanticscout, @aldogangemi