03 interlinking-dass


  • Files are linked from HTML files to other HTML files or other documents.
    They are the data you can interlink.
  • These data formats are designed for human consumption.
    A specialized tool is required to automate access, search and reuse.
    Additional processing is needed before these data can be incorporated into new projects.
  • 1- Data available on the web (in any format, even a scanned image)
    2- Available in a machine-readable format (e.g. Excel)
    3- Available in a non-proprietary format (CSV)
    4- Published using W3C standards
    5- All of the above, with links to other data
  • Linking to other URIs is the most important Linked Data principle, as it enables the paradigm change from data silos to interoperable data distributed across the Web.
    Furthermore, it plays a key role in important tasks such as cross-ontology question answering, large-scale inference and data integration.
  • Note that ontology matching and instance matching are similar to schema matching [127,126] and record linkage [161,37,25], respectively, as known in database research.
  • The time complexity of a matching task can be measured by the number of comparisons necessary to complete it.
    Reducing the time complexity of link discovery is a key requirement for instance linking frameworks for Linked Data.
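To make the comparison count concrete, here is a minimal sketch that assumes a naive approach comparing every pair of instances. The function names are hypothetical. Under this assumption, the figure of roughly 0.15 × 10^9 comparisons quoted in the slides for duplicate cities in DBpedia corresponds to about 17,300 instances; that number is back-calculated from the formula, not taken from the slides.

```python
def comparisons_for_linking(source_size: int, target_size: int) -> int:
    """Pairs compared when every source instance is checked against every target."""
    return source_size * target_size


def comparisons_for_deduplication(size: int) -> int:
    """Unordered pairs of distinct instances within a single data set."""
    return size * (size - 1) // 2


# Back-calculated illustration: about 17,300 instances already need
# close to 0.15 * 10**9 pairwise comparisons.
print(comparisons_for_deduplication(17_321))
```

The quadratic growth is why link discovery frameworks try to rule out pairs without comparing them.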

  • RKB extracts RDF from heterogeneous data sources to populate its knowledge bases with instances according to the AKT ontology.
    Instances of persons, publications and institutions were retrieved from several major metadata websites such as ACM and DBLP.
    RDF-AI contains a series of modules that compute instance matches by comparing their properties.
    RDF-AI has no means of querying distributed data sets via SPARQL. It is also not time-optimized, so mapping with this tool can be very time-consuming.
    LIMES exploits the fact that the edit distance is a distance metric to approximate distances without having to compute them.
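The note on LIMES rests on the triangle inequality: for a reference point e (an "exemplar"), |d(s, e) − d(e, t)| ≤ d(s, t). If that lower bound already exceeds the allowed distance, the pair (s, t) can be discarded without computing d(s, t). The following is a minimal sketch of that idea only, not LIMES's actual implementation; the function names and the single-exemplar setup are assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def match_with_exemplar(sources, targets, exemplar, max_distance):
    """Return (matches, skipped): the pairs within max_distance, and how many
    pairs were ruled out by the triangle inequality alone."""
    to_exemplar_s = {s: edit_distance(s, exemplar) for s in sources}
    to_exemplar_t = {t: edit_distance(t, exemplar) for t in targets}
    matches, skipped = [], 0
    for s in sources:
        for t in targets:
            # Lower bound on d(s, t) derived from the distances to the exemplar.
            lower_bound = abs(to_exemplar_s[s] - to_exemplar_t[t])
            if lower_bound > max_distance:
                skipped += 1  # cannot match, so skip the expensive comparison
                continue
            if edit_distance(s, t) <= max_distance:
                matches.append((s, t))
    return matches, skipped
```

Because the edit distance is a metric, the filter never discards a true match; it only saves comparisons.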

  • 03 interlinking-dass

    1. Linked Data Interlinking Diego Pessoa derp@cin.ufpe.br
    3. World Wide Web is ful
    4. “These data are formatted for human consumption!” And the machine
    5. Linked Data mug
    6. Linked Data Lifecycle
    8. Interlinking - Definition. Interlinking refers to the degree to which entities that represent the same concept are linked to each other. Source: Introduction to Linked Data and Its Lifecycle on the Web (Sören Auer, Jens Lehmann, Axel-Cyrille Ngonga Ngomo, Amrapali Zaveri). “connecting things that are somehow related”
    9. Interlinking - Definition. Metrics: interlinking can be measured using network measures that calculate the interlinking degree, the clustering coefficient, sameAs chains, and centrality and description richness through sameAs links. Example: the airline dataset URI americaairlines.com/country/America is connected by a sameAs link to the spatial dataset URI dbpedia.org/page/United_States.
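Two of the network measures named on this slide, the interlinking degree and the clustering coefficient, can be sketched on a small link graph. This is an illustrative sketch in plain Python: links are treated as undirected edges, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(links):
    """Treat each (subject, object) link as an undirected edge."""
    neighbours = defaultdict(set)
    for subject, obj in links:
        neighbours[subject].add(obj)
        neighbours[obj].add(subject)
    return neighbours


def interlinking_degree(neighbours, node):
    """Number of distinct resources directly linked to `node`."""
    return len(neighbours.get(node, set()))


def clustering_coefficient(neighbours, node):
    """Fraction of pairs of neighbours of `node` that are linked to each other."""
    adjacent = neighbours.get(node, set())
    if len(adjacent) < 2:
        return 0.0
    pairs = list(combinations(adjacent, 2))
    linked = sum(1 for a, b in pairs if b in neighbours[a])
    return linked / len(pairs)
```

A resource with a high degree but a low clustering coefficient acts as a hub whose neighbours are not themselves interlinked.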
    10. Why Interlinking? “Include links to other URIs, so that they can discover more things”: the 4th principle of Linked Data (the most important). The goal of linking is to transform the Web into a platform for data and information integration as well as for search and querying. Linked Data sources contain more than 31 billion triples, yet links constitute less than 5% of these triples.
    11. LINK DISCOVERY. Linking on the Web of Data is a more generic and thus more complex task, as it is not limited to finding equivalent entities in two knowledge bases. Frameworks have been developed to address the lack of links between the different knowledge bases on the Web. Two categories of frameworks: 1) Ontology matching: establishes links between the ontologies underlying two data sources. 2) Instance matching (link discovery): discovers links between the instances contained in two data sources.
    12. LINK DISCOVERY. Formally: given two sets S (source) and T (target) of instances, a (complex) semantic similarity measure σ : S × T → [0, 1] and a threshold θ ∈ [0, 1], the goal of the link discovery task is to compute the set M = {(s, t) ∈ S × T : σ(s, t) ≥ θ}. In general, the similarity function used to carry out a link discovery task is described by a link specification (sometimes called a linkage decision rule).
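The formal definition on this slide translates almost directly into code. The sketch below is a naive illustration that compares every pair. The stand-in for σ is a simple string similarity from Python's standard library, chosen only for the example; real link specifications combine several property comparisons. The names `similarity` and `discover_links` are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(s: str, t: str) -> float:
    """Stand-in for σ: a case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, s.lower(), t.lower()).ratio()


def discover_links(source, target, sigma, theta):
    """Compute M = {(s, t) : σ(s, t) ≥ θ} by comparing every pair."""
    return {(s, t) for s in source for t in target if sigma(s, t) >= theta}


S = ["United States", "Brazil"]
T = ["United States of America", "Brasil", "France"]
M = discover_links(S, T, similarity, theta=0.6)
```

Raising θ makes the resulting link set stricter; lowering it admits more candidate pairs at the cost of precision.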
    13. CHALLENGES. Two key challenges arise when trying to discover links between two sets of instances: 1) the computational complexity of the matching task; 2) the selection of an appropriate link specification.
    14. CHALLENGES. 1) Computational complexity of the matching task. • The time complexity of a matching task can be measured by the number of comparisons necessary to complete it. • Reducing the time complexity of link discovery is a key requirement for instance linking frameworks for Linked Data. Example: discovering duplicate cities in DBpedia would require approximately 0.15 × 10^9 similarity computations.
    15. CHALLENGES. 2) Selection of an appropriate link specification. • The configuration of link discovery frameworks is usually carried out manually, in most cases simply by guessing. • Methods such as supervised and active learning can be used to guide users who need a mapping towards a suitable linking configuration for their matching task.
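As a simple illustration of replacing guesswork with labelled examples, the sketch below picks a threshold θ by checking which candidate value best reproduces a set of links that a user has confirmed. It is a plain grid search scored by F1, not one of the supervised or active learning methods the slides refer to, and all names in it are hypothetical.

```python
def f1_score(predicted: set, reference: set) -> float:
    """Harmonic mean of precision and recall of predicted against reference links."""
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def learn_threshold(scored_pairs, reference_links, candidates):
    """Pick the candidate θ whose accepted pairs best match the labelled links.

    scored_pairs: dict mapping (s, t) -> similarity in [0, 1]
    reference_links: set of (s, t) pairs confirmed as correct by a user
    candidates: iterable of thresholds to try
    """
    best_theta, best_f1 = None, -1.0
    for theta in candidates:
        accepted = {pair for pair, score in scored_pairs.items() if score >= theta}
        quality = f1_score(accepted, reference_links)
        if quality > best_f1:
            best_theta, best_f1 = theta, quality
    return best_theta, best_f1
```

An active learning variant would additionally choose which pairs to ask the user about, typically those whose score lies closest to the current threshold.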
    16. APPROACHES TO LINK DISCOVERY. Current frameworks for link discovery can be subdivided into two main categories. Domain-specific: RKBExplorer (academic purposes), GNAT (music). Universal: RDF-AI (not time-optimized), LIMES (time-optimized), SILK.
    17. ACTIVE LEARNING OF LINK SPECIFICATIONS. The second challenge of link discovery is the time-efficient discovery of link specifications for a particular linking task. Several approaches have been proposed to achieve this goal, most of which rely on genetic programming. The COALA (Correlation-Aware Active Learning) approach was implemented on top of genetic programming.
    18. CONCLUSIONS. • First works on running link discovery in parallel have shown that massively parallel hardware such as GPUs can lead to better results than cloud implementations, even on considerably large datasets. • Automatically selecting the right resources for linking, given a hardware landscape, is still an unsolved problem.
    19. CURRENT CHALLENGES. • Authoring • Extraction from structured sources (relational databases, XML) • Natural Language Queries • Automatic Management of Resources for Linking • Linked Data Visualization • Linked Data Quality/Reliability
    20. Main References. [Book] Linked Data: Structured Data on the Web. David Wood, Marsha Zaidman, Luke Ruth. 2014. [Paper] Linked Data - The Story So Far. Berners-Lee.
    21. Linked Data Interlinking Diego Pessoa derp@cin.ufpe.br Thanks!