2011linked science4mccuskermcguinnessfinal


Published on

Linked Science 2011 talk on the importance of modeling sources and their usage, such as PML's source usage, and how it can be generalized using FRBR

Published in: Technology, Education
1 Like
  • Be the first to comment

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide
  • Challenges: It may not be easy to characterize the piece of information in larger information containers. Information sources are often sources of other pieces of information. Assertion of the piece of information is a point-time event that occurs during the life-time of the information source. Not any assertion event is a valid assertion. Valid assertions occur: During the lifespan of its source(s) In places where the sources are located. Before publishing a result, scientists need to check their facts. We stand on the shoulders of giants, but as we push forward in science, we need to make sure that we aren't standing on a giant house of cards. Knowing how, when and where your data comes from is critical for good science, and it's even more critical for linked science, where it isn't immediately clear where a database record or knowledge assertion came from. Sources of information become critical to evaluate information quality. It is difficult if not impossible to assess the trust of information, or to encode it as knowledge, without having a link between information and their sources. For example, one may want to know if the information came from a source such as the New York Times, and further, it may be useful to know the date, edition, page, and exact text fragment where the information was asserted. There are many challenges in the task of assigning a source to a piece of information. First, it may not be easy to characterize the piece of information in larger information containers (databases, printed documents, web documents, documents that require parameters from a system to be retrieved, etc). Second, the source of a piece of information is often a source of other pieces of information and should be referenced by an identifier and characterized elsewhere. Third, the assertion of the piece of information is a point-time event that occurs during the life-time of the information source. Thus, not any assertion event is a valid assertion: it needs to occur during the lifespan of its source(s) and in places where the sources are located. Those are all critical conditions that need to be properly captured in provenance languages if one wants to make proper use of source information in combination with linked data.
  • Example of other abstractions: Cell Lines vs Cell Line Colonies TODO: Add FRBR Example of PML Primer
  • Functional Requirements for Bibliographic References
  • A more generalized way to express Information and sources. Copy Events and Subset Events are used to describe accession (copy event) and quoting (subset event). The derivation of the quote from the content retrieved from the source is explicit. Quotes can happen at the Expression level (same offset in the text of an HTML as in plain text as in a Word Doc), versus at the Manifestation level. URL is Work, multiple URLs can be identified as the same Work if appropriate. Expressions can be seen to be the same if they have the same content. We’ve gathered up content digest algorithms for RDF graphs, tabular data, XML trees, and images. Others and domain-specific content digests are welcome. Manifestations are the same if the message digest (think MD5 or SHA-1) is the same. Items might be the same if they are actually the same physical copy.
  • Same stack of Work, Expression, Manifestation, Item. Multiple Works are redirected from one to another. The HTTP response (hash:Item/SHA256-DDD) is what the file on disk (filed://EEE/SHA356-FFF/us_economic_assistance.csv) is derived from. Tools in csv2rdf4lod create raw conversions of the spreadsheet to RDF (filed://GGG/SHA356-HHH/us_economic_assistance.csv.raw.ttl) which we then give a signature using a standalone graph hash tool (frbrstack.py), which confirms that even though the manifestations are different (BBB vs. CCC) the Expressions remain the same. Since frbrstack.py doesn’t know the original Work like pcurl.py does, it generates a new one. However, it is reasonable to assume that, in a case like this, where two items that have different manifestations but the same expressions AND there is knowledge that one Item is derived from another, that the Work remains the same. This may not be true for artwork, but is definitely true for information resources.
  • 2011linked science4mccuskermcguinnessfinal

    1. 1. Where did you hear that? Information and the Sources They Come From James P. McCusker, 1 Timothy Lebo, 1 Li Ding, 1 Cynthia Chang, 1 Paulo Pinheiro da Silva, 2 and Deborah L. McGuinness 1 1 Tetherless World Constellation 2 CyberShARE Center, University of Texas at El Paso Linked Science 2011, Bonn, Germany
    2. 2. Background <ul><li>To do Linked Science, you need to know where your data comes from. </li></ul><ul><li>Many studies show people will not use the applications if they do not have access to the information relied on </li></ul><ul><li>Studies done for DARPA (CALO), IARPA (NIMD), NSF (VSTO), AT&T (PROSE & FindUR), … </li></ul>
    3. 3. Selected Dimensions of Provenance <ul><li>Derivational Provenance </li></ul><ul><li>x derived_from y using rule z </li></ul><ul><li>agent initiated and controlled process </li></ul><ul><li>process used x </li></ul><ul><li>y generated_by process </li></ul><ul><li>subprocess triggered_by process </li></ul><ul><li>This work came from our work on provenance languages and their environments, in particular on </li></ul><ul><li>PML: Proof Markup Language </li></ul><ul><li>And </li></ul><ul><li>Inference Web </li></ul><ul><li>And now </li></ul><ul><li>PROV: Provenance Model (W3C, in progress) </li></ul><ul><li>One perspective on Abstractive Provenance: </li></ul><ul><li>Work </li></ul><ul><li>Expression </li></ul><ul><li>Manifestation </li></ul><ul><li>Item </li></ul><ul><li>FRBR: Functional Requirements for Bibliographic References </li></ul><ul><li>PROV: wasComplementOf </li></ul><ul><li>Exploring in another area of Cell Lines vs. Cell Line Colonies </li></ul>
    4. 4. FRBR Stack for PML Primer <ul><li>Work </li></ul><ul><li>The web page URL. </li></ul><ul><li>Expression </li></ul><ul><li>The page content </li></ul><ul><li>Manifestation </li></ul><ul><li>The bytes that were downloaded. </li></ul><ul><li>Item </li></ul><ul><li>The specific physical copy. </li></ul>
    5. 5. Information, Source, SourceUsage in PML <ul><li>PML provides an explicit mechanism for linking Information to its Source. </li></ul><ul><li>Includes: </li></ul><ul><li>Location (URI) </li></ul><ul><li>Date of access </li></ul><ul><li>Extraction method (file offsets, cell contents, etc). </li></ul>
    6. 6. Why FRBR for Information Sources? <ul><li>Work – Source </li></ul><ul><li>Expression – Abstract Information </li></ul><ul><li>Manifestation – Concrete Information </li></ul><ul><li>Item – Specific Copy </li></ul><ul><li>These constructs can be reused for other provenance: </li></ul><ul><ul><li>File copy </li></ul></ul><ul><ul><li>Format conversion </li></ul></ul><ul><ul><li>Reproducibility </li></ul></ul>
    7. 7. Source and Information in FRBR
    8. 8. Discussion <ul><li>Abstractive + derivational provenance is really powerful. </li></ul><ul><ul><li>Ability to identify content regardless of format and across multiple copies of data. </li></ul></ul><ul><ul><li>Reproducibility is verifiable across different file formats and algorithms as long as the Expressions are the same. </li></ul></ul><ul><li>We have found that FRBR generalizes to information resources. </li></ul><ul><ul><li>Could be considered FRIR (Functional Requirements for Information Resources) </li></ul></ul><ul><li>We’ve implemented it in our LOGD converter (csv2rdf4lod). </li></ul><ul><ul><li>Maintain content links (same Expression) even when manually changing the data, like converting Excel to CSV. </li></ul></ul><ul><ul><li>Particularly useful in much of our Linked Open Data work. </li></ul></ul>
    9. 9. Conclusions <ul><li>Any serious provenance model for linked science must provide a mechanism for describing information sources and their usage. </li></ul><ul><ul><li>Achieved with modeling primitives in PML. </li></ul></ul><ul><li>Abstractive + derivational provenance can express nuanced explanations of data access, transformation, and analysis. </li></ul><ul><li>Links from information to source can be modeled using a combination of FRBR and a derivational provenance model. </li></ul><ul><ul><li>Allows for unambiguous descriptions of data and information access and transformation. </li></ul></ul>
    10. 10. Questions? Thanks! Also, come to SemantAqua demo on Tues and talk on Wed aft. 5:30 The Tetherless World Constellation is partially funded by DARPA, U.S. Department of Energy, Fujitsu, LGS, Lockheed Martin, Microsoft Research, NASA, National Ecological Observatory Network (NEON), the National Science Foundation, Qualcomm, and the Woods Hole Oceonographic Institution (WHOI). This research was partially funded by the National Science Foundation under CREST Grant No. HRD-0734825.
    11. 11. Source, Information, and SourceUsage in PML
    12. 12. Implementation: pcurl.py <ul><li>Part of csv2rdf4lod: </li></ul><ul><li>$ pcurl.py --help </li></ul><ul><li>usage: pcurl.py [--help|-h] [--format|-f xml|turtle|n3|nt] [url ...] </li></ul><ul><li>Download a URL and compute Functional Requirements for Bibliographic Resources (FRBR) stacks using cryptograhic digests for the resulting content. </li></ul><ul><li>Refer to http://purl.org/twc/pub/mccusker2012parallel for more information and examples. </li></ul><ul><li>optional arguments: </li></ul><ul><li>url url to compute a FRBR stack for. </li></ul><ul><li>-h, --help Show this help message and exit. </li></ul><ul><li>-f, --format File format for FRBR stacks. One of xml, turtle, n3, or nt. </li></ul>