The document discusses facilitating the discovery of public datasets. It describes Schema.org, a collaborative project to add metadata to content using microdata, RDFa or JSON-LD formats. It also discusses challenges in identifying and relating datasets, as well as properties for describing datasets, such as name, description, URL, version, and spatial/temporal coverage. An example is given of markup for a seismic hazard zones dataset using these properties.
3. Nafiseh.Navabpour@uni-jena.de Facilitating the discovery of public datasets 3 / 13
Schema.org
collaborative project between Google, Yandex, Bing and Yahoo
Content providers use the Schema.org vocabulary with the Microdata, RDFa or JSON-LD
formats to add information inside the content.
4. Mark up the content using microdata
<div itemscope>
<h1>Avatar</h1>
<span>Director: James Cameron (born August 16, 1954) </span>
<span>Science fiction</span>
<a href="../movies/avatar-theatrical-trailer.html">Trailer</a>
</div>
Highlighted: itemscope attribute
5. Mark up the content using microdata
<div itemscope itemtype="http://schema.org/Movie">
<h1>Avatar</h1>
<span>Director: James Cameron (born August 16, 1954)</span>
<span>Science fiction</span>
<a href="../movies/avatar-theatrical-trailer.html">Trailer</a>
</div>
Highlighted: itemscope attribute, itemtype attribute
6. Mark up the content using microdata
<div itemscope itemtype="http://schema.org/Movie">
<h1 itemprop="name">Avatar</h1>
<span>Director: <span itemprop="director">James Cameron</span> (born ...</span>
<span itemprop="genre">Science fiction</span>
<a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">Trailer</a>
</div>
Highlighted: itemscope attribute, itemtype attribute, itemprop attribute
7. Mark up the content using microdata
<div itemscope itemtype="http://schema.org/Movie">
<h1 itemprop="name">Avatar</h1>
<div itemprop="director" itemscope itemtype="http://schema.org/Person">
Director: <span itemprop="name">James Cameron</span>
</div>
<span itemprop="genre">Science fiction</span>
<a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">Trailer</a>
</div>
Highlighted: itemscope attribute, itemtype attribute, itemprop attribute, embedded items
8. Technical / Social / Research Challenges
• Defining more consistently what constitutes a dataset
• Identifying datasets
• Relating datasets to each other
• Propagating metadata between related datasets
• Describing content of datasets
• Many datasets are described in an unstructured way
9. What qualifies as a dataset?
• A table or a CSV file with some data
• A file in a proprietary format that contains data
• A collection of files that together constitute some meaningful dataset
• A structured object with data in some other format that you might want to
load into a special tool for processing
• Images capturing the data
Anything that looks like a dataset to you
10. Basic dataset properties
itemtype="http://schema.org/Dataset"
• Name
• Description
• URL(s)
• Version number
• Keywords
• Variable Measured
• Creator name (person, organization)
11. More properties
• Data catalog properties
• Download information properties
• Temporal coverage
• Spatial coverage
• Points
• Coordinates
• Named locations
• Citations and publications
• Provenance and license information
12. Example
<div itemscope="itemscope" itemtype="http://schema.org/Dataset">
<meta itemprop="url" content="http://www.example.org/story.php?title=seismic-hazard-zones"/>
<span itemprop="name">
<a href="http://www.example.org/story.php?title=seismic-hazard-zones">
<b>Seismic Hazard Zones</b>
</a>
</span>
<span itemprop="temporal">2011</span>, version "<span itemprop="version">2011-Sep-13</span>"
<div itemprop="description">This is a dataset of liquefaction and landslide zones in the state of
California.</div>
<div itemprop="spatial" itemscope="itemscope" itemtype="http://schema.org/Country"
itemid="http://dbpedia.org/resource/United_States">
<i>Country:</i>
<a href="http://en.wikipedia.org/wiki/United_States">
<span itemprop="name">United States</span>
</a>
</div>
...
This talk is about “facilitating the discovery of public datasets”.
Because Birgitta would like to discuss this topic here in this group, I prepared this presentation.
As you know, there is a huge number of data repositories in many different fields and with many different purposes.
When we are looking for something, it can be extremely difficult to determine:
- Where is the dataset that has the information that I am looking for?
- Where is the origin of this information?
- Is this information reliable?
One idea for getting the best results is to use the same vocabulary across different websites.
Schema.org is a vocabulary agreed on by Google, Yandex, Bing and Yahoo
that helps search engines understand the meaning and context of a web page.
In this scenario, the markup describes what type of thing a piece of content is.
But how?
Content providers use the Schema.org vocabulary with the Microdata, RDFa or JSON-LD formats to add information inside the content.
Here I have an example of marking up content, in this case in the Microdata format.
First, the content provider adds the itemscope attribute to the HTML tag that encloses information about a particular item.
But that is not enough. The content provider should also specify what kind of item it is.
For example, here the item is the movie Avatar.
The itemtype attribute comes immediately after itemscope,
and its value is a URL
defined in the Schema.org type hierarchy.
Then the properties of the item are defined by adding the itemprop attribute.
For example, to identify the director or genre.
Sometimes the value of an item property can itself be another item with its own set of properties.
For example, we can specify that the director of the movie is an item of type Person, and that the Person in turn has properties such as name.
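To make the role of itemscope, itemtype and itemprop more concrete, here is a minimal Python sketch of how a consumer of the markup might read it. It is only an illustration built on the standard library's html.parser, not how any search engine actually works. The names MicrodataParser, extract_items and MOVIE_HTML are made up for this example, and the sketch assumes well-formed HTML and keeps only the last value when a property is repeated.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"meta", "link", "img", "br", "hr", "input"}


class MicrodataParser(HTMLParser):
    """Collects microdata items into nested dicts (simplified sketch)."""

    def __init__(self):
        super().__init__()
        self.items = []  # top-level items found in the page
        self.stack = []  # one entry per currently open element

    def _current_item(self):
        # The nearest enclosing element that carries itemscope.
        for entry in reversed(self.stack):
            if entry["item"] is not None:
                return entry["item"]
        return None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        current = self._current_item()
        prop = attrs.get("itemprop")
        entry = {"item": None, "prop": None, "owner": None, "text": []}
        if "itemscope" in attrs:
            # A new item; if it also has itemprop, it is an embedded item.
            item = {"type": attrs.get("itemtype"), "properties": {}}
            entry["item"] = item
            if prop and current is not None:
                current["properties"][prop] = item
            else:
                self.items.append(item)
        elif prop and current is not None:
            if tag == "a":
                current["properties"][prop] = attrs.get("href")
            elif tag == "meta":
                current["properties"][prop] = attrs.get("content")
            else:
                # The value is the text content, known once the tag closes.
                entry["prop"] = prop
                entry["owner"] = current
        if tag not in VOID_TAGS:
            self.stack.append(entry)

    def handle_data(self, data):
        for entry in self.stack:
            if entry["prop"]:
                entry["text"].append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        entry = self.stack.pop()
        if entry["prop"]:
            value = "".join(entry["text"]).strip()
            entry["owner"]["properties"][entry["prop"]] = value


def extract_items(html):
    parser = MicrodataParser()
    parser.feed(html)
    return parser.items


# The movie markup from the slides, with the embedded Person item.
MOVIE_HTML = """
<div itemscope itemtype="http://schema.org/Movie">
<h1 itemprop="name">Avatar</h1>
<div itemprop="director" itemscope itemtype="http://schema.org/Person">
Director: <span itemprop="name">James Cameron</span>
</div>
<span itemprop="genre">Science fiction</span>
<a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">Trailer</a>
</div>
"""
```

Calling extract_items(MOVIE_HTML) gives one item of type http://schema.org/Movie whose director property is itself an item of type http://schema.org/Person. That nesting is exactly what the embedded item in the markup expresses.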
OK, now we know that searching for an item (such as a movie or a book) becomes easy with the help of this kind of markup.
But how about searching data in a scientific dataset?
We are able to use the same method, but there are many technical, social and research challenges.
- For example, we first have to define what a dataset is.
- Working with related datasets is also not easy.
- We need to know how to propagate the metadata between related datasets.
- How do we describe the content of a dataset?
- What about datasets that are described in an unstructured way?
It is important to know that a Dataset is anything that looks like a dataset to you.
The markup method is the same as the one we have seen before.
The item type is this URL: http://schema.org/Dataset.
For each dataset we need to specify some basic properties like the name…
We can also describe a dataset more explicitly by specifying some further properties…
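The same dataset description can also be written in JSON-LD, one of the three formats mentioned earlier. The following is a small Python sketch that builds such a description for the seismic hazard zones example and wraps it in a script tag. The function names dataset_jsonld and as_script_tag and the constant SEISMIC are invented for this illustration. The property names temporalCoverage and spatialCoverage are the ones I would assume from the Schema.org Dataset type (the slide markup uses the older temporal and spatial), so please check them against the current vocabulary before using this.

```python
import json


def dataset_jsonld(name, description, url, version, temporal_coverage, country):
    """Build a JSON-LD description of a schema.org Dataset as a plain dict."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "version": version,
        "temporalCoverage": temporal_coverage,
        # Spatial coverage given as a named location, as in the slide example.
        "spatialCoverage": {"@type": "Country", "name": country},
    }


def as_script_tag(doc):
    """Serialize the description so it can be embedded in an HTML page."""
    # Escape "</" so a string value can never close the script element early.
    body = json.dumps(doc, indent=2).replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Values taken from the seismic hazard zones example on the slide.
SEISMIC = dataset_jsonld(
    name="Seismic Hazard Zones",
    description=(
        "This is a dataset of liquefaction and landslide zones "
        "in the state of California."
    ),
    url="http://www.example.org/story.php?title=seismic-hazard-zones",
    version="2011-Sep-13",
    temporal_coverage="2011",
    country="United States",
)
```

Printing as_script_tag(SEISMIC) gives a block that could be pasted into the page describing the dataset. Unlike Microdata, this keeps the metadata in one place instead of spreading it over the visible HTML.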
I think we could also use this kind of markup on our own websites, not only for simple pages such as visitor information or information about people, but also to describe many of our datasets: for example, publications, talks and so on.