1. 2012 INTERNATIONAL ASIAN SUMMER SCHOOL IN LINKED DATA
IASLOD 2012, August 13-17, 2012, KAIST, Daejeon, Korea
Identity and schema for Linked Data
Hideaki Takeda
National Institute of Informatics
takeda@nii.ac.jp
Hideaki Takeda / National Institute of Informatics
2. How to put data into a computer?
• How to describe the data?
– The way to describe individual data
• Schema/Class/Concept
– The way to describe relationships among
schemas/classes/concepts
• Ontology/Taxonomy/Thesaurus
• How to refer to the data?
– The way to identify individual data
• Identifier
– Relationships among identifiers
3. Architecture for the Semantic Web
The world of classes (Ontologies)
The world of instances
(Linked Data)
Tim Berners-Lee http://www.w3.org/2002/Talks/09-lcs-sweb-tbl/
4. Layers of Semantic Web
• Ontology
– Descriptions on classes
– RDFS, OWL
– Challenges for ontology building
• Ontology building is difficult by nature
– Consistency, comprehensiveness, logicality
• Alignment of ontologies is more difficult
Descriptions on classes
Ontology
Descriptions on instances
Linked Data
Tim Berners-Lee http://www.w3.org/2002/Talks/09-lcs-sweb-tbl/
5. Layers of Semantic Web
• Linked Data
– Descriptions on instances (individuals)
– RDF + (RDFS, OWL)
– Pros for Linked Data
• Easy to write (mainly fact description)
• Easy to link (fact to fact link)
– Cons for Linked Data
• Difficult to describe complex structures
• Class descriptions are still needed (-> ontology)
Descriptions on classes
Ontology
Description on instances
Linked Data
Tim Berners-Lee http://www.w3.org/2002/Talks/09-lcs-sweb-tbl/
6. Importance of Identifiers for Entities
• Everything should be identifiable!
• Humans can identify things with vague
identifiers, or even without identifiers, with
help from the context around the things
• On the web, the context is usually not
available and the computer can seldom
understand the context even if it exists
• So we need identifiers for all things
7. Identification System
• Identification is one of the primary functions for
human information processing
– Naming: e.g., names for people, pets, and some daily
things
• OK if the number of things is not so big
– Systematic Identification
• e.g., phone number, post-code, passport number, product number,
ISBN
• If the number of things is big enough
• Requirements for Systematic Identification
– Identifier is stable and sustainable
– Uniqueness is guaranteed
– Identifier publisher is reliable and sustainable
8. Identification system for Web
• Not so different from conventional identification systems
• Difference
– Cross-system use
– Truly digitized
• Requirements for Systematic Identification for web
– Identifier is stable and sustainable (even after an entity may
disappear)
– Uniqueness is guaranteed over all systems
– A description should be associated with each identifier
• since entities may not be accessible
– Identifier publisher is reliable and sustainable
9. Solutions for the Requirements by LOD
• Requirements for Systematic Identification for
web
– 1. Identifier is stable and sustainable (even after an
entity may disappear)
• (up to each identifier publisher)
– 2. Uniqueness is guaranteed over all systems
• URI (not URN)
– 3. A description should be associated with each identifier
• Dereferenceable URI
– If the URI is accessed, a description associated with it should be
returned
– 4. Identifier publisher is reliable and sustainable
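As a minimal sketch of what "dereferenceable" means in practice, the Python snippet below builds (but does not send) an HTTP request that asks a URI for an RDF description. Using the Accept header for content negotiation is a common Linked Data convention and is an assumption here, not something stated on the slide; the DBpedia URI is only an illustrative identifier.

```python
from urllib.request import Request

def build_dereference_request(uri: str) -> Request:
    """Build an HTTP GET request that asks for an RDF description of `uri`.

    The Accept header values are an assumption: they express a preference
    for Turtle, then RDF/XML, then anything else the server can offer.
    """
    accept = "text/turtle, application/rdf+xml;q=0.9, */*;q=0.1"
    return Request(uri, headers={"Accept": accept})

# Illustrative identifier; any HTTP URI used as an identifier would do.
req = build_dereference_request("http://dbpedia.org/resource/Daejeon")
print(req.get_method(), req.full_url)
print("Accept:", req.get_header("Accept"))
# To actually dereference the URI, pass `req` to urllib.request.urlopen().
```

The request is kept separate from the network call so that the sketch can be inspected without any server being reachable.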
10. Some examples
ISBN(International Standard Book Number)
• Abstract
– a unique numeric commercial book identifier
– 13 digits
• Prefix: 978 or 979 (for compatibility with EAN code)
• Group(language-sharing country group): 1 to 5 digits
• Publisher code:
• Item number:
• Check num: 1 digit
– Management: two layers
• National ISBN Agency – Publisher
• Requirement Satisfaction
– 1. (Stable ID) Maybe (versioning often matters, and publishers sometimes
re-use ISBNs)
– 2. (Unique ID) Uniqueness is guaranteed, but an ISBN is not a URI
– 3. (Dereferenceable) No mechanism (Amazon does it instead!)
– 4. (Reliable publisher) Yes
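The one-digit check number of a 13-digit ISBN can be computed from the first 12 digits. The sketch below uses the standard EAN-13 rule (weights alternating 1 and 3, check digit chosen so the weighted sum is a multiple of 10); the function names are made up for this illustration.

```python
def isbn13_check_digit(first12: str) -> int:
    """Return the check digit for the first 12 digits of an ISBN-13."""
    digits = [int(c) for c in first12 if c.isdigit()]
    if len(digits) != 12:
        raise ValueError("expected exactly 12 digits before the check digit")
    # Weights alternate 1, 3, 1, 3, ... starting from the leftmost digit.
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return (10 - total % 10) % 10

def is_valid_isbn13(isbn: str) -> bool:
    """Check a 13-digit ISBN; hyphens and spaces are ignored."""
    digits = [c for c in isbn if c.isdigit()]
    if len(digits) != 13:
        return False
    return isbn13_check_digit("".join(digits[:12])) == int(digits[12])

print(isbn13_check_digit("978030640615"))    # 7
print(is_valid_isbn13("978-0-306-40615-7"))  # True
```

A check digit catches typing errors, but it does nothing for requirements 1 to 3: a syntactically valid ISBN may still be re-used or unresolvable.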
11. Some examples
DOI (Digital Object Identifier)
• Abstract
– An identifier for scientific digital objects (mostly scientific articles)
– A string of variable length: "prefix/suffix"
• Prefix: assigned to publishers
• Suffix: assigned to each object
– Management: three layers
• IDF (International DOI Foundation) – Registration Agency – Publisher
• Requirement Satisfaction
– 1. (Stable ID) Yes (not re-usable)
– 2. (Unique ID) Uniqueness is guaranteed and URI accessible
(http://dx.doi.org/"DOI")
– 3. (Dereferenceable) Mapping to object pages but no RDF
– 4. (Reliable publisher) Maybe
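The "prefix/suffix" structure and the resolver URI pattern above can be sketched in a few lines of Python. The helper names are hypothetical, the example DOI is used purely as an illustration, and the sketch does not percent-encode unusual characters in the suffix.

```python
from typing import NamedTuple

RESOLVER = "http://dx.doi.org/"  # resolver base as given on the slide

class Doi(NamedTuple):
    prefix: str  # assigned to the publisher
    suffix: str  # assigned to the individual object

def parse_doi(doi: str) -> Doi:
    """Split a DOI at the first '/'; the suffix may itself contain '/'."""
    prefix, sep, suffix = doi.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"not a prefix/suffix DOI: {doi!r}")
    return Doi(prefix, suffix)

def doi_to_uri(doi: str) -> str:
    """Turn a DOI into an HTTP URI by prepending the resolver base."""
    parsed = parse_doi(doi)
    return f"{RESOLVER}{parsed.prefix}/{parsed.suffix}"

print(parse_doi("10.1000/182"))
print(doi_to_uri("10.1000/182"))  # http://dx.doi.org/10.1000/182
```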
12. Some examples
DBpedia (as Identifier)
• Abstract
– A Wikipedia page
– Name of the Wikipedia page
• Maintained manually
– Disambiguation page
– Redirect page
• Requirement Satisfaction
– 1. (Stable ID) Maybe (pages sometimes disappear, sometimes change names,
and sometimes change contents)
– 2. (Unique ID) Uniqueness is mostly guaranteed and URI accessible
– 3. (Dereferenceable) RDF
– 4. (Reliable publisher) Maybe
13. Identification of relationships between
identifiers
• Co-existence of multiple identification systems in a
field
– Difference of coverage
– Difference of viewpoint
• An entity can have multiple identifiers
• Need for mapping between identifiers in different
identification systems
• Method: use special properties
– owl:sameAs, (rdfs:seeAlso, skos:exactMatch)
– http://sameas.org
• Some problems
– Logical inconsistency with owl:sameAs
– Maintenance
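Because owl:sameAs is symmetric and transitive, a set of pairwise links collapses into equivalence clusters. The sketch below groups identifiers with a small union-find structure; the URIs are hypothetical. It also hints at the inconsistency problem: one wrong link merges two clusters entirely.

```python
from collections import defaultdict

def same_as_clusters(links):
    """Group identifiers connected by owl:sameAs links into clusters."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)
    return sorted(sorted(c) for c in clusters.values())

# Hypothetical identifiers from three identification systems.
links = [
    ("http://a.example/Daejeon", "http://b.example/city/42"),
    ("http://b.example/city/42", "http://c.example/Taejon"),
    ("http://a.example/Seoul", "http://b.example/city/1"),
]
for cluster in same_as_clusters(links):
    print(cluster)
```

Adding a single mistaken pair such as ("http://c.example/Taejon", "http://a.example/Seoul") would merge both cities into one cluster, which is why maintenance of such mappings matters.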
15. Summary for ID
• Identification is a crucial part of LOD
– Data availability
– Data inconsistency
– Data interoperability
• Establishment of a good identification system
leads to reliable and sustainable LOD.
16. Structuring Information
• A wide range of structuring information
– Keywords, tags
• A freely chosen word or phrase just indicating some features
– Controlled vocabulary
• Mapping to the fixed set of words or phrases
• e.g., the list of countries, the name authorities
– Classification
• System for classifying entities. Often hierarchical. Class may not carry meaning.
– Taxonomy
• Hierarchical term system for classification. Upper/lower relation usually means
general/specific relation
• e.g., the subject headings of LC
– Thesaurus
• System for semantics. More types of relations: (hypernym, hyponym),
synonym, antonym, homonym, holonym, meronym
– Ontology
• System of concepts. Concepts rather than words. A greater variety of relations, plus the
definitions of concepts
17. Examples in Library Science
• Many systems in the library community
• Classification
– Universal Decimal Classification (UDC)
• Controlled Vocabulary
– the authority files for person names, organizations, location names
• Library of Congress: 8 million records, MADS & SKOS
• British Library: 2.6 million records, FOAF & BIO (a vocabulary for
biographical information)
• National Diet Library (Japan): 1 million records, FOAF
• Deutsche Nationalbibliothek (DNB, Germany): 1.8 & 1.3 million records
(names & organizations)
• Virtual International Authority File (VIAF): 4 million records
• Taxonomy
– Subject Heading: LC, NDL
• Library of Congress: MADS & SKOS
• British Library:
• National Diet Library (Japan): 0.1 million records, SKOS
• Deutsche Nationalbibliothek (DNB, Germany): 0.16 million records
20. UDC as Linked Data
• 69,000 records
• 40 languages
• http://udcdata.info/
UDC ELEMENT | DEFINITION | SKOS TERM | UDC SUBPROPERTY
UDC number (notation) | UDC notation is a combination of symbols (numerals, signs and letters) that represent a class, its position in the hierarchy and its relation to other classes. Notation is a language-independent indexing term that enables mechanical sorting and filing of subjects. Also called 'UDC number' and 'UDC classmark' | skos:notation | ---
class identifier (URI) | A unique identifier assigned to each UDC class. It identifies the relationship between a class' meaning and its notational representation | skos:Concept | ---
broader class (URI) | Superordinate class: the class hierarchically above the class in question | skos:broader | ---
caption | Verbal description of the class content | skos:prefLabel | ---
including note | Extension of the caption containing verbal examples of the class content (usually a selection of important terms that do not appear in the subdivision) | skos:note | udc:includingNote
application note | Instructions for number building, further extension and specification of the class | skos:note | udc:applicationNote
scope note | Note explaining the extent and the meaning of a UDC class. Used to resolve disambiguation or to distinguish this class from other similar classes | skos:scopeNote | ---
examples | Examples of combination are used to illustrate UDC class building i.e. complex subject statements | skos:example | ---
see also reference | Indication of conceptual relationship between UDC classes from different hierarchies | skos:related | ---
<skos:Concept rdf:about="http://udcdata.info/025553">
  <skos:inScheme rdf:resource="http://udcdata.info/udc-schema"/>
  <skos:broader rdf:resource="http://udcdata.info/025461"/>
  <skos:notation rdf:datatype="http://udcdata.info/UDCnotation">510.6</skos:notation>
  <skos:prefLabel xml:lang="en">Mathematical logic</skos:prefLabel>
  <skos:prefLabel xml:lang="ja">記号論理学</skos:prefLabel>
  <skos:related rdf:resource="http://udcdata.info/000016"/>
</skos:Concept>
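To show how such a SKOS record might be consumed, the sketch below parses the concept for notation 510.6 with Python's standard XML library. The enclosing rdf:RDF element and its namespace declarations are added here as an assumption so the fragment parses on its own; a real application would more likely use an RDF library than raw XML parsing.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"
XML = "http://www.w3.org/XML/1998/namespace"

# The slide's record, wrapped in an rdf:RDF element (wrapper is an assumption).
RECORD = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:skos="{SKOS}">
  <skos:Concept rdf:about="http://udcdata.info/025553">
    <skos:inScheme rdf:resource="http://udcdata.info/udc-schema"/>
    <skos:broader rdf:resource="http://udcdata.info/025461"/>
    <skos:notation rdf:datatype="http://udcdata.info/UDCnotation">510.6</skos:notation>
    <skos:prefLabel xml:lang="en">Mathematical logic</skos:prefLabel>
    <skos:prefLabel xml:lang="ja">記号論理学</skos:prefLabel>
    <skos:related rdf:resource="http://udcdata.info/000016"/>
  </skos:Concept>
</rdf:RDF>"""

def read_concept(rdf_xml: str) -> dict:
    """Extract URI, notation, broader class and labels of the first concept."""
    root = ET.fromstring(rdf_xml)
    concept = root.find(f"{{{SKOS}}}Concept")
    labels = {
        el.get(f"{{{XML}}}lang"): el.text
        for el in concept.findall(f"{{{SKOS}}}prefLabel")
    }
    return {
        "uri": concept.get(f"{{{RDF}}}about"),
        "notation": concept.findtext(f"{{{SKOS}}}notation"),
        "broader": concept.find(f"{{{SKOS}}}broader").get(f"{{{RDF}}}resource"),
        "labels": labels,
    }

print(read_concept(RECORD))
```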
24. Some examples
Scientific Names for Species and Taxa
• Abstract
– Names for biological species and other taxa (kingdom, division, class,
order, family, tribe, genus)
– A string
• Binomial name for species
• Academic societies maintain taxon names individually
– E.g., Papilio xuthus (Asian Swallowtail, ナミアゲハ, 호랑나비)
• Requirement Satisfaction
– 1. (Stable ID) Mostly yes (names sometimes disappear, change, or change contents)
– 2. (Unique ID) Uniqueness is generally guaranteed but, precisely speaking, there is some ambiguity
because of changes.
– 3. (Dereferenceable) No. Many systems exist but none covers all species
– 4. (Reliable publisher) Maybe
26. Ontology
An ontology is an explicit specification of a
conceptualization [Gruber]
An ontology is an explicit specification of a conceptualization. The
term is borrowed from philosophy, where an Ontology is a systematic
account of Existence. For AI systems, what "exists" is that which can
be represented. When the knowledge of a domain is represented in a
declarative formalism, the set of objects that can be represented is
called the universe of discourse. This set of objects, and the
describable relationships among them, are reflected in the
representational vocabulary with which a knowledge-based program
represents knowledge. Thus, in the context of AI, we can describe the
ontology of a program by defining a set of representational terms. In
such an ontology, definitions associate the names of entities in the
universe of discourse (e.g., classes, relations, functions, or other
objects) with human-readable text describing what the names mean,
and formal axioms that constrain the interpretation and well-formed
use of these terms. Formally, an ontology is the statement of a logical
theory.
27. Conceptualization
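[Figure: the same scene (red, blue and yellow boxes on a desk) conceptualized in three ways]
• Classes object, box, red box, blue box, yellow box; relations on_desk(A), on(A, B), put(A, B)
• Classes object, box with color:{red, blue, yellow}; relations on_desk(A), on(A, B), put(A, B)
• Classes object, box with color:{red, blue, yellow}, desk; relations on(A/box, B/object), put(A/box, B/object)
There are many possible ways to conceptualize the target world
Trade-off between generality and efficiency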
28. Types of Ontologies
• Upper (top-level) ontology vs. Domain ontology
– Upper Ontology: A common ontology throughout all domains
– Domain Ontology: An ontology which is meaningful in a specific
domain
• Object ontology vs. Task ontology
– Object Ontology: An ontology on “things” and “events”
– Task Ontology: An ontology on “doing”
• Heavy-weight ontology vs. light-weight ontology
– Heavy-weight ontology: fully described ontology including
concept definitions and relations, in particular in a logical way
– Light-weight ontology: partially described ontology including
typically only is-a relations
29. Top-level ontology
• Ontology which covers all of the world!
• Very difficult
– e.g., how does a thing exist?
• Is a thing a four-dimensional existence?
• Does a thing exist three-dimensionally over time?
• Common requirements
– A small number of concepts can cover the world
– Concepts can be used in lower ontologies
– Concepts should be general and abstract
30. Top-level ontology
• Three approaches
– Formal approach
• Logical formalization
• Fully Abstract
• Pros: clean
• Cons: hardly understandable
• e.g., Sowa’s top-level ontology, DOLCE
– Linguistic approach
• Use and extension of linguistic concepts
• Partially abstract and partially general
• Pros: understandable
• Cons: limitation to the linguistic world
• e.g., Penman Upper Model, WordNet
– Empirical Approach
• Use and extension of everyday concepts
• Mostly general
• Pros: understandable and applicable to all the world
• Cons: lack of solid foundation
• e.g. SUMO, Cyc, EDR
31. Empirical top-level ontology
• SUMO (Suggested Upper Merged Ontology)
– Collection and organization of concepts used frequently
– Simple relationships between concepts
[Figure: part of the SUMO class hierarchy under Entity. Labels shown: Physical, Abstract, Object, SelfConnectedObject, Substance, CorpuscularObject, Organic, Inorganic, Collection, Process, NaturalProcess, Biological Process, Physiologic Process, Pathologic Process, Intentionally Caused Process, Searching, Communication, Social Interaction, Cooperation, Contest, Meeting, ChangeOfPossession, Transfer, Motion, Impelling, Putting, Impacting, Removing, BringingTogether, Transportation, ChangeOfState, Separating]
32. Formal Ontology: DOLCE
• DOLCE(a Descriptive Ontology for Linguistic
and Cognitive Engineering)
– Intended as a reference system for top-level
ontologies
– Logical definition
– Particular (DOLCE) vs. Universal
• Particular: ontology about things, phenomena, quality…
• Universal: ontology for describing particulars, such as
categories and attributes
33. Formal Ontology: DOLCE
• Concepts
– Endurant / Perdurant / Quality / Abstract
• Endurant:
– "Things"
– An existence over time
– May change its attributes
• Perdurant
– "process"
– No change over time
– May switch a part to the other
• Relations
– Parthood (abstract or perdurant)
– Temporary parthood (endurant)
– Constitution (endurant or perdurant)
– Participation between perdurant and endurant
[Figure: DOLCE taxonomy of categories. Labels shown: ALL Entity; ED Endurant; PED Physical Endurant; M Amount of Matter; F Feature; POB Physical Object; APO Agentive Physical Object; NAPO Non-agentive Physical Object; NPED Non-physical Endurant; NPOB Non-physical Object; MOB Mental Object; SOB Social Object; AS Arbitrary Sum; PD Perdurant (Occurrence); EV Event; ACH Achievement; ACC Accomplishment; STV Stative; ST State; PRO Process; Q Quality; TQ Temporal Quality; TL Temporal Location; PQ Physical Quality; SL Spatial Location; AQ Abstract Quality; AB Abstract; Fact; Set; R Region; TR Temporal Region; T Time Interval; PR Physical Region; S Space Region; AR Abstract Region]
34. Linguistic top-level ontology
• WordNet
– A lexical reference system
• “Link-based electronic dictionary”
http://www.cogsci.princeton.edu/cgi-bin/webwn
– Concepts
• synset
– Noun 79,689
– Verb 13,508
– Relations
• synonym
• hypernym/hyponym (is-a)
• holonym/meronym (a-part-of)
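The three relation types listed above can be pictured as links between synsets. The following toy sketch is not WordNet data and does not use any WordNet API; the synset identifiers and links are hypothetical and chosen only to illustrate synonymy, is-a and part-of lookups.

```python
# Toy lexical network: each synset is a set of synonymous words.
SYNSETS = {
    "car.n": {"car", "automobile", "motorcar"},
    "vehicle.n": {"vehicle"},
    "artifact.n": {"artifact"},
    "wheel.n": {"wheel"},
}
HYPERNYM = {"car.n": "vehicle.n", "vehicle.n": "artifact.n"}  # is-a links
MERONYMS = {"car.n": {"wheel.n"}}  # whole -> parts (a-part-of, inverted)

def synonyms(word):
    """Words that share at least one synset with `word`."""
    shared = [s for s in SYNSETS.values() if word in s]
    return set().union(*shared) - {word}

def hypernym_chain(synset_id):
    """Follow is-a links upward until no hypernym is recorded."""
    chain = []
    while synset_id in HYPERNYM:
        synset_id = HYPERNYM[synset_id]
        chain.append(synset_id)
    return chain

def holonyms(part_id):
    """Synsets that have `part_id` as a part."""
    return {whole for whole, parts in MERONYMS.items() if part_id in parts}

print(synonyms("car"))
print(hypernym_chain("car.n"))
print(holonyms("wheel.n"))
```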
35. Linguistic top-level ontology
• WordNet
– Top-level
• { entity, physical thing (that which is perceived or known or inferred to
have its own physical existence (living or nonliving)) }
• { psychological_feature, (a feature of the mental life of a living organism) }
• { abstraction, (a general concept formed by extracting common features
from specific examples) }
• { state, (the way something is with respect to its main attributes; "the
current state of knowledge"; "his state of health"; "in a weak financial
state") }
• { event, (something that happens at a given place and time) }
• { act, human_action, human_activity, (something that people do or cause
to happen) }
• { group, grouping, (any number of entities (members) considered as a
unit) }
• { possession, (anything owned or possessed) }
• { phenomenon, (any state or process known through the senses rather
than by intuition or reasoning) }
36. Summary for structuring information
• Keywords, tags/Controlled vocabulary
/Classification/Taxonomy /Thesaurus/Ontology
– The differences are neither clear nor important
– The trend is toward more structured ones
– The same requirements as for identification systems
37. Summary
• Requirements for Successful Structuring
Systems
– 1. Entity is stable and sustainable
– 2. Uniqueness is guaranteed over all systems
– 3. A description should be associated with each entity
– 4. System publisher is reliable and sustainable
• LOD Tech. can help
• Learn from success in the library community
38. Schema/Vocabulary for LOD
• Class/Concept description
– Axiom of a concept in ontology
– Database schema for a table in Relational database
– Object definition in Object-Oriented Programming/DB
• Class description in Semantic Web
– RDFS/OWL description for a class
• RDFS: Simple class system
• OWL: Description Logic-based
• Class description in Linked Data
– Mostly RDFS-based (exception: owl:sameAs)
– Simple Structure (mostly property-value pair)
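As a rough illustration of what an RDFS-level "simple class system" provides, the sketch below applies the rdfs:subClassOf inheritance rule to a handful of triples: an instance typed with a class is also an instance of every superclass. The `ex:` names are hypothetical, and the triples are plain Python tuples rather than the output of an RDF library.

```python
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

def superclasses(cls, sub_of):
    """All classes reachable from `cls` through rdfs:subClassOf links."""
    seen, stack = set(), [cls]
    while stack:
        for parent in sub_of.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def infer_types(triples):
    """Return asserted plus inferred rdf:type triples."""
    sub_of = {}
    for s, p, o in triples:
        if p == SUBCLASS:
            sub_of.setdefault(s, set()).add(o)
    inferred = set()
    for s, p, o in triples:
        if p == TYPE:
            inferred.add((s, TYPE, o))
            for sup in superclasses(o, sub_of):
                inferred.add((s, TYPE, sup))
    return inferred

# Hypothetical schema and data, written as property-value pairs per subject.
triples = [
    ("ex:Article", SUBCLASS, "ex:Document"),
    ("ex:Document", SUBCLASS, "ex:Work"),
    ("ex:paper1", TYPE, "ex:Article"),
]
for triple in sorted(infer_types(triples)):
    print(triple)
```

This is about as much reasoning as most Linked Data relies on; OWL adds description-logic constructs on top of it.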
39. Schema/Vocabulary for LOD
• The importance of sharing schema
– Interoperability
– Generic applications
• Some famous and frequently used schemata
– Dublin Core
– FOAF (Friend-Of-A-Friend)
– SKOS (Simple Knowledge Organization System)
40. Usage of Common Vocabularies
Prefix Namespace Used by
dc http://purl.org/dc/elements/1.1/ 66 (31.88 %)
foaf http://xmlns.com/foaf/0.1/ 55 (26.57 %)
dcterms http://purl.org/dc/terms/ 38 (18.36 %)
skos http://www.w3.org/2004/02/skos/core# 29 (14.01 %)
akt http://www.aktors.org/ontology/portal# 17 (8.21 %)
geo http://www.w3.org/2003/01/geo/wgs84_pos# 14 (6.76 %)
mo http://purl.org/ontology/mo/ 13 (6.28 %)
bibo http://purl.org/ontology/bibo/ 8 (3.86 %)
vcard http://www.w3.org/2006/vcard/ns# 6 (2.90 %)
frbr http://purl.org/vocab/frbr/core# 5 (2.42 %)
sioc http://rdfs.org/sioc/ns# 4 (1.93 %)
LDOW2011 Presentation, Christian Bizer (Freie Universität Berlin), 2011
41. (Simple) Dublin Core
• Started from the library community
• Now maintained by DCMI (Dublin Core Metadata Initiative)
• (Simple) Dublin Core
– Just 15 elements
– Simple is best
– No range restriction
– http://purl.org/dc/elements/1.1/
• 15 elements
– Title
– Creator
– Subject
– Description
– Publisher
– Contributor
– Date
– Type
– Format
– Identifier
– Source
– Language
– Relation
– Coverage
– Rights
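Since Simple Dublin Core places no range restriction on its 15 elements, a record is just a set of element-value pairs attached to a resource. The sketch below serializes such a record as N-Triples with plain string literals; the subject URI is hypothetical, and the function names are made up for this illustration.

```python
DC = "http://purl.org/dc/elements/1.1/"
ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

def escape_literal(text: str) -> str:
    """Escape backslashes, quotes and newlines for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

def to_ntriples(subject_uri: str, record: dict) -> str:
    """Serialize a Simple Dublin Core record as N-Triples lines."""
    lines = []
    for element, value in record.items():
        if element not in ELEMENTS:
            raise KeyError(f"not a Simple Dublin Core element: {element}")
        lines.append(f'<{subject_uri}> <{DC}{element}> "{escape_literal(value)}" .')
    return "\n".join(lines)

# Describing this slide deck; the subject URI is a placeholder.
record = {
    "title": "Identity and schema for Linked Data",
    "creator": "Hideaki Takeda",
    "publisher": "National Institute of Informatics",
}
print(to_ntriples("http://example.org/slides/identity-and-schema", record))
```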
42. dc terms
• Qualified Dublin Core
– Domain & Range
– More precise terms
• Extension of simple dc
Properties in the /terms/ namespace: abstract, accessRights, accrualMethod, accrualPeriodicity, accrualPolicy, alternative, audience, available, bibliographicCitation, conformsTo, contributor, coverage, created, creator, date, dateAccepted, dateCopyrighted, dateSubmitted, description, educationLevel, extent, format, hasFormat, hasPart, hasVersion, identifier, instructionalMethod, isFormatOf, isPartOf, isReferencedBy, isReplacedBy, isRequiredBy, issued, isVersionOf, language, license, mediator, medium, modified, provenance, publisher, references, relation, replaces, requires, rights, rightsHolder, source, spatial, subject, tableOfContents, temporal, title, type, valid
Properties in the /elements/1.1/ namespace: contributor, coverage, creator, date, description, format, identifier, language, publisher, relation, rights, source, subject, title, type
Vocabulary Encoding Schemes: DCMIType, DDC, IMT, LCC, LCSH, MESH, NLM, TGN, UDC
Syntax Encoding Schemes: Box, ISO3166, ISO639-2, ISO639-3, Period, Point, RFC1766, RFC3066, RFC4646, RFC5646, URI, W3CDTF
Classes: Agent, AgentClass, BibliographicResource, FileFormat, Frequency, Jurisdiction, LicenseDocument, LinguisticSystem, Location, LocationPeriodOrJurisdiction, MediaType, MediaTypeOrExtent, MethodOfAccrual, MethodOfInstruction, PeriodOfTime, PhysicalMedium, PhysicalResource, Policy, ProvenanceStatement, RightsStatement, SizeOrDuration, Standard
DCMI Type Vocabulary: Collection, Dataset, Event, Image, InteractiveResource, MovingImage, PhysicalObject, Service, Software, Sound, StillImage, Text
Terms related to the DCMI Abstract Model: memberOf, VocabularyEncodingScheme
45. SKOS (Simple Knowledge Organization
System)
• Metadata for taxonomy
– Hierarchical structure of concepts
• Invented to represent taxonomies such as subject
headings
• Not the same as the subclass relationship among classes
• W3C Recommendation 18 August 2009
47. SKOS (Simple Knowledge Organization
System)
• SKOS Mapping
– skos:mappingRelation
– subPropertyOf skos:mappingRelation:
• skos:closeMatch
• skos:broadMatch
• skos:narrowMatch
• skos:relatedMatch
– subPropertyOf skos:closeMatch:
• skos:exactMatch
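The sub-property arrangement of the SKOS mapping properties determines what a stated mapping implies. The sketch below encodes the hierarchy as defined in the SKOS Reference (exactMatch under closeMatch, and closeMatch, broadMatch, narrowMatch and relatedMatch under mappingRelation) as a plain dictionary; the helper names are hypothetical.

```python
# Each mapping property points to its direct super-property.
SUB_PROPERTY_OF = {
    "skos:closeMatch": "skos:mappingRelation",
    "skos:exactMatch": "skos:closeMatch",
    "skos:broadMatch": "skos:mappingRelation",
    "skos:narrowMatch": "skos:mappingRelation",
    "skos:relatedMatch": "skos:mappingRelation",
}

def super_properties(prop):
    """Super-properties of `prop`, nearest first."""
    result = []
    while prop in SUB_PROPERTY_OF:
        prop = SUB_PROPERTY_OF[prop]
        result.append(prop)
    return result

def entails(asserted, queried):
    """True if a statement using `asserted` implies one using `queried`."""
    return asserted == queried or queried in super_properties(asserted)

print(super_properties("skos:exactMatch"))
print(entails("skos:exactMatch", "skos:closeMatch"))  # True
print(entails("skos:closeMatch", "skos:exactMatch"))  # False
```

A query for all skos:mappingRelation links therefore finds every kind of mapping, while a query for skos:exactMatch finds only the strongest ones.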
48. Linked Open Vocabulary (LOV)
• A technical platform for search and quality
assessment across the vocabulary
ecosystem
– Register schemata
– Search schemata
• http://labs.mondeca.com/dataset/lov/
51. Summary for schema
• Some major schemata
– DC, DC terms, FOAF, SKOS …
• More domain-specific schemata
– CIDOC CRM
– PRISM
–…
• Re-use is highly recommended
– LOV
52. Summary
• Three layers
– Ontology/Thesaurus/Taxonomy
– Schema
– Identification
• Not just top-down, but rather bottom-up
• Each layer has its own role
• Do not pursue the value of each layer in isolation; rather,
make a good combination of them