Linkage in Haze: challenges and take-home messages of crowd-sourcing vagueness in musical data

Alessandro Adamou
Listening to music: people, practices and experiences
Sunday, October 25, 2015

What we capture in a listening experience
[Diagram: the facets of a listening experience: who, what, when, where, how, what they said, documented where]

A well-formed listening experience
On July 19, 2014
Leonie Holmes (a professor of Music in New Zealand)*
was listening to Richard Strauss’ “Don Juan”
and “Also sprach Zarathustra”
and Sarah Ballard’s “Synergos”
played by the NZSO National Youth Orchestra
and Alexander Shelley
using harp and double bass (+others?)
in the Aotea Centre.
(*) plus a generic public, which does not pose a problem in the representation.

A worst-case (but more likely)
listening experience
One evening between May and September
in the late 1950s
a group of war veterans, a reporter and an
unknown female
were listening to Chopin and an anthem
played by a string orchestra
in a concert hall in South London.

Goal: to capture both fact sets
as structured data
(interconnected or prepared for
refinement)

Other issues
• Source
– unpublished manuscripts
• Unaligned semantic layers
– instrument category | instrument name | brand and model
  e.g. Chords | Electric Guitar | Gibson Les Paul Custom Sunburst
– generic occupations | gender-dependent | personal titles
  e.g. Monarch | King (Queen), Emperor (Empress)… | King of England, fourth Sultan of Zanzibar…

Factors contributing to fuzziness
• Domain knowledge of the evidence author(s)
• Deterioration of evidence
• Crowd-sourcing community issues:
– Misaligned semantics
– Varying scholarly rigour
– Popularity of the domain of interest

Data representation in LED
• Linked [Open] Data http://linkeddata.org
– Formalism for machine-readable and human-readable data
– Object identifiers are URIs
– Standard representation and query languages
(RDF, SPARQL…)
– The meaning of links between objects is
globally understood.

Identity in Linked Data
• http://musicbrainz.org/artist/f5aca88c-e3c1-4bc2-af33-68a9a9f7b56a#_
– (the band Killing Joke, as in MusicBrainz)
• http://bnb.data.bl.uk/id/agent/DailyMirror
– (The Daily Mirror on the British National Bibliography)
• http://dbpedia.org/resource/London
– (London, as in Wikipedia/DBpedia)
• http://reference.data.gov.uk/doc/day/2015-10-23
– (last Friday, as in the UK Government Calendar data)
• http://led.kmi.open.ac.uk/term/Medium.Live
– (the concept of live music, as in LED)

Easy: these are all named entities…

Goal: to capture both fact sets
as linked data
There are no right or wrong
ways to do it, only linkable or
unlinkable.

Linked Data encourage reuse…
No two things are assumed to be either distinct or equal until
some LD node asserts or implies it.
– e.g. Bono on MusicBrainz and Bono on DBpedia
– Groups too, if it can be demonstrated they are an
exact match
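The reuse principle above can be sketched as a small union-find over explicit equivalence links (in Linked Data these would typically be owl:sameAs statements). This is only an illustration: the class name is hypothetical, and the short identifiers are placeholders, not real MusicBrainz or DBpedia URIs.

```python
class EquivalenceIndex:
    """Tracks which identifiers have been asserted (or implied) to be the same thing."""

    def __init__(self):
        self._parent = {}

    def _find(self, uri):
        # Unknown identifiers start out as their own representative.
        self._parent.setdefault(uri, uri)
        while self._parent[uri] != uri:
            # Path halving keeps later lookups short.
            self._parent[uri] = self._parent[self._parent[uri]]
            uri = self._parent[uri]
        return uri

    def assert_same(self, a, b):
        """Record an explicit equivalence link between two identifiers."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_a] = root_b

    def known_same(self, a, b):
        """True if equality was asserted or follows transitively.
        False means 'not known to be equal', not 'known to be distinct'."""
        return self._find(a) == self._find(b)


index = EquivalenceIndex()
index.assert_same("mb:bono", "dbp:Bono")    # placeholder identifiers
index.assert_same("dbp:Bono", "other:bono")
print(index.known_same("mb:bono", "other:bono"))
print(index.known_same("mb:bono", "dbp:The_Edge"))
```

The second query returns False only because no link has been stated yet, which mirrors the open-world reading of the slide: absence of a link is not a claim of distinctness.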

…but fuzzy concepts have caveats
Group entity “Mourners of Felix Mendelssohn” (attending the arrival of his body in Berlin)
See http://led.kmi.open.ac.uk/entity/lexp/1434029100189
• The identifier of this group,
http://data.open.ac.uk/led/agent/Mourners+of+Felix+Mendelssohn/1434029100190
should not be reused when modelling an entry about Mendelssohn’s funeral service.
See http://led.kmi.open.ac.uk/entity/lexp/1434029387526
– The identifier of the group in that entry is
http://data.open.ac.uk/led/person/Mourners+at+the+Funeral+Service+of+Felix+Mendelssohn/1434029247629
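The two identifiers above share a visible pattern: a base, an entity kind, a URL-encoded label and a numeric local id. How LED actually mints its identifiers is not stated here, so the following is a sketch inferred from those two URIs only; the function name `mint_uri` is hypothetical.

```python
from urllib.parse import quote_plus

# Base copied from the identifiers shown on the slide.
LED_BASE = "http://data.open.ac.uk/led"


def mint_uri(kind, label, local_id):
    """Build <base>/<kind>/<encoded label>/<local id> (pattern inferred, not confirmed)."""
    return f"{LED_BASE}/{kind}/{quote_plus(label)}/{local_id}"


arrival = mint_uri("agent", "Mourners of Felix Mendelssohn", 1434029100190)
funeral = mint_uri(
    "person",
    "Mourners at the Funeral Service of Felix Mendelssohn",
    1434029247629,
)
print(arrival)
print(funeral)
```

Because the local id is part of the identifier, two vaguely described groups never collide by accident, even when their labels are similar; linking them remains a separate, explicit decision.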

Blank nodes
• Fallback mechanism for providing data about objects
without having a naming convention for them.
• Reference something not by name, but by description.
• Example (Turtle, prefix declarations omitted):
    <performance/Messiah/12345> mo:listener [
        a foaf:Group ;
        dc:description "Foreign ambassadors" ;
        :occupation dbpedia:Ambassador
    ] .
• Generally not an advisable solution:
– Cannot perform matching on blank nodes
– Querying or detecting changes in the data is much harder
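The drawbacks listed above can be illustrated with a toy model in which triples are plain tuples. It assumes, as a simplification, that each load of the data gives the blank node a fresh document-local label; the helper names and the example URI are hypothetical.

```python
from itertools import count

# Description of the group from the example above, as (predicate, object) pairs.
GROUP = [
    ("rdf:type", "foaf:Group"),
    ("dc:description", "Foreign ambassadors"),
    (":occupation", "dbpedia:Ambassador"),
]
PERFORMANCE = "performance/Messiah/12345"


def load_with_blank_node(labels):
    """Simulate one load: the group has no URI, so it gets a fresh local label."""
    node = f"_:b{next(labels)}"
    graph = {(PERFORMANCE, "mo:listener", node)}
    graph |= {(node, predicate, obj) for predicate, obj in GROUP}
    return graph


def load_with_uri(uri):
    """Same data, but the group is named by a stable identifier."""
    graph = {(PERFORMANCE, "mo:listener", uri)}
    graph |= {(uri, predicate, obj) for predicate, obj in GROUP}
    return graph


labels = count()
first, second = load_with_blank_node(labels), load_with_blank_node(labels)
# Nothing changed in the data, yet a plain set comparison reports differences.
print(len(first ^ second))

group_uri = "http://example.org/group/foreign-ambassadors"  # hypothetical URI
print(len(load_with_uri(group_uri) ^ load_with_uri(group_uri)))
```

With blank nodes, telling whether two loads describe the same group requires comparing the structure around the node rather than comparing identifiers, which is what makes change detection and matching harder.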

Ontological classes
• Model vague objects as formally-specified categories rather than named
entities
• e.g. “the class of all people whose occupation is Ambassador and who were
at the Royal Albert Hall on May 12, 1876”
• Pros:
– Allows separation of “known” and “generic” entities
– Semantically cleaner and easier to store and manage
• Cons:
– Still need to make URIs for each class
– They have to be instantiated before they can be used in a listening
experience
– Harder to apply changes to the data without fixed classes

Countermeasures in LED
• No blank nodes
• For unaligned semantic layers (cf.
example on instruments and
occupations):
–Use lax model properties
–Enforce reuse of external taxonomies
• ‘rich’ real-time recommendations

Data reconciliation
Currently with restricted
access, but plans to open
to crowd-sourcing

• Ad-hoc formal models for underspecified data.
• Example: Extended Date/Time Format (standard draft, Library of
Congress, 2012)
– Allows formalisation of underspecified points in time and intervals, e.g.
“187u-05-uu”
– We extended it to support subjective fuzzy intervals (e.g. early/mid/late)
and ranges (from-to)
– Made available in RDF through data.open.ac.uk
• Example 2: GeoSPARQL
– Used to support geospatial queries in Linked Data
– Named entity recognition on arbitrary text for locations (recently)
– We compute a location's URI by hashing its description and all the
locations extracted from it and related via geosparql:sfIntersects

How thick is the mist in LED?
                Named   Vague                          Total
Participants    802     260                            1062
Locations       136     15 (cannot pinpoint)           151*
Times           826     843 (ranges, not qualified)    1669
Musical works   1550    1263                           2813
(*) since database opened to arbitrary experience locations
Figures for LED public dataset
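To read the figures above as proportions, the share of vague entities per category can be computed directly from the table. The dictionary layout and function name are illustrative only.

```python
# (named, vague) counts copied from the table above.
counts = {
    "Participants": (802, 260),
    "Locations": (136, 15),
    "Times": (826, 843),
    "Musical works": (1550, 1263),
}


def vague_share(named, vague):
    """Fraction of entities in a category that are vague."""
    return vague / (named + vague)


for kind, (named, vague) in counts.items():
    print(f"{kind:<14} {vague_share(named, vague):6.1%}")
```

By this measure roughly a quarter of participants, about a tenth of locations, about half of the times and a little under half of the musical works are vague.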

Lessons learnt
• Advantages
– Open-world semantics: minimises the risk of ambiguities
generated by name clashes and allows for coherent management
– Monotonic: data are refined by addition of facts
– Can be reasoned upon by machine-learning agents working
on the native data structure
– Incorporates reuse for the benefit of the whole data cloud.
• Disadvantages
– No reuse entails heavy replication
– Data cleansing may require a large context for detecting entities
that can be reconciled

Lessons learnt
• Most, if not all, representational issues with vagueness can be
addressed in LD without resorting to blank nodes and without
ambiguity.
– Far more powerful than traditional database systems.
• Data providers have yet to reach an agreement, even a tacit one, on:
– representational paradigms for entities commonly at risk of
underspecification, such as spatio-temporal ones;
– how to name their objects.
• Most are making it easy for themselves when it comes to LD
• The way to go is de facto standards

Where to go next
• Model ontological classes as their instances
(equivalence classes?)
• Increase context for fact-based data alignment
(opening reconciliation facilities to the public –
with voting?)
• Argumentation on every statement in LED
• Dissemination of controlled vocabularies and
naming conventions for managed vague entities.

Are Linked Data mature for
representing vagueness?
• The technology is.
• The data out there aren’t.
– (but that is the part that can be improved)

Further reading
• Eero Hyvönen, Publishing and Using Cultural Heritage Linked
Data on the Semantic Web (Morgan & Claypool, 2012)
• Daniel J. Lewis and Trevor P. Martin, Managing Vagueness with
Fuzzy in Hierarchical Big Data. In 2015 INNS Conference on Big
Data, Procedia Computer Science, Vol. 53, pp. 19-28 (Elsevier, 2015)
• Elie Sanchez (ed.), Fuzzy Logic and the Semantic Web (Elsevier,
2006)

Thank you!
QA time
alessandro.adamou@open.ac.uk
