Supporting crowd-sourced listening experiences with Web Data technologies
1. Supporting crowd-sourced listening
experiences with Web Data
technologies
Mathieu d’Aquin, Alessandro Adamou
Knowledge Media Institute, The Open University
{mathieu.daquin|alessandro.adamou}@open.ac.uk
@mdaquin | @anticitizen79
2. Crowdsourcing in databases
Aggregating data by soliciting contributions from a
community.
examples:
• Discogs, Setlist.fm, Encyclopedia Metallum
• Zooniverse (SETIlive, Old Weather etc.)
• Wikipedia (disputed?)
• UK Reading Experience Database
• Historic Cambridge Newspaper Collection
3.
6. Bootstrapping a crowdsourced
database
• Earliest contributors implicitly dictate de facto
quality standards that will be followed by their
successors.
• Risk of recreating the same data multiple
times: no benefit from prior contributions.
A non-empty initial database can mitigate these
issues.
7.
8. Naïve aggregation – same user
“La Valse”
location: London
location-country: UK
date: 30 Oct. 1930
performer: Benjamin Britten
performer-birthdate: 22 Nov. 1913
performer-birthplace: Lowestoft
performer-birth_country: UK
performer-deathdate: 4 Dec. 1976
performer-deathplace: Aldeburgh
performer-death_country: UK
performer-occupation: Musician
composer: Maurice Ravel
composer-birthdate: ….
………….
(event attended by Benjamin
Britten)
location: Queen’s Hall
location-country: UK
date: 23 Sept. 1930
listener: Benjamin Britten
listener-birthplace: …
listener-birth_country: …
…
…
Not again! I’ll just write “Benjamin Britten”
9. Reuse
“La Valse”
location: London
location-country: UK
date: 30 Oct. 1930
performer: Benjamin Britten
composer: Maurice Ravel
Benjamin Britten
birthdate: 22 Nov. 1913
birthplace: Lowestoft
birth_country: UK
deathdate: 4 Dec. 1976
deathplace: Aldeburgh
death_country: UK
occupation: Musician
(event attended by Benjamin
Britten)
location: Queen’s Hall
location-country: UK
date: 23 Sept. 1930
listener: Benjamin Britten
“Benjamin Britten” is still not shared
across the userbase.
10. Still, for two users:
• Two “Benjamin Britten”s
• Different degrees of detail (possibly
discordant!)
• Two different sets of
performances/experiences for each Britten.
11. Reuse, cross-user
“La Valse”
location: London
location-country: UK
date: 30 Oct. 1930
performer: Benjamin Britten
composer: Maurice Ravel
(event attended by Benjamin
Britten)
location: Queen’s Hall
location-country: UK
date: 23 Sept. 1930
listener: Benjamin Britten
Benjamin Britten
birthdate: 22 Nov. 1913
birthplace: Lowestoft
birth_country: UK
deathdate: 4 Dec. 1976
deathplace: Aldeburgh
death_country: UK
occupation: Musician
User A User B
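The cross-user reuse on this slide can be sketched in Python as follows. This is only a minimal illustration, not the LED implementation: the dictionary layout, the key `"britten"` and the helper `records_about` are assumptions made for the example.

```python
# Shared entity store: one description of Benjamin Britten for all users.
# Keys and field names are hypothetical, chosen to mirror the slide.
entities = {
    "britten": {
        "name": "Benjamin Britten",
        "birthdate": "22 Nov. 1913",
        "birthplace": "Lowestoft",
        "birth_country": "UK",
        "deathdate": "4 Dec. 1976",
        "deathplace": "Aldeburgh",
        "death_country": "UK",
        "occupation": "Musician",
    }
}

# Each user's record stores only a reference to the shared entity.
records = [
    {"user": "A", "work": "La Valse", "location": "London",
     "date": "30 Oct. 1930", "performer": "britten"},
    {"user": "B", "location": "Queen's Hall",
     "date": "23 Sept. 1930", "listener": "britten"},
]

def records_about(entity_id):
    """Return every record that references the entity in any role."""
    return [r for r in records
            if entity_id in (r.get("performer"), r.get("listener"))]

for record in records_about("britten"):
    print(record["user"], record["date"])
```

Because both records point at the same key, a correction to Britten's details is made once and is visible to every user.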
12. …but does any modern
database really start up
empty today?
13. Reuse, cross-user, cross-database
Wikipedia
MusicBrainz
Geonames
Benjamin
Britten
birthdate: 22 Nov. 1913
birthplace: Lowestoft
deathdate: 4 Dec. 1976
deathplace: Aldeburgh
occupation: Musician
“La Valse”
composer: Maurice Ravel
part of: Catalogue Marcel Marnat des oeuvres de MR
Lowestoft
region: Suffolk
country: UK
Queen’s Hall
location: London
London
country: UK
User A
(performance of) “La Valse”
location: London
date: 30 Oct. 1930
performer: Benjamin Britten
User B
(event attended by Benjamin
Britten)
location: Queen’s Hall
date: 23 Sept. 1930
listener: Benjamin Britten
14. Almost every database can be bootstrapped out
of a rich, human-readable and machine-readable
data source: the Web [of data].
Because the Web isn’t just pages anymore.
20. • URIs identify things, not pages
– They may also produce Web pages
• Data are encoded in terms of relations
between the things identified by the URIs
<http://...Eine_Kleine_Nachtmusik_(album)>
    <http://...performer> <http://...Venom_(band)> ;
    <http://…track> <http://... 4b42269f4510> .
<http://... 4b42269f4510> <http://…title> "Countess Bathory" .
• There are standards for making these relations
machine-readable
– data representation paradigm: RDF
– query language: SPARQL
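The idea of data as relations between URI-identified things can be sketched without any RDF library. The snippet below is an illustration only: the `http://example.org/` namespace is a hypothetical stand-in for the URIs elided on the slide, and the `match` helper is a toy approximation of a SPARQL triple pattern, not a SPARQL engine.

```python
# Hypothetical namespace standing in for the elided URIs on the slide.
EX = "http://example.org/"

# Each statement is a (subject, predicate, object) triple.
triples = [
    (EX + "album/Eine_Kleine_Nachtmusik", EX + "performer", EX + "band/Venom"),
    (EX + "album/Eine_Kleine_Nachtmusik", EX + "track", EX + "recording/1"),
    (EX + "recording/1", EX + "title", "Countess Bathory"),
]

def match(pattern, data):
    """Yield triples matching a pattern; None acts as a wildcard,
    loosely like a variable in a SPARQL triple pattern."""
    for triple in data:
        if all(p is None or p == t for p, t in zip(pattern, triple)):
            yield triple

# Follow the relations: which track titles belong to the album?
album = EX + "album/Eine_Kleine_Nachtmusik"
for _, _, track in match((album, EX + "track", None), triples):
    for _, _, title in match((track, EX + "title", None), triples):
        print(title)
```

In practice an RDF store and a SPARQL query would replace the list and the nested loops; the sketch only shows that traversal happens by following shared identifiers.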
21. Most of all, reuse URIs!
If a URI that identifies the song “Countess Bathory” is
http://musicbrainz.org/recording/c3a0be45-d8c3-4e16-b44c-4b42269f4510,
that does not make MusicBrainz the only authority with the right
to provide and store data about it using this name.
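A small sketch of what URI reuse buys: statements from independent sources that use the same URI can be combined without any matching step. The URI is the one on the slide; the property names, the second source's statement and the `merge` helper are hypothetical, for illustration only.

```python
from collections import defaultdict

# URI taken from the slide; everything else here is a made-up example.
SONG = "http://musicbrainz.org/recording/c3a0be45-d8c3-4e16-b44c-4b42269f4510"

source_a = [(SONG, "title", "Countess Bathory")]
source_b = [(SONG, "heard_at", "a hypothetical listening experience")]

def merge(*sources):
    """Group statements from any number of sources by subject URI."""
    merged = defaultdict(dict)
    for source in sources:
        for subject, prop, value in source:
            merged[subject][prop] = value
    return merged

description = merge(source_a, source_b)[SONG]
print(sorted(description))
```

Note that this toy `merge` keeps a single value per property, so a later source overwrites an earlier one; a real system would keep all values along with their provenance.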
26. “We have had a charming Concert.... Mr. Jones, the harper, began the Concert. He has a fine
instrument of Merlin's construction; he plays with great neatness and delicacy; but as
expression must have meaning, he does not abound in that commodity. […]”
- Diary of Frances Burney, May 1775 ?
Charles Burney Frances Burney Baron Deiden Baroness Deiden Edward Jones Mr. Merlin
?
Miss Burney
performances:
• harp music (perf. Edward Jones)
• harpsichord duet (comp. Muthel; perf. Charles Burney, Miss Burney)
• keyboard music (comp. Charles Burney, Echard, Schobert; perf. Charles Burney, Miss Burney…)
• vocal music (perf. Miss Louisa Harris)
27. “Our Concert proved to be very much the Thing... ....Mr Burney... Fired away, with his usual
successful velocity, to the amazement and delight of all present... [He] played a concerto of
Schobert, and one of my Father's , and a great deal of Extemporary Preluding. […]”
- Letter from Frances Burney to Samuel Crisp, May 1775
Charles Burney Frances Burney Baron Deiden Baroness Deiden Edward Jones John Joseph Merlin
Esther Burney
performances:
• harp music (perf. Edward Jones)
• harpsichord duet (comp. Muthel; perf. Charles Burney, Esther Burney)
• Lesson by Charles Burney (comp. Charles Burney; perf. Esther Burney)
• Rondeau from Piramo and Tisbé (comp. Venanzio Rauzzini; perf. James Harris, Miss Harris)
• piece by Johann Gottfried Eckard (comp. Johann Gottfried Eckard; perf. Esther Burney)
• concerto by Johann Schobert (comp. Johann Schobert; perf. Charles Burney)
28. “We have had a charming Concert.... Mr. Jones, the harper, began the Concert. He has a fine
instrument of Merlin's construction; he plays with great neatness and delicacy; but as
expression must have meaning, he does not abound in that commodity. […]”
- Diary of Frances Burney, May 1775
“Our Concert proved to be very much the Thing... ....Mr Burney... Fired away, with his usual
successful velocity, to the amazement and delight of all present... [He] played a concerto of
Schobert, and one of my Father's , and a great deal of Extemporary Preluding. […]”
- Letter from Frances Burney to Samuel Crisp, May 1775
Charles Burney Frances Burney Baron Deiden Baroness Deiden Edward Jones John Joseph Merlin
Esther Burney
performances:
• harp music (perf. Edward Jones)
• harpsichord duet (comp. Muthel; perf. Charles Burney, Esther Burney)
• Lesson by Charles Burney (comp. Charles Burney; perf. Esther Burney)
• Rondeau from Piramo and Tisbé (comp. Venanzio Rauzzini; perf. James Harris, Louisa Harris)
• piece by Johann Gottfried Eckard (comp. Johann Gottfried Eckard; perf. Esther Burney)
• concerto by Johann Schobert (comp. Johann Schobert; perf. Charles Burney)
31. Motivation
• Consolidate existing Web Data by integrating
new or refined information.
• Assist information retrieval applications by
providing an additional data node to traverse.
• Provide factual support to assess the
truthfulness of stated assertions on the Web.
32. BNB: British National Bibliography (The British Library)
DBpedia: structured data from Wikipedia infoboxes
LinkedBrainz: MusicBrainz as linked data
VIAF: Virtual International Authority File
[Figure: linked datasets: DBpedia, BNB, VIAF, GeoNames, LinkedBrainz]
34. LED data are re-published
as a Linked Open Data set
• Hosted at http://data.open.ac.uk
• SPARQL query service at
http://data.open.ac.uk/query
• Documentation at
http://led.kmi.open.ac.uk/linkeddata
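As a hedged sketch of how a client might talk to such a query service, the snippet below builds a standard SPARQL Protocol GET request using only the Python standard library. The endpoint URL is the one on the slide and may no longer be live; the query is deliberately generic because the LED vocabulary is not shown here, and the function names are assumptions for the example.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://data.open.ac.uk/query"  # from the slide; may have changed since

# A generic query that makes no assumption about the LED vocabulary.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"

def build_request(endpoint, query):
    """Build a SPARQL Protocol GET request asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"})

def run(endpoint, query):
    """Send the query and return the result bindings (needs network access)."""
    with urllib.request.urlopen(build_request(endpoint, query)) as response:
        return json.load(response)["results"]["bindings"]

# Only the request is built here; call run(ENDPOINT, QUERY) to execute it.
request = build_request(ENDPOINT, QUERY)
print(request.full_url)
```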
36. New data
Historical music performances
Royal Carl Rosa Company – “Faust”
for orchestra and voice
date: 14 May, 1917
location: Garrick Theatre
Patron’s Fund - “The Birthday of the Infanta”
date: 9 July, 1931
location: London (indoors, private space)
(you won’t find them on last.fm or setlist.fm)
37. New data
Portions and quotes of source documents / manuscripts
Journeying boy : the diaries of the young Benjamin Britten
1928-1938
(provided by the British Library)
Author: Benjamin Britten
Editor: John Evans
Published: Faber, London, 2009
ISBN: 9780571238835
…
(provided by LED)
Diary entries:
• Page 17, Feb 14 1929: “Still absent from school work. Everso much more snow […]”
• Page 67, March 18 1931: “Go with Mummy to B.B.C – Beethoven concert […]”
• Page 70, April 22 1931: “Go to John Nicholson’s to tea at 2.45. & to hear Gramophone
records on his new Radio-Gram Hear. Brahms. Pft. Concerto Mov. 1. (Rubenstein) Tchaik.”
• …
38. Refinements of existing data
Mary Somerville
(Provided by DBpedia)
Born: 1780-12-26 in Jedburgh
Died: 1872-11-28 in Naples
Field: Polymath, Science journalism
VIAF ID: 27288356
…
(integrated by LED)
Full name: Mary Fairfax Greig Somerville
Social group: Rulers, chiefs, aristocracy & gentry etc.
Occupation: Scientist
Religion: Christian, Protestant
wrote: Memoir of Mary Somerville (1817, 1840’s, 1849, 1850…)
39. Alignments
dbpedia:Aaron_Copland ≡ bnb:CoplandAaron1900-1990
dbpedia:Jane_Austen ≡ bnb:AustenJane1775-1817
These semantic links are not found on the LD cloud.
By exposing them, we assist Semantic Web applications in the
retrieval of relevant information from multiple data sources.
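One way an application could consume such alignments is to group equivalent identifiers into classes and treat each class as a single entity. The sketch below uses a small union-find structure for this; it is an assumption about how a consumer might work, not a description of how LED stores or publishes the links.

```python
# Alignments from the slide, written as pairs of equivalent identifiers.
same_as = [
    ("dbpedia:Aaron_Copland", "bnb:CoplandAaron1900-1990"),
    ("dbpedia:Jane_Austen", "bnb:AustenJane1775-1817"),
]

parent = {}

def find(x):
    """Return the representative of x's equivalence class."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving keeps lookups short
        x = parent[x]
    return x

def union(a, b):
    """Record that a and b identify the same thing."""
    parent[find(a)] = find(b)

for a, b in same_as:
    union(a, b)

def same_entity(a, b):
    """True if a and b have been aligned, directly or transitively."""
    return find(a) == find(b)

print(same_entity("bnb:CoplandAaron1900-1990", "dbpedia:Aaron_Copland"))
print(same_entity("dbpedia:Aaron_Copland", "dbpedia:Jane_Austen"))
```

With the equivalence classes in place, a query about the BNB identifier can also return data attached to the DBpedia one, which is the retrieval benefit the slide describes.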
40. Figures on reuse
Computed on 1102 distinct listening experiences
Type                                         Unique instances  Total reuse  Peak
People                                       626               1687         184
Written works                                902               990          46
Geographical locations                       492               266          70
Musical items (songs, albums, performances)  2400              337          22
Musical genres                               78                644          219

From external data sources:
Source        Reused distinct instances
DBpedia       823
BNB           339
data.gov.uk   816
MusicBrainz   SOON
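To make the figures easier to compare, the reuse rate per entity type can be worked out from the table. The snippet assumes that "Total reuse" counts how many times existing instances of a type were referenced again rather than re-entered; that reading is an interpretation of the slide, not something it states.

```python
# (unique instances, total reuse) per entity type, from the table above.
figures = {
    "People": (626, 1687),
    "Written works": (902, 990),
    "Geographical locations": (492, 266),
    "Musical items": (2400, 337),
    "Musical genres": (78, 644),
}

for kind, (unique, reuse) in figures.items():
    print(f"{kind}: {reuse / unique:.2f} reuses per instance")

total_reuse = sum(reuse for _, reuse in figures.values())
print("Total reuse:", total_reuse)  # 3924
```

Under that reading, genres and people are reused most heavily per instance, while musical items are mostly entered once.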
41. Questions?
Mathieu d’Aquin, Alessandro Adamou
Knowledge Media Institute, The Open University
{mathieu.daquin|alessandro.adamou}@open.ac.uk
@mdaquin | @anticitizen79
Editor's Notes
On behalf of Mathieu, who is sitting over there, I will talk about how we combined cutting-edge data management with the practices of crowdsourcing.
Having a non-empty database to begin with can help a lot. Though this seems a contradiction, it is only an apparent one.
We started off with a set of curated entries, so crowdsourcing is not the only way LED aggregates content, but it is the trickiest.
Wouldn’t it be great if we just stored the essential information that is unique to our knowledge and relied upon authoritative data sources for the rest? This would be the machine equivalent of when I tell someone “Hey, I just went to see a Ravel concert” “Who’s Ravel?” and I would quite practically, if impolitely, answer “Ah come on, look it up on Wikipedia!”
What seems to be the catch with it? It would appear that if I wanted to put the data on those websites to good use, I would have to either hack my way through to the underlying databases, or program my software systems to read the Web pages, with all the unpredictability that comes with it.
The never-abused-enough Mozart example, but with a twist.
Google did not start by publishing these structured data firsthand, but rather by compiling them from data sources on the Web, the same ones it indexes for us to search.
This kind of alignment can be done on-the-fly, not so much because we are reusing Web Data, but because we adopt the same paradigm as Linked Data within the LED system, where each entity is a node and we just move references around.
For the record, the Linked Data cloud is something that looks like this today, with every circle being a data provider.
What does this mean? That we have saved users the burden of rewriting information about the same people some seventeen hundred times overall, and up to 184 times for one person, who I think is Britten.