Integration of a Unique Multimedia Collection into Public Linked Open Data Repositories

Integration of a Unique Multimedia Collection
into Public Linked Open Data Repositories
@peterbroadwell, @mart1nkle1n – #OR2016
1
Peter M. Broadwell
@peterbroadwell
broadwell@library.ucla.edu
Martin Klein
@mart1nkle1n
martinklein@library.ucla.edu
Let the Music Live/
que viva la música
Techniques for Managed Integration of a
Unique Multimedia Collection into Public
Linked Open Data Repositories

2
The collection
http://frontera.library.ucla.edu

3
The collection
• 116,000 songs digitized and made available as audio
files to date, out of an estimated 160,000 in total
• Originally recorded from 1905 to the 1990s on ~2,000
commercial record labels
• Storage footprint of streaming MP3s: 460 GB
Format Number of songs
33 RPM (1955-1990) 14,741
45 RPM (1955-1990) 51,220
78 RPM (1905-1955) 33,191
Cassette tape (1955-1990) 7,879
Reel-to-reel tape *1955-1990) 368
• ~300,000 album images (covers and media)

4
The collection
• 13,752 unique artists or groups on album covers
• 7,035 unique names from album sleeves
• 24,221 unique composers
• 2,000-2,500 labeled song types/genres
Record label # of songs
Victor 8,591
Columbia 8,196
Ideal 4,819
Falcon 4,532
Peerless 3,336
Bego 2,411
Vocalion 2,164
Del Valle 2,145
Song type # of songs
ranchera 21,947
bolero 10,522
corrido 7,393
canción 5,410
polka 4,742
canción ranchera 2,736
cumbia 2,055
vals 1,399

5
The collection
• ~700 unique song tags/keywords (prior to translation)
• All songs tagged with 1-20 keywords (avg ~4.5)

6

7
Chris Strachwitz

8
Arhoolie Records

9

10

11

12
Supporters of the collection

13
Research using the Frontera
collection as a primary source

14
A “multimedia encyclopedia”

15
More metadata, more problems
• No authority values employed for person and group
names; “name hacking” used to approximate
uniqueness
• Relationship between song, album, and “release” is not
consistent
• Authority data for song entities is better: matrix numbers
and catalog numbers are available
• Collection is entirely “siloed” on its current site, largely due
to its homegrown metadata scheme

16
• Adopt metadata structures of open online music
encyclopedias (MusicBrainz)
• Use unique IDs from linked open data knowledge
bases to identify people, groups, companies,
songs, albums, etc.
• Adopting IDs from external LOD sites lets us link
out to these related records
• When records are missing from external LOD
knowledge bases, add them to those sites
automatically
Goal: incorporate Frontera into
the broader semantic web

17
LOD records and relations

18

19

20
Inspiration: Linked Jazz, NYPL
Labs’ ECCO, LD4L

21
Labs’ ECCO, LD4L

22
Labs’ ECCO, LD4L

23
LOD integration: phase 1
Initial metadata cleaning and preparation
• Identify likely unique entities (names, etc.) via “fuzzy
matching,” e.g., MD5 hash comparisons
• Challenge: finding methods that scale to >100,000 rows
(many approaches must be scripted)
• May necessitate creation of Yet Another Database
• Generate audio fingerprints of music files

24
Discovery and linking of existing records
• Entity lookup in LOD knowledge bases
• Audio fingerprint lookups in AcoustID database, which
links to MusicBrainz
• Search for artist, group, and composer names in service
APIs (note: these work better with English than Spanish)
• DBpedia Spotlight
• MusicBrainz
• Discogs
• VIAF, LCNAF (worth a try)
• Combination of automated and crowd-sourced verification
of links, integration into site

25
Contributing/creating new records
• Unsolicited bulk record generation may be seen as linked
data spam and rejected (“notability” problem)
• Direct communication and participation in knowledge
base’s community is the most promising approach
• Case study: discussion with MusicBrainz community
• Voting/editorial review system can be incompatible with
bulk updates, but the community may be willing to
accommodate
• Data records should be well formed and clean; upload
methods must be tested and the upload coordinated
with LOD admins

26

27
LOD integration: the “bot” option

28
LOD integration: crosswalks
between repositories and records

29
Progress to date
• Used metadata cleaning approaches to identify most likely
unique names in the DB

30
Progress to date
• Applied acoustic fingerprinting to all 116,000 audio files
• matched 1,313 songs
• following the AcoustID links to MusicBrainz positively
identifies ~287 artists with their records in MusicBrainz
(as well as Discogs and DBpedia)

31
Progress to date
• Ran DBpedia Spotlight on all artists and composer names,
correlated matched entities with MusicBrainz, Wikidata IDs

32
Progress to date
• Ran DBpedia Spotlight on all artists and composer names,
correlated matched entities with MusicBrainz, Wikidata IDs
• Searched for artist and composer names via MusicBrainz,
Discogs, and VIAF APIs

33
Entity matching to LOD sites
Artists on label
(out of 13,752)
Artists on sleeve
(out of 7,035)
Composers
(out of 24,211)
Acoustic
fingerprinting
287 (for all names)
DBpedia
Spotlight
272 27 72
MusicBrainz
lookup
620 434 1,151
Discogs
search API
4,929 3,502 9,423
VIAF search
API
3,707 3,057 8,889
*These are likely in order of decreasing accuracy!

34
Concerns/next steps
• Scalable approaches for Q/A of data (new and old)
• Discoverability and usability for humans and machines
(APIs)
• Repository integration: adopting a linked data model will
help
• Trusted channels for upload to existing knowledge bases:
design a formal model?
• Work with specialized sub-collections of knowledge bases
(topics, regions)?
• Test DBpedia Spotlight w/Spanish data pack
• Does using links to existing LOD entries just reinforce
inequality of artist exposure (“rich get richer”/LOD “echo
chamber”)?

35
Thanks!
UCLA Digital Library
• Lisa McAulay
• Kristian Allen
• T-Kay Sangwand
• …everyone else (past and present)
Arhoolie Foundation
• Tom Diamant
• Chris Strachwitz (obviously)

36
Thanks!

37
Peter M. Broadwell
@peterbroadwell
broadwell@library.ucla.edu
Martin Klein
@mart1nkle1n
martinklein@library.ucla.edu
Let the Music Live/
que viva la música

Integration of a Unique Multimedia Collection into Public Linked Open Data Repositories

Recommended

Recommended

More Related Content

Similar to Integration of a Unique Multimedia Collection into Public Linked Open Data Repositories

Similar to Integration of a Unique Multimedia Collection into Public Linked Open Data Repositories (20)

Recently uploaded

Recently uploaded (20)

Integration of a Unique Multimedia Collection into Public Linked Open Data Repositories

Editor's Notes