Integration of a Unique Multimedia Collection into Public Linked Open Data Repositories
@peterbroadwell, @mart1nkle1n – #OR2016
Peter M. Broadwell
@peterbroadwell
broadwell@library.ucla.edu
Martin Klein
@mart1nkle1n
martinklein@library.ucla.edu
Let the Music Live/
que viva la música
Techniques for Managed Integration of a
Unique Multimedia Collection into Public
Linked Open Data Repositories
The collection
http://frontera.library.ucla.edu
The collection
• 116,000 songs digitized and made available as audio
files to date, out of an estimated 160,000 in total
• Originally recorded from 1905 to the 1990s on ~2,000
commercial record labels
• Storage footprint of streaming MP3s: 460 GB
Format                          Number of songs
33 RPM (1955-1990)                       14,741
45 RPM (1955-1990)                       51,220
78 RPM (1905-1955)                       33,191
Cassette tape (1955-1990)                 7,879
Reel-to-reel tape (1955-1990)               368
• ~300,000 album images (covers and media)
The collection
• 13,752 unique artists or groups on album covers
• 7,035 unique names from album sleeves
• 24,221 unique composers
• 2,000-2,500 labeled song types/genres
Record label    # of songs        Song type           # of songs
Victor               8,591        ranchera                21,947
Columbia             8,196        bolero                  10,522
Ideal                4,819        corrido                  7,393
Falcon               4,532        canción                  5,410
Peerless             3,336        polka                    4,742
Bego                 2,411        canción ranchera         2,736
Vocalion             2,164        cumbia                   2,055
Del Valle            2,145        vals                     1,399
The collection
• ~700 unique song tags/keywords (prior to translation)
• All songs tagged with 1-20 keywords (avg ~4.5)
Chris Strachwitz
Arhoolie Records
Supporters of the collection
Research using the Frontera
collection as a primary source
A “multimedia encyclopedia”
More metadata, more problems
• No authority values employed for person and group
names; “name hacking” used to approximate
uniqueness
• Relationship between song, album, and “release” is not
consistent
• Authority data for song entities is better: matrix numbers
and catalog numbers are available
• Collection is entirely “siloed” on its current site, largely due
to its homegrown metadata scheme
Goal: incorporate Frontera into
the broader semantic web
• Adopt the metadata structures of open online music
encyclopedias (MusicBrainz)
• Use unique IDs from linked open data knowledge
bases to identify people, groups, companies,
songs, albums, etc.
• Adopting IDs from external LOD sites lets us link
out to their related records
• When records are missing from external LOD
knowledge bases, add them to those sites
automatically
LOD records and relations
Inspiration: Linked Jazz, NYPL
Labs’ ECCO, LD4L
LOD integration: phase 1
Initial metadata cleaning and preparation
• Identify likely unique entities (names, etc.) via “fuzzy
matching,” e.g., comparing MD5 hashes of normalized name strings
• Challenge: finding methods that scale to >100,000 rows
(many approaches must be scripted)
• May necessitate creation of Yet Another Database
• Generate audio fingerprints of music files
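The name-matching step above can be sketched in a few lines. This is a minimal illustration under our own assumptions (the helper names are hypothetical, not part of any Frontera codebase): normalize each name before hashing so that near-duplicate spellings collapse to the same MD5 key, which scales easily to >100,000 rows.

```python
import hashlib
import re
import unicodedata

def name_key(name: str) -> str:
    """Normalize a name for matching: strip accents, lowercase,
    drop punctuation, and sort tokens so word order is ignored."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    tokens = sorted(re.findall(r"[a-z0-9]+", no_accents.lower()))
    return " ".join(tokens)

def name_hash(name: str) -> str:
    """MD5 of the normalized key; equal hashes flag likely duplicates."""
    return hashlib.md5(name_key(name).encode("utf-8")).hexdigest()

# Two spellings of the same group collapse to one key:
print(name_hash("Los Hermanos Bañuelos") == name_hash("Hermanos Banuelos, Los"))  # True
```

Exact-match hashing like this only catches variant spellings that normalize identically; fuzzier techniques (edit distance, phonetic keys) would be layered on top.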
LOD integration: phase 2
Discovery and linking of existing records
• Entity lookup in LOD knowledge bases
• Audio fingerprint lookups in AcoustID database, which
links to MusicBrainz
• Search for artist, group, and composer names in service
APIs (note: these work better with English than Spanish)
• DBpedia Spotlight
• MusicBrainz
• Discogs
• VIAF, LCNAF (worth a try)
• Combination of automated and crowd-sourced verification
of links, integration into site
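As one concrete example of the API lookups listed above, a name search against the public MusicBrainz web service (the /ws/2/artist endpoint) might look like this sketch; the function names and User-Agent string are our own placeholders:

```python
import json
import urllib.parse
import urllib.request

def mb_artist_query_url(name: str, limit: int = 3) -> str:
    """Build a MusicBrainz /ws/2/artist search URL (JSON output)."""
    query = urllib.parse.quote(f'artist:"{name}"')
    return (f"https://musicbrainz.org/ws/2/artist"
            f"?query={query}&limit={limit}&fmt=json")

def mb_artist_candidates(name: str, limit: int = 3):
    """Fetch candidate (name, score, MBID) tuples for later review.
    MusicBrainz asks clients to send a meaningful User-Agent header."""
    req = urllib.request.Request(
        mb_artist_query_url(name, limit),
        headers={"User-Agent": "frontera-lod-matcher/0.1 (contact@example.org)"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return [(a["name"], a.get("score"), a["id"]) for a in data.get("artists", [])]
```

The returned candidates would still pass through the automated and crowd-sourced verification step before any link is accepted.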
LOD integration: phase 3
Contributing/creating new records
• Unsolicited bulk record generation may be seen as linked
data spam and rejected (“notability” problem)
• Direct communication and participation in knowledge
base’s community is the most promising approach
• Case study: discussion with MusicBrainz community
• Voting/editorial review system can be incompatible with
bulk updates, but the community may be willing to
accommodate
• Data records should be well formed and clean; upload
methods must be tested and the upload coordinated
with LOD admins
LOD integration: the “bot” option
LOD integration: crosswalks
between repositories and records
Progress to date
• Used metadata cleaning approaches to identify most likely
unique names in the DB
• Applied acoustic fingerprinting to all 116,000 audio files
• matched 1,313 songs
• following the AcoustID links to MusicBrainz positively
identifies ~287 artists with their records in MusicBrainz
(as well as Discogs and DBpedia)
• Ran DBpedia Spotlight on all artists and composer names,
correlated matched entities with MusicBrainz, Wikidata IDs
• Searched for artist and composer names via MusicBrainz,
Discogs, and VIAF APIs
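The fingerprinting step above can be sketched as follows, assuming Chromaprint's fpcalc command-line tool is installed and an AcoustID API key is available (the helper names and key are placeholders): compute a fingerprint per file, then query the AcoustID lookup service, which returns linked MusicBrainz recording IDs.

```python
import json
import subprocess
import urllib.parse

def chromaprint_fingerprint(path: str):
    """Run Chromaprint's fpcalc on an audio file; return (duration, fingerprint)."""
    out = subprocess.run(["fpcalc", "-json", path],
                         capture_output=True, text=True, check=True)
    info = json.loads(out.stdout)
    return info["duration"], info["fingerprint"]

def acoustid_lookup_url(api_key: str, duration: float, fingerprint: str) -> str:
    """Build an AcoustID /v2/lookup URL requesting linked MusicBrainz
    recording IDs (meta=recordings)."""
    params = urllib.parse.urlencode({
        "client": api_key,
        "duration": int(duration),
        "fingerprint": fingerprint,
        "meta": "recordings",
        "format": "json",
    })
    return "https://api.acoustid.org/v2/lookup?" + params
```

In practice the lookups would be batched and rate-limited across all 116,000 files.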
Entity matching to LOD sites

Method                    Artists on label   Artists on sleeve   Composers
                          (of 13,752)        (of 7,035)          (of 24,221)
Acoustic fingerprinting   287 (combined across all name types)
DBpedia Spotlight             272                  27                  72
MusicBrainz lookup            620                 434               1,151
Discogs search API          4,929               3,502               9,423
VIAF search API             3,707               3,057               8,889

*These methods are likely listed in order of decreasing accuracy!
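The DBpedia Spotlight matches in the table above come from its public REST annotate endpoint. A minimal sketch (the helper name is ours; swapping /en/ for /es/ in the endpoint would exercise the Spanish model):

```python
import json
import urllib.parse
import urllib.request

SPOTLIGHT_ENDPOINT = "https://api.dbpedia-spotlight.org/en/annotate"

def spotlight_entities(text: str, confidence: float = 0.5):
    """Annotate free text with DBpedia Spotlight; return a list of
    (DBpedia resource URI, similarity score) pairs."""
    params = urllib.parse.urlencode({"text": text, "confidence": confidence})
    req = urllib.request.Request(
        SPOTLIGHT_ENDPOINT + "?" + params,
        headers={"Accept": "application/json"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return [(r["@URI"], float(r["@similarityScore"]))
            for r in data.get("Resources", [])]
```

Raising the confidence threshold trades recall for precision, which matters for short, ambiguous artist names.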
Concerns/next steps
• Scalable approaches for Q/A of data (new and old)
• Discoverability and usability for humans and machines
(APIs)
• Repository integration: adopting a linked data model will
help
• Trusted channels for upload to existing knowledge bases:
design a formal model?
• Work with specialized sub-collections of knowledge bases
(topics, regions)?
• Test DBpedia Spotlight w/Spanish data pack
• Does using links to existing LOD entries just reinforce
inequality of artist exposure (“rich get richer”/LOD “echo
chamber”)?
Thanks!
UCLA Digital Library
• Lisa McAulay
• Kristian Allen
• T-Kay Sangwand
• …everyone else (past and present)
Arhoolie Foundation
• Tom Diamant
• Chris Strachwitz (obviously)
Peter M. Broadwell
@peterbroadwell
broadwell@library.ucla.edu
Martin Klein
@mart1nkle1n
martinklein@library.ucla.edu
Let the Music Live/
que viva la música
Editor's Notes
Thanks! The title is kind of a mouthful, but at least it's bilingual, like the collection I'll be talking about. I know this is a conference about repositories rather than collections, but as the title suggests, I'll be describing some exploratory work we've done to investigate how to integrate this collection, and others like it, into the broader world of linked data, specifically linked open data. We have a collection that's an especially good candidate for this treatment: an extensive multimedia collection of unique cultural materials, now about 15 years old, that has so far remained largely isolated online. We've recently been researching ways to make it part of the web, not merely on the web.
Here's the current site for the collection. More statistics are on the next slide, but first, given that the headline of the talk is “let the music live” (and the title of this entire session is “let the content sing”), I figure we should listen to a bit of it.
The song: a bitingly satirical two-part corrido, “El Lavaplatos” (“The Dishwasher”), recorded by Los Hermanos Bañuelos in LA in the 1920s. It tells the first-person story of a Mexican immigrant who seeks success in Hollywood but finds only menial labor and dashed dreams, and returns to Mexico more broke than before.
Vital statistics about the materials in the collection. Digitization began in 2001 and continues today.
Most of the recordings are technically still in copyright, but many would qualify as “orphan works” (the original label is long since defunct, and ownership of its remaining assets is unclear). So the sound content of the collection is not truly “open” – users outside UCLA are limited to a 90-second snippet. We’d love to provide more, and maybe we will some day.
More statistics about the metadata – the concept of “unique names” is a bit problematic for us, as I’ll discuss soon.
The music is actually quite a range of genres, from all parts of Mexico and the southern US. Most recognizable genres would be mariachi ensembles and two-person ballads with voice and guitar. The subject matter of the songs is quite varied as well, but fortunately we have metadata about this, too. Not lyrics in most cases, but tags.
Given this impressive number of song tags, I saw an opportunity to create a visual overview of the human-assigned descriptors of the songs. Maybe you've seen “genre maps” like this for Last.fm or Spotify. Those usually take a few days on a high-performance computing cluster running deep-learning self-organizing-map algorithms. All I had was my laptop, so I just used Gephi: if two tags were applied to the same song, I drew an edge between them, then generated a network layout.
Note that there are two types of nodes: song genres (mostly in Spanish) and tags (in English). It's quite revealing to see how they are positioned relative to each other, e.g., corrido.
A little more historical background on the collection: it was begun by Chris Strachwitz, a descendant of aristocracy in what is now Poland, who came to the US as a refugee after WWII. He got deeply into American popular music, moved to California, attended Pomona and then Berkeley, and became an avid record collector (the more obscure the better). He was friends with Les Blank, had a radio show on KPFK, founded Arhoolie Records in 1960, and began collecting rare 78s of Mexican and Mexican American music around that time.
Here's Arhoolie Records and its retail outlet, the Down Home Record Store, on San Pablo Ave in El Cerrito, just north of Berkeley.
The Arhoolie Foundation's website was blocked by the web filter at the UCLA Conference Center for being an “advocacy organization.” Maybe Donald Trump is in charge of the firewall?
There have now been several iterations of the Frontera collection site; this is the most recent, built in Drupal using the multi-lingual interface module.
So far we've largely avoided the lack of multilingual support in institutional repositories by not using one: the metadata lives only in Drupal, and the music content is stored on an Isilon file system connected to a streaming server.
The collection has received quite a bit of support from various sources over the past 15 years, for digitization and access. This has resulted in a large collection but with its share of growing pains, as I’ll discuss in a minute.
At least one book has already been written using the collection as a primary source (based primarily on the first digitized 45s and early 78s).
It has also been described as a multimedia encyclopedia, since it’s not just music – it’s images and official metadata, plus user-submitted metadata including lyrics. This is the part we’d really like to push now – augmenting participation in the collection by scholars and enthusiasts, and improving the visibility of the materials on the wider web. It turns out these goals are interrelated, and they might also help to address some of the metadata issues that have developed over the past decade as more and more content has been layered onto the site, often without the benefit of very much advance planning.
These are some of the challenges we must overcome to integrate the collection into the wider web; conversely, the integration process will actually help us resolve some of these issues.
Song, album, release: for 78s they’re all the same thing, more or less, but for LPs and cassette tapes, the paradigm breaks down.
These are the goals for our “linked data outreach” project. This is something that we’ve wanted to do with the Frontera collection for a long time, but haven’t had the time to consider until now.
Some of the linked open data repositories we’d like the Frontera records to link out to.
National Library of Wales is doing some interesting work that involves uploading records to Wikidata. This could work as well for music data, but if possible we’d prefer to work with MusicBrainz, which seeks to be the central source of song, artist, album, etc. authority IDs on the web.
Dbpedia is another model
Discogs is semi-commercial but has a pretty active and committed user base, largely of electronic-music DJs, that is potentially larger than MusicBrainz's (note that MusicBrainz is technically a nonprofit, despite providing most of the Google Knowledge Graph data for music and musicians).
Projects that have already prototyped LOD integration for music-related archives include Linked Jazz. We can use the NYPL Labs' crowdsourcing interfaces to add community-submitted data to Frontera and to verify inter-knowledge-base links.
This is a demo of one of the NYPL Labs’ very addictive crowdsourcing interfaces; this one involves selecting the proper DBpedia entity (Wikipedia article) match for a name.
It’s usually the one with a picture.
Note: package solutions like MusicBrainz Picard (basically a non-commercial iTunes on steroids) are great, but probably don’t scale to 100k nodes.
Audio fingerprints as the best authority source for music. Mention Shazam, maybe Hatto case?
More experimental/challenging (this is harder)
This is a discussion I started a few weeks ago on the MusicBrainz forums about what software would be necessary, and what the policy implications would be, of bulk-uploading the majority non-overlapping records from a cultural heritage data set like Frontera to MusicBrainz.
Posted to the forum discussion: what happens when upload bots are allowed. This is not necessarily a bad thing, though in this case a Japanese EDM bot was submitting a lot of bad data, so it wasn't a net win.
MusicBrainz has decided it has no choice but to trust Discogs. So if an entity exists in Discogs, there are scripting tools to fashion a new MusicBrainz entity to submit.
Note: the fact that so few (usually at most 30%) of the entities in the Frontera data set are matched in any of these repositories is potentially a good outcome; it means that most of the entities in Frontera are not attested anywhere else. Devising a workflow that auto-generates LOD records for these new entities and uploads them to an archive like MusicBrainz will play a huge role in increasing the visibility of musicians who would otherwise eventually be forgotten.