Archives, algorithms and people


How we put the BBC World Service radio archive online using machines and crowdsourcing. A talk given to the UK Museums on the Web conference, November 2013.

One of the major risks of a big digitisation project is that you simply swap out an under-used physical archive for its digital equivalent. Without easy ways to navigate the data there's no way for your users to get to the bits they want. We recently worked with the BBC World Service to generate metadata for their radio archive, 50,000 programmes from over 45 years. We first used algorithms to generate "good enough" topics to put the archive online, and then used crowdsourcing to improve the data.

Throughout 2013 we have been running this experiment to crowdsource improvements to the metadata that we automatically created. People can search and browse for programmes, listen to them, and correct or add topics.

This talk describes how we went about this and what we've learnt with this massive online multimedia archive - about understanding audio, automatically generating topics and crowdsourcing improvements to the data.

Published in: Technology, Education
  • I'm Tristan, from BBC Research & Development. I’m not sure I should be here, I’m not from a museum
    We do R&D for the BBC and the media industry, which amongst other things includes some work with the BBC archive
    I’m going to talk about an archive, algorithms and people
    Our team had a challenge - a big radio archive, sparsely described, to put online
    We’re an R&D department, we like challenges, and we even had some solutions looking for problems!
    The BBC typically puts archives online by editorially curating them, so only a subset is exposed. We don't dump huge collections online. But we thought we would
    Our aim was to put *all of it* online as efficiently as possible
    So we built a prototype and over the past year we've run an experiment
  • We had an opportunity to work with the archive of the World Service English language radio service
    They'd digitised their archive as they were having to move out of their historic home at Bush House, into the new Broadcasting House in central London
    The archive contained about 70k radio programmes from over 45 years. It isn't everything: there are no live news bulletins, as they weren't recorded, and it covers just the English language service
  • The graph shows how the archive is distributed in time. The spike starts in the 90s, when we started to use digital technologies to record things and stopped recording over old tapes
  • The digitisation process created very high quality digital audio of all the programmes. But artefacts of that process (and indeed earlier archiving) meant that the metadata describing the programmes was sparse - often missing fields or having incorrect data.
    And if there's no data describing the programmes then no-one will be able to find them. It’s a danger with a big digitisation project - that you simply swap out an under-used physical archive for its digital equivalent.
    Without easy ways to navigate the archive there's no way for your users to get to the bits they want. And to navigate the archive you need data
  • So that was our challenge
    We wanted to demonstrate how to create the data needed to put a massive media archive online using algorithms, linked data and crowdsourcing. And this is how we did it
  • We needed to generate data primarily from what we did have - the digital audio from the radio programmes
    We used CMU Sphinx, an open-source speech recognition toolkit, to listen to every programme and convert it to text
  • Speech recognition can be very good, particularly when trained on a single speaker. But on these radio programmes, with varying recording qualities, many speakers and accents from around the world, it really struggled for accuracy, and we ended up with lots of pretty noisy transcripts. But we didn't need accurate transcripts, just some good metadata
  • Our team developed algorithms that could reliably extract tags or keywords from these noisy transcripts
    We use Linked Data to provide unique tags (e.g. to disambiguate Paris, France from Paris Texas), to help this topic extraction, to relate tags to one another, and ultimately to link to elsewhere on the web
    We actually use dbpedia, a data version of wikipedia, as our reference. So every tag in our archive is linked to a wikipedia page
    For every programme, even if there was no metadata to start with, we generated 10-20 tags
    We had a lot of data to process: about 26k hours of audio. Doing this audio transcription and topic extraction would have taken 36k hours on a single computer. But using the cloud we could do this all in parallel, and we processed it all in 2 weeks at a cost of around $3k
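To make the Paris example concrete, here is a minimal sketch of how context words in a noisy transcript could pick between candidate linked-data tags. This is not the team's actual algorithm: the `CANDIDATES` table, its context words and the scoring rule are all assumptions made for illustration.

```python
import re
from collections import Counter
from typing import List, Optional

# Hypothetical mini knowledge base: each ambiguous term maps to candidate
# DBpedia-style identifiers, each with context words we might expect nearby.
CANDIDATES = {
    "paris": {
        "Paris": {"france", "french", "seine", "eiffel"},
        "Paris,_Texas": {"texas", "dallas", "county", "usa"},
    },
}


def tokenize(transcript: str) -> List[str]:
    """Lower-case the transcript and split it into alphabetic words."""
    return re.findall(r"[a-z]+", transcript.lower())


def disambiguate(term: str, transcript: str) -> Optional[str]:
    """Return the candidate whose context words occur most often in the transcript.

    Returns None when the term is not in the (hypothetical) knowledge base.
    Ties go to the first candidate listed.
    """
    options = CANDIDATES.get(term.lower())
    if not options:
        return None
    counts = Counter(tokenize(transcript))
    scores = {
        tag: sum(counts[word] for word in context)
        for tag, context in options.items()
    }
    return max(scores, key=scores.get)
```

Because the score only counts supporting words, a few misrecognised words in the transcript do not matter much as long as some of the context survives, which is the property the talk relies on.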
  • The automatically created tags weren't always correct and we couldn't go through them all to check
    Our hypothesis was that they were good enough to bootstrap an online archive for people to use and listen
    And then we could ask those people to help correct and add to these tags - to crowdsource the problem
    This is the prototype we built featuring the archive. You can search and browse for programmes, listen to them, correct the topics and tag them with new topics
    The homepage shows some featured programmes, or you can search – Iceland
    Filter by decade
    Listen to the programme (extract plays)
    Worth noting that the original metadata wasn’t created for public consumption so it can be pretty perfunctory
    These are the automatic topic tags
    you can vote them up if they're correct, or down if they're wrong
    And you can add a new tag, corresponding to a wikipedia page
    It's registration only, but easy to sign up at the URL shown
  • Homepage
  • A note about images: we didn’t have any to start with. But people expect images on the web, so we use the tags to find images from Ookaboo, a repository of CC images, and users can choose alternate ones
  • Users can also directly edit the programme title and synopsis to correct spelling mistakes. It uses a wiki-model to track changes and a small admin interface for us to "clear" them
  • Here you can see the list of tags for a programme, some automatically generated, some with user votes, and some added by users
    Rather than the wiki-model we use a voting model here, more like reddit
    As mentioned, every tag has to be a wikipedia concept
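A reddit-style voting model for tags can be sketched as below. The `Tag` fields, the net-score rule and the `hide_below` threshold are assumptions for illustration; the talk does not say how the prototype scores or hides tags.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Tag:
    name: str      # wikipedia concept the tag points at
    source: str    # "machine" or "user" (hypothetical field)
    up: int = 0    # votes saying the tag is correct
    down: int = 0  # votes saying the tag is wrong

    @property
    def score(self) -> int:
        """Net score: up-votes minus down-votes."""
        return self.up - self.down


def rank_tags(tags: List[Tag], hide_below: int = -2) -> List[Tag]:
    """Drop tags whose net score is below the threshold, then sort best first."""
    visible = [tag for tag in tags if tag.score >= hide_below]
    return sorted(visible, key=lambda tag: tag.score, reverse=True)
```

Unlike the wiki-model used for synopsis edits, no single vote overwrites another: each vote only nudges the ordering.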
  • One last bit of the prototype, this is very cool, we are also doing speaker identification and segmentation automatically on some programmes
    We can recognise distinct voices within, and across, programmes. The only thing we can't do is identify who it is that is speaking
    This shows a magazine programme (From Our Own Correspondent) that the algorithm has divided into the different contributors; users can then label those voices.
    And these names then propagate across the archive to wherever else that voice was heard.
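The propagation step can be sketched as a lookup from voice to name. This assumes the speaker identification assigns the same `voice_id` to a voice wherever it is heard; the data shapes and names here are hypothetical, not the prototype's actual data model.

```python
from typing import Dict, List, Tuple

# (programme_id, start_seconds, voice_id) - voice_id is a hypothetical
# identifier shared by every segment the algorithm attributes to one voice.
Segment = Tuple[str, float, str]


def propagate_labels(
    segments: List[Segment], labels: Dict[str, str]
) -> List[Tuple[str, float, str]]:
    """Attach a user-supplied name to every segment spoken by a labelled voice.

    Voices nobody has labelled yet are reported as "Unknown speaker".
    """
    return [
        (programme, start, labels.get(voice, "Unknown speaker"))
        for programme, start, voice in segments
    ]
```

A single label entered on one programme therefore names that voice in every other programme where the same `voice_id` appears.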
  • To recap
    Started with a media archive with little metadata
    Process it with machines in the cloud
    Use that data to create an online experience
    Get people to use it and improve the data
    ...and we want to feed these improvements back to the machines to help them learn
    For example, we can look for tags that are often voted down and then look for patterns in them.
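As a rough sketch of that last idea, the vote log can be scanned for tags that are rejected much more often than they are accepted. The input shape and the `min_votes` and `max_approval` thresholds are assumptions, not values from the project.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def often_rejected(
    votes: Iterable[Tuple[str, bool]],
    min_votes: int = 5,
    max_approval: float = 0.25,
) -> List[Tuple[str, float]]:
    """Return (tag, approval ratio) for tags that are usually voted down.

    votes is an iterable of (tag name, True for an up-vote / False for a
    down-vote). Tags with fewer than min_votes votes are ignored as too
    noisy. Results are sorted with the most rejected tags first.
    """
    ups = defaultdict(int)
    totals = defaultdict(int)
    for tag, is_up in votes:
        totals[tag] += 1
        ups[tag] += int(is_up)

    flagged = []
    for tag, total in totals.items():
        approval = ups[tag] / total
        if total >= min_votes and approval <= max_approval:
            flagged.append((tag, approval))
    return sorted(flagged, key=lambda item: item[1])
```

The flagged tags are only a starting point: the patterns in them (for example, words the speech recognition regularly mishears) still have to be found by inspection or fed into retraining.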
  • It's been running for about a year, fairly low key until recently. It's an experiment, not part of the BBC's "main" site so we don't get massive traffic
  • 70,000 programmes in the archive: 36k listenable programmes, 34k unlistenable (either because of rights issues or because the actual audio is missing but a record was created for a programme)
    1 million machine generated tags
    Currently 3000 registered users
    71,000 edits (some kind of action from a user - either votes on tags, speaker ID, synopsis edits, image votes)
    70,000 tag edits (57k tag votes, 13k new tags)
    1000 synopsis edits
    21% of listenable programmes have an edit, at 9 edits/prog.
    36% of listenable progs have been listened to at least once (30k total listens)
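These figures can be cross-checked against each other with some quick arithmetic. All inputs below are the rounded numbers quoted above, so the results are only approximate.

```python
LISTENABLE = 36_000      # listenable programmes
TOTAL_EDITS = 71_000     # all user edits
TOTAL_LISTENS = 30_000   # total listens

# 21% of listenable programmes have at least one edit.
edited_programmes = LISTENABLE * 21 // 100
# At about 9 edits per edited programme, how many edits does that imply?
implied_edits = edited_programmes * 9

# 36% of listenable programmes have been listened to at least once.
listened_programmes = LISTENABLE * 36 // 100
listens_per_programme = TOTAL_LISTENS / listened_programmes

print(edited_programmes)                 # 7560
print(implied_edits)                     # 68040
print(listened_programmes)               # 12960
print(round(listens_per_programme, 1))   # 2.3
```

The implied 68,040 edits is close to the 71,000 reported, so the percentages and the totals are roughly consistent, and each programme that has been played at all has been played a little over twice on average.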
  • listeners even sent in 4 "lost" programmes that they had recorded off-air
  • Machine-generated tag quality looks OK
    Human-edited tags are good, I'd almost never disagree with them
    But this is hard to answer objectively. When is a tag correct? In whose opinion? Even harder - when does a programme have a “complete” set of tags?
    It's a large and sparse space of data
    Currently doing some analysis, doesn’t seem to be much prior work on the quality of crowdsourced tags, please shout if you know of any
    Also more work to do to analyse what kind of tags are added
    Also interested in whether people listen to the programme before tagging? A bit different to looking at a painting or photograph.
    Surprising amount of synopsis editing - spelling, adding comments, adding presenters, one person particularly likes adding episode numbers!
  • As it turns out, only a few!
    1 person (king of radio drama community) has astonishingly done 30% of the edits
    10 people have done 70% of the edits
    Other crowdsourcing studies have shown that typically 10% of users do the majority of the work
    10% of our users have done 98% of the work
    The internet has a 1% rule - 1% create, 9% modify, 90% just view
    About half of our users have done at least one edit
    But that doesn't really tell you how much they’ve done, or how long they stick around
    “Active users” - term borrowed from wikipedia - someone who has done some edit action in the last 30 days
    Active users currently around 2%
    We've noticed particular groups of people using the prototype. It started with a large community of radio drama enthusiasts who were cataloguing all the drama and plays. And more recently Frank Zappa fans found some interviews
    So do you only need 100 people to do this? Without the archive of programmes and the prototype drawing people in we wouldn't have found the "right" 100 who care enough to help
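Both measures used above, the share of edits done by the most prolific users and the fraction of "active users", can be computed from a simple edit log. This is a sketch under assumptions: the log format, function names and parameters are hypothetical, and only the 30-day window comes from the definition quoted above.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import List, Tuple

EditLog = List[Tuple[str, datetime]]  # (user, time of edit) - hypothetical shape


def top_share(edit_log: EditLog, top_n: int) -> float:
    """Fraction of all edits made by the top_n most prolific users."""
    per_user = Counter(user for user, _ in edit_log)
    total = sum(per_user.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in per_user.most_common(top_n))
    return top / total


def active_fraction(
    edit_log: EditLog, registered_users: int, now: datetime, window_days: int = 30
) -> float:
    """Fraction of registered users with at least one edit in the last window_days."""
    if registered_users == 0:
        return 0.0
    cutoff = now - timedelta(days=window_days)
    active = {user for user, when in edit_log if when >= cutoff}
    return len(active) / registered_users
```

For example, `top_share(log, 10)` would give the "10 people have done 70% of the edits" style of figure, and `active_fraction` the "around 2%" style of figure.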
  • Some pretty pictures we drew using the data, giving an idea of some of the archive contents and activity
  • User tags, clustered by the programme they're attached to
  • Users, clustered by what they listened to - "The Lobster"
  • As we’ve got Wikipedia-mapped tags we can look for programmes about places...
  • Links from current news events being talked about on the BBC News channel back to programmes in the archive
  • My favourite of the things we found. From 1957 - The last broadcast from the BBC Danish Service
    “Entirely in Danish”
  • Quality of original metadata was mixed
    We've significantly improved it with algorithms and crowdsourcing, adding semantic topics to the programmes
    Couldn’t have created a decent online archive otherwise, we just didn’t have the data
    Also efficient: the initial research & development cost was less than our estimated cost of professionals tagging everything
    And this tech cost is a one-off and re-usable. It obviously becomes cheaper the more times we use the tech, and it can be used for any media with people speaking
    Crowdsourcing: we're a bit stuck in the middle of different crowdsourcing approaches
    If you know Galaxy Zoo and its projects - these are generally designed to be task focused with particular targets. This wasn't designed like that
    It was more of a browsable archive with crowdsourcing features (maybe closer to wikipedia)
    We don't know what's right, but we've managed to create quite a lot of data, would be interesting to compare approaches
    And we've tried both wikipedia "last edit wins" approach and reddit voting approach
  • Some things we still don't know: How good are the tags? Like I said, it's difficult to measure objectively
    How much volunteer effort do you need? It depends. How big is the archive? How much data do you need per item? How good does the data need to be?
    Ultimately, when is your data good enough?
  • Register for the prototype
    Read more on our website and blog
    A number of components of the system have been open-sourced on github
    In this project we did a lot of work to manage the processing of the audio. We found this so useful that we're turning it into a generic platform, called Comma, for anyone to analyse media and for computer scientists to run their analysis algorithms.
  • Archives, algorithms and people

    1. Archives, algorithms and people, or How we put the BBC World Service radio archive online using machines and crowdsourcing. Tristan Ferne / @tristanf, Executive Producer, BBC Research & Development
    2. The BBC World Service archive
    3. 1947-2012
    4. The missing metadata: Missing data; Spelling mistake; Sometimes incorrect data; No semantic data
    5. How it works
    6. Listening machines
    7. Noisy transcripts
    8. Algorithms
    9. Algorithms and people
    10. The prototype
    11.
    12. Show Synopsis editing version
    13.
    14. Machine learning
    15. Results
    16. How much data? 70,000 programmes; 36,000 listenable programmes; 1m machine tags; 3,000 users; 71,000 edits; 21% of programmes tagged; 36% of programmes listened to; 70,000 tag edits; 1,000 synopsis edits
    17. And four lost programmes
    18. How good is the data? Tags are a large and sparse space. When is a tag correct? When is a programme tagged completely? How do you measure crowd-sourced data?
    19. Who does the work? 10% of people = 98% of edits; 10 people = 70% of edits; 1 person = 30% of edits
    20. The shape of the archive
    21. Places mentioned
    22. Linking from the News
    23. The Last Danish Christmas Broadcast: “Entirely in Danish”
    24. What we’ve learnt: We can significantly improve the data; It’s cost-effective with re-usable technology; A crowdsourcing approach
    25. Open questions: How good are the machine tags? How much crowdsourcing do you need? When is your data good enough?
    26. @tristanf