Karen Cariani discusses using computational tools and crowdsourcing games to increase metadata for a digital archive of 72,000 television and radio programs with incomplete records. A game called "Fix It" on the American Archive of Public Broadcasting website lets users correct errors in machine-generated transcripts, producing accurate text that can be searched, used to caption videos, and used for other research. Once transcripts are corrected through the game, they are stored on the archive's servers, indexed for search, and made available alongside the media.
Using Computational Tools and Crowdsourcing Games to Increase Metadata and Discoverability of Digital Collections
2. Karen Cariani
AAPB Project Director, WGBH
Senior Director, WGBH Media Library &
Archives
Using computational tools and
crowdsourcing games to increase
metadata and discoverability of digital
collections
8. the situation
■72,000 digitized television and radio programs
■incomplete, inaccurate metadata records
■limited staff resources
■we need to know what we have in the
collection
■we have a responsibility to users to provide
access to the collection
■continued growth of the collection (content
9. potential: transforming content
into data
Computational Tools
Speech-to-text
Audio analysis
Image Analysis
Visualization of Data
How can we use them?
19. once corrected…
• JSON transcripts will be stored on AAPB’s Amazon S3 account
• Transcripts will be indexed for keyword searching on the AAPB website
• Transcripts will be made available alongside the media on the record
page
• Transcripts can play as captions within the player
• Transcripts can be harvested via an API and used as a dataset for
research such as a digital humanities project
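The keyword-indexing step on this slide can be pictured with a small inverted index. This is only a sketch of the general technique, not the AAPB's actual search stack, which the slides do not describe; the record IDs and sample transcript text are made up for illustration.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9']+")

def build_index(transcripts):
    """Map each lowercase word to the set of record IDs whose transcript contains it."""
    index = defaultdict(set)
    for record_id, text in transcripts.items():
        for word in WORD.findall(text.lower()):
            index[word].add(record_id)
    return index

def search(index, query):
    """Return the record IDs whose transcripts contain every word in the query."""
    words = WORD.findall(query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

if __name__ == "__main__":
    # Hypothetical record IDs and transcript text.
    corrected = {
        "rec-1": "Tonight on NOVA, the science of volcanoes",
        "rec-2": "Local news tonight",
    }
    idx = build_index(corrected)
    print(search(idx, "tonight"))
    print(search(idx, "nova volcanoes"))
```

The point of the sketch is that every corrected word becomes a way into the record, which is exactly what the sparse catalogue metadata cannot offer.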
And we are talking about using computational tools and crowdsourcing games to increase metadata and discoverability of digital collections.
We are WGBH, Pop-Up Archive and the University of Texas at Austin School of Information. I am going to let Anne and Tanya introduce themselves and their organizations. We are discussing a project generously funded by IMLS. I am Karen Cariani, Senior Director of the WGBH Media Library and Archives, and project director for the American Archive of Public Broadcasting.
I am going to give a quick introduction on WGBH and the American Archive. WGBH, as many of you know, is the premier public broadcasting station in Boston, producer of many core PBS programs such as NOVA, Frontline, American Experience, Antiques Roadshow, Masterpiece Theater. We are not only a TV producer and broadcaster, but we manage 2 radio stations in the Boston area and 1 serving the Cape and the Islands. And we oversee a TV station in western Massachusetts.
The American Archive is a collaboration between the Library of Congress and WGBH with a goal to preserve and make accessible significant public radio and television programs before they are lost to posterity. The American Archive is a digital archive with a website, americanarchive.org, the homepage of which you see here. Users anywhere in the U.S. can access a wide range of historical public television and radio programs from the late 1940s to the present. Our primary objective is to preserve public media and assure discoverability and access through a coordinated national effort. In doing this, we support content creators and current stewards of the materials, and facilitate the use of historical public broadcasting by researchers, educators, students, and others.
As an aggregator of content, AAPB hopes to provide a centralized web portal of discovery for public media materials. The collection is growing with new additions. Access is for research, educational, and informational purposes only. Due to rights restrictions, only a portion (about 20,000 items) is available through our On-line Reading Room anywhere in the US. These items are also soon to be harvested by Digital Commonwealth and eventually available through the DPLA. Inclusion in the ORR is determined by analysis of types of programs and examination of individual series and programs; more is added as we have time to assess the materials. However, the entire collection of over 72,000 items is available for viewing on location at the Library of Congress and WGBH.
As part of the initial project funded by CPB, the AAPB has 72,000 digitized TV and radio programs from about 100 stations across the country. Along with these digital files we received incomplete metadata records with very little descriptive data about the content or the program. We have limited staff resources to fully catalog the 72,000 items. We figured it would take a full-time person about 32 years to watch everything, spending only 15 minutes per item cataloguing, to complete the collection, all while we are adding up to 25,000 items annually. So you can do the math and figure out that even if we could afford a team of 10 people to just catalogue full time (and that is over half of my current staff), it would still take a long time and we would barely keep up with cataloguing the new acquisitions. However, we need to know what we have (it helps us determine rights and what we can make accessible), and we need to be able to make it findable for users. To do that, currently, we need to be able to expose text for search engines and indexers.
So how do you transform large amounts of audio and video into something searchable for search engines and indexers? How can we transform it into a dataset?
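To make "do the math" concrete, here is a rough back-of-the-envelope sketch. It uses only the figures quoted above and makes two simplifying assumptions that are not in the talk: that the 32-year estimate implies a constant per-person throughput, and that intake runs at the full 25,000 items every year.

```python
# Figures quoted in the talk.
BACKLOG_ITEMS = 72_000
PERSON_YEARS_FOR_BACKLOG = 32   # one full-time cataloguer
ANNUAL_INTAKE = 25_000          # upper bound ("up to 25,000")

def items_per_person_year(backlog=BACKLOG_ITEMS,
                          person_years=PERSON_YEARS_FOR_BACKLOG):
    """Throughput implied by the 32-year estimate (assumed constant)."""
    return backlog / person_years

def years_to_clear(team_size, annual_intake, backlog=BACKLOG_ITEMS):
    """Years until the backlog reaches zero, or None if it never shrinks."""
    net_per_year = team_size * items_per_person_year() - annual_intake
    if net_per_year <= 0:
        return None
    return backlog / net_per_year

if __name__ == "__main__":
    print(items_per_person_year())             # items one person catalogues per year
    print(years_to_clear(10, 0))               # team of 10, no new acquisitions
    print(years_to_clear(10, ANNUAL_INTAKE))   # team of 10, full intake
```

Under these assumptions one person handles 2,250 items a year, so a team of 10 handles 22,500. That is less than the 25,000 that may arrive annually, which is why the last call returns `None`: at full intake the backlog would grow rather than shrink.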
We thought this was a great opportunity for collaboration with computational tools and the computer science field, but we need to understand each other’s work and the capabilities of what exists. Here are some of the tools available that can help us with our dilemma. With this IMLS-funded project we are working with Pop-Up Archive to create speech-to-text transcripts of the entire collection, and with the University of Texas at Austin to analyze the audio to help further identify speakers and sounds. And we will use a crowdsourcing game to help correct or fix the computer-generated transcripts, which will hopefully help further train the tools to improve.
We will not talk about image analysis.
Experience has shown that most speech-to-text tools don’t output clean transcripts. Accurate transcripts are dependent on audio quality, speaker accents, background noise, etc. Given that our collection is from 100 different local TV and radio stations across the country, the audio and its quality vary widely. Some programs are in Spanish, some are musical performances, and nearly all video recordings begin with standard bars and tone. The speech-to-text tool tries to interpret these sounds as text, and it makes a number of other mistakes too. WGBH has created a web-based game to allow the public to help us fix and correct these transcripts.
You are welcome to follow along with me if you have a computer as I walk through the game, and encouraged to play afterwards.
The game has terms of use that we need players to check off to make sure they understand that they cannot use the content for anything but helping us correct the transcripts. We’ve kept the clips to only 5 minutes in order to be able to take advantage of fair use.
There are 3 games you can play – identify errors, suggest fixes, and validate fixes. You gain points for each action taken.
You can set preferences on the type of content you would like to interact with. Or you can pick which station’s content you would like to work on. We are hoping to perhaps get stations to compete with each other by getting their station volunteers and community to play against each other for more points. But we need to do a bit more development for that.
Each iteration of a game lasts 5 minutes. But you can play multiple times for any length of time. Three lines of the transcript are active at once. You listen to the audio, see the line highlighted and click on it if there is a mistake. There are instructions and guides on what is considered an error and how to mark it. It takes a little bit to figure it out, but after a few times you can pick it up pretty quickly.
In game 2, you correct things that have been tagged as an error, or mark them as not an error.
And in game 3, you validate the corrections that have been made. You are given a choice of suggested fixes and pick the correct one.
The game board keeps track of points and players, and highlights top scorers. Studies have shown that people play these games for personal satisfaction, and competition doesn’t necessarily increase the desire to play. We hope people will be driven just by the personal satisfaction of getting points and helping us out, as opposed to competing against anyone in particular.
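The three games form a small pipeline: flag, fix, validate. The sketch below shows one way such a pipeline could be modeled. It is not the game's actual implementation; the class, the function names, and the two-vote acceptance threshold are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class TranscriptLine:
    text: str                                         # machine-generated text
    flagged: bool = False                             # game 1: marked as an error
    suggestions: list = field(default_factory=list)   # game 2: proposed fixes
    votes: Counter = field(default_factory=Counter)   # game 3: votes per fix

def flag_error(line):
    """Game 1: a player clicks a line that contains a mistake."""
    line.flagged = True

def suggest_fix(line, fix):
    """Game 2: a player proposes corrected text for a flagged line."""
    if line.flagged and fix not in line.suggestions:
        line.suggestions.append(fix)

def mark_not_an_error(line):
    """Game 2: a player decides the flagged line was fine after all."""
    line.flagged = False
    line.suggestions.clear()
    line.votes.clear()

def vote(line, fix):
    """Game 3: a player picks the suggestion they believe is correct."""
    if fix in line.suggestions:
        line.votes[fix] += 1

def accepted_text(line, min_votes=2):
    """Return the winning fix once it has min_votes (assumed threshold), else the original."""
    if line.votes:
        fix, count = line.votes.most_common(1)[0]
        if count >= min_votes:
            return fix
    return line.text

if __name__ == "__main__":
    line = TranscriptLine("nova is a signs program")
    flag_error(line)
    suggest_fix(line, "NOVA is a science program")
    vote(line, "NOVA is a science program")
    vote(line, "NOVA is a science program")
    print(accepted_text(line))
```

Requiring agreement from more than one player before a fix is accepted is the design choice that lets an open, anonymous crowd produce text that can be trusted.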
Once the transcripts have been verified, the JSON transcripts will be stored in the AAPB’s Amazon S3 account and indexed for keyword searching on the AAPB website. The transcripts will be made available alongside the media on the record page. They can also be played like captions within the video player. And they will be able to be harvested via an API to be used as a dataset for research. We are hoping that researchers will begin to look at the collection as a dataset and start trying to see trends in programming over the last 60 years, particularly across news programs.
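As an illustration of how a timed JSON transcript can double as captions, here is a minimal sketch that converts one to WebVTT, a caption format web video players commonly accept. The JSON layout (a `parts` list with `start_time`, `end_time`, and `text`) is an assumption for the example; the talk does not specify the AAPB's actual schema or caption format.

```python
import json

def vtt_timestamp(seconds):
    """Format a time in seconds as a WebVTT timestamp, HH:MM:SS.mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"

def transcript_to_webvtt(transcript_json):
    """Turn a JSON transcript (assumed layout) into WebVTT caption text."""
    data = json.loads(transcript_json)
    lines = ["WEBVTT", ""]
    for part in data["parts"]:
        start = vtt_timestamp(part["start_time"])
        end = vtt_timestamp(part["end_time"])
        lines.append(f"{start} --> {end}")
        lines.append(part["text"])
        lines.append("")   # a blank line ends each cue
    return "\n".join(lines)

if __name__ == "__main__":
    # Hypothetical transcript; the field names are assumptions.
    sample = json.dumps({
        "parts": [
            {"start_time": 0.0, "end_time": 4.2, "text": "Good evening."},
            {"start_time": 4.2, "end_time": 9.0, "text": "Here is the news."},
        ]
    })
    print(transcript_to_webvtt(sample))
```

The same stored file can therefore serve search, on-page display, captions, and API harvesting without being re-keyed for each use.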
Be sure to play and tell all your friends about it.