Using Computational Tools and Crowdsourcing Games to Increase Metadata and Discoverability of Digital Collections

Karen Cariani discusses using computational tools and crowdsourcing games to increase metadata for a digital archive of 72,000 television and radio programs with incomplete records. A game called "Fix It" on the American Archive of Public Broadcasting website lets users correct errors in computer-generated transcripts. The corrected transcripts provide accurate text that can be searched, played as captions on the videos, and reused for other research. Once corrected through the game, transcripts are stored in the archive's Amazon S3 account and made available to search and access alongside the media.

Karen Cariani
AAPB Project Director, WGBH
Senior Director, WGBH Media Library &
Archives
Using computational tools and
crowdsourcing games to increase
metadata and discoverability of digital
collections

the situation
■72,000 digitized television and radio programs
■incomplete, inaccurate metadata records
■limited staff resources
■we need to know what we have in the
collection
■we have a responsibility to users to provide
access to the collection
■continued growth of the collection (content

potential: transforming content
into data
Computational Tools
Speech-to-text
Audio analysis
Image Analysis
Visualization of Data
How can we use them?

a crowdsourcing game
http://fixit.americanarchive.org

once corrected…
• JSON transcripts will be stored on AAPB’s Amazon S3 account
• Transcripts will be indexed for keyword searching on the AAPB website
• Transcripts will be made available alongside the media on the record
page
• Transcripts can play as captions within the player
• Transcripts can be harvested via an API and used as a dataset for
research such as a digital humanities project

facebook.com/amarchivepub
@amarchivepub
americanarchive.org
http://fixit.americanarchive.or
#FixItAAPB


3. Or… “Can the Computer Do the Work?”


Editor's Notes

And we are talking about… using computational tools and crowdsourcing games to increase metadata and discoverability of digital collections.
We are WGBH, Pop-Up Archive and the University of Texas at Austin School of Information. I am going to let Anne and Tanya introduce themselves and their organizations. We are discussing a project generously funded by IMLS. I am Karen Cariani, Senior Director of the WGBH Media Library and Archives, and project director for the American Archive of Public Broadcasting.
I am going to give a quick introduction to WGBH and the American Archive. WGBH, as many of you know, is the premier public broadcasting station in Boston, producer of many core PBS programs such as NOVA, Frontline, American Experience, Antiques Roadshow, Masterpiece Theater. We are not only a TV producer and broadcaster, but we also manage 2 radio stations in the Boston area and 1 serving the Cape and the Islands. And we oversee a TV station in western Massachusetts.
The American Archive is a collaboration between the Library of Congress and WGBH with a goal to preserve and make accessible significant public radio and television programs before they are lost to posterity. The American Archive is a digital archive with a website, americanarchive.org, the homepage of which you see here. Users anywhere in the U.S. can access a wide range of historical public television and radio programs from the late 1940s to the present. Our primary objective is to preserve public media and assure discoverability and access through a coordinated national effort. In doing this, we support content creators and current stewards of the materials, and facilitate the use of historical public broadcasting by researchers, educators, students, and others.
As an aggregator of content, AAPB hopes to provide a centralized web portal of discovery for public media materials. The collection is growing with new additions. Access is for research, educational, and informational purposes only. Due to rights restrictions, only a portion (about 20,000 items) is available through our On-line Reading Room (ORR) anywhere in the US. These items are also soon to be harvested by Digital Commonwealth and eventually available through the DPLA. Inclusion in the ORR is determined by analysis of types of programs and examination of individual series and programs; more is added as we have time to assess the materials. However, the entire collection of over 72,000 items is available for viewing on location at the Library of Congress and WGBH.
As part of the initial project funded by CPB, the AAPB has 72,000 digitized TV and radio programs from about 100 stations across the country. Along with these digital files we received incomplete metadata records with very little descriptive data about the content or the program. We have limited staff resources to fully catalog the 72,000 items. We figured it would take a full-time person about 32 years to watch everything, spending only 15 minutes per item on cataloguing, to complete the collection, all while we are adding up to 25,000 items annually. So you can do the math and figure out that even if we could afford a team of 10 people to just catalogue full time (and that is over ½ of my current staff), it would still take a long time and we would barely catch up cataloguing the new acquisitions. However, we need to know what we have (it helps us determine rights and what we can make accessible), and we need to be able to make it findable for users. To do that, currently, we need to be able to expose text for search engines and indexers. So how do you transform large amounts of audio and video into something searchable for search engines and indexers? How can we transform it into a dataset?
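The "do the math" step can be sketched in a few lines of Python. The 72,000-item backlog and the intake of up to 25,000 items a year come from the talk; the one hour of watching plus cataloguing per item and the 2,000 working hours per year are assumptions chosen only for illustration. With those assumed values a single cataloguer would need 36 years for the existing backlog alone, the same order of magnitude as the 32 years quoted above.

```python
# Figures from the talk.
BACKLOG_ITEMS = 72_000
NEW_ITEMS_PER_YEAR = 25_000  # "up to 25,000", so this is the worst case

# Assumptions for illustration only (not from the talk).
HOURS_PER_ITEM = 1.0         # watching plus cataloguing time per item
WORK_HOURS_PER_YEAR = 2_000  # one full-time cataloguer


def years_to_clear(staff: int) -> float:
    """Years until the backlog is cleared, or infinity if intake outpaces the team."""
    capacity = staff * WORK_HOURS_PER_YEAR / HOURS_PER_ITEM  # items per year
    net_progress = capacity - NEW_ITEMS_PER_YEAR
    if net_progress <= 0:
        return float("inf")
    return BACKLOG_ITEMS / net_progress


if __name__ == "__main__":
    for staff in (1, 10, 15, 20):
        print(f"{staff:>2} cataloguers: {years_to_clear(staff):.1f} years")
```

Under these assumptions a team of 10 handles 20,000 items a year, which is less than the worst-case intake, so manual cataloguing alone never catches up.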
We thought this was a great opportunity for collaboration with computational tools and the computer science field, but we need to understand each other's work and the capabilities of what exists. Here are some of the tools available that can help us with our dilemma. With this IMLS-funded project we are working with Pop-Up Archive to create speech-to-text transcripts of the entire collection, and with the University of Texas at Austin to analyze the audio to help further identify speakers and sounds. And we will use a crowdsourcing game to help correct or fix the computer-generated transcripts, which will hopefully help further train the tools to improve. We will not talk about image analysis.
Experience has shown that most speech-to-text tools don't output clean transcripts. Accurate transcripts depend on audio quality, speaker accents, background noise, etc. Given that our collection is from 100 different local TV and radio stations across the country, the audio and its quality vary widely. Some programs are in Spanish, some are musical performances, and nearly all video recordings begin with standard bars and tone. The speech-to-text tool tries to interpret these sounds as text, and it makes a number of other mistakes too. WGBH has created a web-based game to allow the public to help us fix and correct these transcripts. You are welcome to follow along with me if you have a computer as I walk through the game, and encouraged to play afterwards.
The game has terms of use that we need players to check off to make sure they understand that they cannot use the content for anything but helping us correct the transcripts. We've kept the clips to only 5 minutes in order to be able to take advantage of fair use.
There are 3 games you can play – identify errors, suggest fixes, and validate fixes. You gain points for each action taken.
You can set preferences on the type of content you would like to interact with. Or you can pick which station's content you would like to work on. We are hoping to perhaps get stations to compete with each other by getting their station volunteers and community to play against each other for more points. But we need to do a bit more development for that.
Each iteration of a game lasts 5 minutes. But you can play multiple times for any length of time. Three lines of the transcript are active at once. You listen to the audio, see the line highlighted and click on it if there is a mistake. There are instructions and guides on what is considered an error and how to mark it. It takes a little bit to figure it out, but after a few times you can pick it up pretty quickly.
In game 2, you correct things that have been tagged as errors or mark them as not errors.
And in game 3, you validate the corrections that have been made. You are given a choice of suggested fixes and pick the correct one.
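The three games described above can be read as a small pipeline that each transcript line passes through. The Python sketch below is only an illustration of that idea, not the actual Fix It implementation: the class and function names, the one point per action, and the two-vote acceptance threshold are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field

POINTS_PER_ACTION = 1  # hypothetical; the talk only says each action earns points
scores = Counter()     # player name -> points


@dataclass
class TranscriptLine:
    text: str                                         # machine-generated text
    flagged: bool = False                             # set in game 1
    suggestions: list = field(default_factory=list)   # collected in game 2
    votes: Counter = field(default_factory=Counter)   # collected in game 3


def identify_error(line: TranscriptLine, player: str) -> None:
    """Game 1: a player clicks a line that contains a mistake."""
    line.flagged = True
    scores[player] += POINTS_PER_ACTION


def suggest_fix(line: TranscriptLine, player: str, fix: str) -> None:
    """Game 2: a player proposes corrected text for a flagged line."""
    if not line.flagged:
        raise ValueError("line has not been flagged as an error")
    if fix not in line.suggestions:
        line.suggestions.append(fix)
    scores[player] += POINTS_PER_ACTION


def mark_not_error(line: TranscriptLine, player: str) -> None:
    """Game 2 alternative: a player decides the flagged line is fine as it is."""
    line.flagged = False
    scores[player] += POINTS_PER_ACTION


def validate_fix(line: TranscriptLine, player: str, choice: str) -> None:
    """Game 3: a player picks the correct text from the suggested fixes."""
    if choice not in line.suggestions:
        raise ValueError("choice is not one of the suggested fixes")
    line.votes[choice] += 1
    scores[player] += POINTS_PER_ACTION


def accepted_text(line: TranscriptLine, min_votes: int = 2) -> str:
    """Return the validated fix, or the original text if none has enough votes."""
    if line.votes:
        best, count = line.votes.most_common(1)[0]
        if count >= min_votes:
            return best
    return line.text
```

Keeping the original text until a fix collects enough votes means a single mistaken correction cannot overwrite a transcript line on its own.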
The game board keeps track of points and players, and highlights top scorers. Studies have shown that people play these games for personal satisfaction and that competition doesn't necessarily increase the desire to play. We hope people will be driven just by the personal satisfaction of getting points and helping us out, as opposed to competing against anyone in particular.
Once the transcripts have been verified, the JSON transcripts will be stored in the AAPB's Amazon S3 account and indexed for keyword searching on the AAPB website. The transcripts will be made available alongside the media on the record page. They can also be played like captions within the video player. And they will be able to be harvested via an API to be used as a dataset for research. We are hoping that researchers will begin to look at the collection as a dataset and start trying to see trends in programming over the last 60 years, particularly across news programs.
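To make the caption and keyword-search uses concrete, here is a minimal Python sketch. The JSON layout (a "parts" list with "start_time", "end_time" and "text") and the sample ID are assumptions for illustration; the talk does not describe the actual AAPB transcript schema. The sketch converts such a transcript to WebVTT, a caption format that web video players can read, and finds the timestamps at which a keyword is spoken.

```python
import json

# Hypothetical transcript layout; the real AAPB JSON schema may differ.
SAMPLE_JSON = """
{
  "id": "example-program-0001",
  "parts": [
    {"start_time": 0.0, "end_time": 4.2, "text": "Good evening and welcome."},
    {"start_time": 4.2, "end_time": 9.75, "text": "Tonight we look at public broadcasting."}
  ]
}
"""


def vtt_timestamp(seconds: float) -> str:
    """Format a time in seconds as a WebVTT timestamp (hh:mm:ss.mmm)."""
    millis = round(seconds * 1000)
    hours, rem = divmod(millis, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def to_webvtt(transcript: dict) -> str:
    """Turn timed transcript parts into a WebVTT caption file."""
    lines = ["WEBVTT", ""]
    for part in transcript["parts"]:
        start = vtt_timestamp(part["start_time"])
        end = vtt_timestamp(part["end_time"])
        lines.append(f"{start} --> {end}")
        lines.append(part["text"])
        lines.append("")  # a blank line ends each cue
    return "\n".join(lines)


def find_keyword(transcript: dict, keyword: str) -> list:
    """Return the start times of the parts whose text contains the keyword."""
    needle = keyword.lower()
    return [p["start_time"] for p in transcript["parts"]
            if needle in p["text"].lower()]


if __name__ == "__main__":
    sample = json.loads(SAMPLE_JSON)
    print(to_webvtt(sample))
    print(find_keyword(sample, "broadcasting"))
```

Because each part carries its own start and end time, the same file can drive both the captions in the player and a search result that jumps to the moment a word is spoken.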
Be sure to play and tell all your friends about it.