Jennifer Vaughn; Brad Campbell: Metadata for Audiovisual Broadcast Materials: Challenges and Opportunities
Slide text fragments: the name "Vladimir Putin" in multiple scripts, including والدیمیر پوتین (Perso-Arabic script), ვლადიმერ პუტინი (Georgian), and Վլադիմիր Պուտին (Armenian); the heading "AI/ML"; and an example RDF predicate, "author of" (http://id.loc.gov/vocabulary/relators/aut).

More from ÚISK FF UK (20)

Jak na video?
Marie Balíková: Databáze věcných autorit
Eva Lesenková: Zdravotní gramotnost: Jak můžeme lépe získat informace o zdraví?
Anna Hoťová: Školní knihovny
Magdalena Paul: Fake news
Rudolf Rosa: Milníky umělé inteligence
Pavel Berounský: Prohlídka datacentra Kokura (18. 10. 2021)
Pavel Herout: Datová centra (18. 10. 2021)
Anna Štičková: Čuchni ke knize
Hana Šandová: Centrum technického vzdělávání Půda jako třetí oddělení knihovny
Open data (Civic Tech)
Vojtěch Ripka: Taking Mediality Seriously
Tereza Simandlová: Open science v prostředí akademických knihoven: nová výzva...
Anna Hejlkova: Reprezentace historie ve videohrách: případová studie hry “Ki...
Sven Ubik: Distanční spolupráce v živé kultuře
Robin Kopecký: Pokusní králíci
Nina Wančová: Vývoj softwaru pro virtuální výstavy
David Novák: Historické vědy a počítačové zpracování dat
Stanislav Velčev: TEATER: setkání na půl cesty mezi knihovnictvím a archeologií
Radim Hladík: Úvod do slovních vektorů pro humanitní a sociální vědce


Editor's Notes

  1. RFERL Mission: https://pressroom.rferl.org/mission-statement-principles-of-ethical-journalism
  2. See https://www.hoover.org/sites/default/files/library/docs/story_of_radio_free_europe.pdf Photos are from the Hoover Institution collection.
  3. Collection at: https://www.hoover.org/library-archives/collections/radio-free-europeradio-liberty-records https://pressroom.rferl.org/historical-archives
  4. https://pressroom.rferl.org/rferl-language-services
  5. RFERL’s individual language services have successfully achieved their goals of serving as a surrogate free press for their home countries. This autonomy, and many other complicated organizational factors, led over time to divergent practices. Effective infrastructure is needed to connect individual units in the production chain. By linking systems and subsystems, the main phases of the broadcasting process (planning, production, postproduction, distribution, storage, and archiving) will form the component parts of a virtual integrated system. We believe that “the road to well-organized production, storage, and exchange of digital media content is paved with metadata. Metadata are the single most important instrument to realize an effective and consistent audiovisual production environment.” (De Jong, Annemieke, and Robert Egeter van Kuyk. Metadata in the audiovisual production environment: an introduction. Netherlands Audiovisual Archive, 2000, p. 6) First, we will look at the unique metadata needs of audiovisual content and present an overview of the existing metadata standards for broadcast media. Then we will present some of the work we have done to address RFERL’s unique challenges through structured metadata and talk about future plans for linked data.
  6. I am sure I don’t need to tell an audience of library science graduate students what metadata is or why it is essential for providing access to and retrieval of library or archive items, but it is also extremely important in every stage of broadcasting. Metadata is created throughout the planning, production, and distribution of a news product in the broadcasting industry, and it is the fuel for storing, searching for, and re-using content in a broadcast archive. Most graduate librarianship programs emphasize bibliographic cataloging rules, systems, and standards like UNIMARC. We primarily learn to catalog books and to manage text-based collections. Real library collections are heterogeneous, and while newer library cataloging systems and standards account for some of this complexity, most still contain a bias toward books.
  7. In order to create good metadata and to effectively manage collections, we, and by extension our standards, systems, and practices, should understand and accommodate the unique needs of audiovisual content. From "BIBFRAME AV Modeling Study: Defining a Flexible Model for Description of Audiovisual Resources." Descriptive standards for archival content (DACS, EAD) also do not necessarily meet the needs of media archives. They focus hierarchically on collection-level records, while users of broadcaster archives often need item-level, or even more granular, access to materials.
  8. Users of broadcast archives can be a broadcaster’s own staff, such as archivists, video editors, producers, and journalists, but also historians, documentary filmmakers, outside researchers, and other broadcast organizations. IFLA’s Library Reference Model describes the fundamental user tasks (find, identify, select, obtain, and explore) that good library data should enable, regardless of the system where it lives. Use cases for broadcast archive users certainly include all the IFLA user tasks, and they are frequently more complex and specialized.
  9. For a journalist to accurately identify whether a resource satisfies their information needs, a low-resolution copy needs to be generated and displayed within the system for viewing (a minimal proxy-generation sketch appears after these notes). Complex licensing, rights, and re-use information is also often needed to understand whether the resource can be re-used, whereas a simple copyright date often suffices for published materials in library catalogs. While library catalogs might send us to a particular shelf or drawer, or provide a web link, broadcast media systems should be able to retrieve resources from online, nearline, and offline storage tiers. Media archivists also expect to be able to perform technical tasks, including re-encoding files individually or in bulk to different formats and performing other preservation administration tasks.
  10. Adoption and use of standards is the default for us as librarians and archivists; we often even help create and implement the standards. This is less often the case in broadcast media archives. Proprietary systems are not just for archivists: they typically include several components meant for various stages of production and, ultimately, the archive. These systems are often opaque; we cannot see whether, or which, standards they employ. And obtaining and implementing the technology can seem more important than understanding how it works and what the long-term implications are. We have deadlines and breaking news, after all. So, we often find ourselves in the position of advocating for metadata standards to journalists and management at RFE. Standards can initially take more work to understand, plan for, and implement, but the long-term costs of creating one-off metadata schemes and the resulting loss of functionality are significant. We don’t need to reinvent the wheel. Using standards promotes consistency and optimal functionality across systems. Standards provide the basis for institutional memory, where practices and semantics are well understood and documented. Standards are updated and maintained by professional communities; their governing bodies often provide training and educational material, and they plan for upgrades, interoperability with other standards, new integrations, and other change. Of course, adoption of and adherence to standards is relative. Individual organizations employ standards to different extents, to the best of their ability and understanding, and as far as the standards apply. Standards won’t solve every problem, but they provide a stable foundation for our systems and give us a common vocabulary and understanding of the broadcasting universe. There are dozens of metadata standards for broadcasting, but we will focus on the two major open broadcasting metadata standards, plus their common ancestor.
  11. https://www.dublincore.org/about/ https://www.dublincore.org/specifications/dublin-core/ DC’s data model includes a "resource" class. The resource represents the media item in the abstract. Resources typically require descriptive details, like the title, subjects, and description. DC also has a "manifestation" class, which holds the format and other technical characteristics of a resource "manifested" somehow, i.e., in a digital or physical copy. DC follows the "one-to-one" principle, which states that each resource can have only a single manifestation. This is an unrealistic constraint for broadcasters, as it does not reflect our reality of having many copies or versions with the same intellectual content but different formats or technical details. (A small Dublin Core description sketch appears after these notes.) For discussion of the DC one-to-one principle, see Miller, Steven J. "The one-to-one principle: challenges in current practice." International Conference on Dublin Core and Metadata Applications. 2010.
  12. See http://pbcore.org/what-is-pbcore The PBCore metadata standard (Public Broadcasting Metadata Dictionary) was created by the public broadcasting community in the United States for use by public broadcasters and related communities that manage audiovisual assets, including libraries and archives. PBCore is built on Dublin Core, but its schema restructures the relationship between resource and manifestation and allows an intellectual asset to have an unlimited number of versions. It has a relatively simple group of classes and elements, but it includes more broadcast-specific elements like Genre, and it also comes with several controlled vocabularies that describe types of media assets, versions, and formats. To address local needs, it also allows "extensions," which are chunks of XML metadata from another standard, like PREMIS or METS. (A minimal PBCore record sketch appears after these notes.)
  13. EBUCore is also based on Dublin Core, but it has been extended for media. It contains many more elements than PBCore but was designed with flexibility and customization in mind. It abstracts different processes, formats, and types of media resources into classes that can be combined in ways that work for individual broadcasters. The EBU Class Conceptual Data Model, or CCDM, was also developed for modeling business processes across different domains. Both standards are updated regularly. We have learned a lot about both standards from several European broadcasters who have successfully implemented EBUCore and CCDM. At RFE, we were drawn to the modeling capabilities and the RDF structure of the EBU’s standards (more on RDF later), and to the simplicity of PBCore and its free software tools. See: Evain, Jean-Pierre. "Semantic technologies in broadcasting production." 2014 10th International Conference on Semantics, Knowledge and Grids. IEEE, 2014. https://tech.ebu.ch/MetadataEbuCore https://www.ebu.ch/metadata/ontologies/ebucore/ https://www.ebu.ch/metadata/ontologies/ebuccdm/
  14. In 2018, we used EBUCore and CCDM to create a conceptual data model for RFERL’s production processes. We did not use all the available domains or classes but tried to accurately model how we wanted to create, track, and move assets through a production cycle and on to archive and re-use.
  15. In 2019, we decided to use PBCore’s open-source and freely available Cataloging Tool in order to standardize and enhance our internal metadata practices. We are expecting a large-scale migration of assets and data into a new system in the next year or so. We wanted both to improve the granularity and quality of our metadata now and to prepare it for migration and mapping to a new system. We still plan to use EBUCore/CCDM for our new archive system, but with the existing mapping between EBUCore and PBCore this should be achievable. Though PBCore is relatively simple, there was a lot of work involved in moving from no standard to the PBCore standard. Its integrated controlled vocabularies work relatively well for our processes, though we had to "curate" the lists for our use, create mappings for our local common terminology (a toy term-mapping sketch appears after these notes), and write up internal documentation. We have also imported some external vocabularies into the software, but the software is too lightweight and simple to deal with the structured knowledge base we want to use for most of our subjects, names, and place references.
  16. There are many proprietary subject vocabularies for the news industry, and some open-source ones as well. We have assessed several, and while each has its strengths, they often contain many terms that we don’t use, and not enough of the ones we do need. We want to accommodate as many of our language services as possible, but building a custom multilingual taxonomy is extremely complicated and hard to maintain. As with our overall metadata structure, we wanted to leverage existing work.
  17. When we can implement more centralized and integrated search tools for locating our content, we will have to present a complex multilingual environment in a manageable way. We wouldn’t expect even our brilliant journalists to think to search for all the footage showing Vladimir Putin in every one of the scripts and languages shown on the slide. We value the authority work that librarians perform, but we don’t have resources to devote to it at RFE. Instead, we have explored the idea of using something a bit different as our “knowledge base,” and other academic libraries, broadcasters, and organizations are also finding it very useful. Instead of referring to Putin by a series of characters in any number of scripts and languages, wouldn’t it be more efficient to refer to him via a Uniform Resource Locator?
  18. See https://www.wikidata.org/wiki/Wikidata:Introduction
  19. Wikidata connects equivalent concepts from different language versions of Wikipedia. We can type any of Douglas Adams’s aliases into the search box, in any language or script, and this Wikidata item will come up. Wikidata contains concepts, entities, places, and events as well as names: really anything that is mentioned in Wikipedia. It has an API, so it is possible to build applications that harness Wikidata metadata (a minimal lookup sketch appears after these notes). Our RFE Video Archives hopes to use this API to help "tag" people in our content with Wikidata Q numbers, which would solve some of our access and retrieval issues. It is not a perfect solution: we will also have to maintain a local database of people, places, events, and ideas that are not notable enough for Wikidata. We need to develop workflows to create and add to Wikidata pages. And Wikidata sometimes has inaccuracies. Pages in different languages can reflect political disagreements or regional biases. But our analyses show that it has 80-90% of the terms that RFERL needs for its video.
  20. See https://www.gartner.com/en/research/methodologies/gartner-hype-cycle
  21. Artificial intelligence and machine learning technologies are currently at the peak of the Gartner hype cycle. AI/ML dominates articles, online discussions and tech blogs, conference presentations, and trade shows. The hype often leads to an unrealistic view that AI is a magical solution that will solve all our problems and replace the need for human archivists. There are some mature AI and ML technologies: fairly accurate transcripts of spoken language can be created, translation capability seems to improve daily, onscreen images can be identified and labeled, and text containing unstructured natural language can be mined and indexed. All these technologies can free workers from time-consuming and tedious work. Though we have tested ML/AI technologies and are very interested in their potential, we are still laying the foundations for good data governance and information management. We are warily watching broadcasters who are eagerly purchasing the technology without a good information map.
  22. Evain, J.P. and Rebecca Fraimow, 2019, Core developments in audiovisual metadata: A standards update, IASA Conference, Hilversum, Netherlands, October. Jean-Pierre Evain, the Director of Technology and Innovation at the EBU, states very directly that if you didn’t set up your metadata correctly in the first place, AI is not going to solve your problems but will merely create more disconnected, “siloed” data. Evain has spent years advocating for metadata standards that have semantic web and linked data potential. Machine learning applications create a ton of metadata, and it needs to live somewhere useful. An interesting aside is that these applications ultimately recreate metadata that was lost during the production process, so what if there were a way to reduce this redundancy and save AI/ML for novel and useful purposes?
  23. Having passed their own hype peak several years ago, Linked Data and the Semantic Web are finally approaching a stage of useful and functional development. A somewhat overly ambitious initial plan to connect all information on the Internet via linked data has gradually relaxed into a set of useful tools that are well suited to our work at RFE.
  24. Linked Data and Semantic Web technologies are quite complex and deserve their own semester-long classes. But we will spend the last few minutes of our presentation today discussing the underpinning of it all: Resource Description Framework, or RDF.   Today, we have already sneakily mentioned several things that have emerged from the semantic web and linked data movements: Wikidata, the online EBU ontologies, and even Dublin Core properties.   See: https://www.w3.org/RDF/
  25. RDF is based on the idea of making “statements” about resources, known as triples. A statement is like a simple sentence containing three parts: a subject, a predicate, and an object. The slide shows concrete examples of triples (a code sketch of a single triple appears after these notes).
  26. We think the complex and varied interrelationships of broadcast media metadata are better represented with RDF. Data describing these relationships can be interconnected, added to, and queried. Here, we have loosely modeled an RFE story as a graph using natural-language strings. Then we have replaced several of those strings with URIs. RDF can exist on a small scale or a very large scale. With databases structured with RDF, we can transform and then combine metadata from different sources into one big “lake” of data (a small combine-and-query sketch appears after these notes). Among other things, this will give us a new way to query our data. Data journalism, large-scale data analyses, and many other insights would become possible. Our efficiency, re-use, and sharing of content would also be greatly enhanced. And these benefits could occur even before we invest in AI and ML technologies. We believe that the time we have devoted to considering our possibilities and modeling information and processes will improve how we produce news at RFERL.
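
Appendix: code sketches referenced in the notes

Note 9 mentions generating a low-resolution copy for viewing inside the system. Below is a minimal sketch of how such a browse proxy could be produced with ffmpeg called from Python; the 360p H.264/AAC settings, file names, and paths are illustrative assumptions, not RFERL's actual workflow.

    import subprocess

    def make_proxy(master_path: str, proxy_path: str) -> None:
        """Transcode a broadcast master into a small browse proxy using ffmpeg."""
        subprocess.run(
            [
                "ffmpeg",
                "-i", master_path,                 # broadcast master, e.g. an MXF file
                "-vf", "scale=-2:360",             # scale to 360p, preserving aspect ratio
                "-c:v", "libx264", "-crf", "28",   # low-bitrate H.264 video
                "-c:a", "aac", "-b:a", "96k",      # compressed stereo audio
                proxy_path,
            ],
            check=True,
        )

    # Hypothetical usage
    make_proxy("VA-0001-master.mxf", "VA-0001-proxy.mp4")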
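
Note 11 describes Dublin Core's abstract resource and its single manifestation. Here is a minimal sketch using the Python rdflib library and the DCMI Metadata Terms vocabulary; the item URI and all values are invented for illustration.

    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import DCTERMS

    g = Graph()
    item = URIRef("https://archive.example.org/item/0001")  # hypothetical identifier

    # Descriptive details of the resource in the abstract
    g.add((item, DCTERMS.title, Literal("Interview with a dissident writer")))
    g.add((item, DCTERMS.subject, Literal("censorship")))
    g.add((item, DCTERMS.description, Literal("Studio interview recorded for the evening news block.")))

    # Technical detail of one manifestation; under a strict one-to-one reading,
    # a second copy in a different format would need its own resource description.
    g.add((item, DCTERMS["format"], Literal("video/mp4")))

    print(g.serialize(format="turtle"))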
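
Note 12 describes PBCore's asset/instantiation structure. The sketch below builds a skeletal PBCore-style record with Python's standard XML library; the element names follow our reading of PBCore 2.1, and the identifiers, vocabulary values, and storage location are invented, so the authoritative schema and required elements should be checked at pbcore.org.

    import xml.etree.ElementTree as ET

    NS = "http://www.pbcore.org/PBCore/PBCoreNamespace.html"
    ET.register_namespace("", NS)

    doc = ET.Element(f"{{{NS}}}pbcoreDescriptionDocument")
    ET.SubElement(doc, f"{{{NS}}}pbcoreAssetType").text = "Raw Footage"  # illustrative value
    ET.SubElement(doc, f"{{{NS}}}pbcoreIdentifier", source="RFE/RL Video Archive").text = "VA-0001"
    ET.SubElement(doc, f"{{{NS}}}pbcoreTitle").text = "Street protest, unedited camera footage"
    ET.SubElement(doc, f"{{{NS}}}pbcoreDescription").text = "Unedited footage shot for an evening news package."

    # One intellectual asset may have many instantiations (master, proxy, broadcast copy, ...)
    inst = ET.SubElement(doc, f"{{{NS}}}pbcoreInstantiation")
    ET.SubElement(inst, f"{{{NS}}}instantiationIdentifier", source="RFE/RL Video Archive").text = "VA-0001-master"
    ET.SubElement(inst, f"{{{NS}}}instantiationLocation").text = "nearline://va/0001.mxf"

    print(ET.tostring(doc, encoding="unicode"))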
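
Note 15 mentions curating controlled-vocabulary lists and mapping local terminology onto them. A toy sketch of that kind of mapping; every term shown is an invented example, not one of our actual curated lists.

    # Hypothetical mapping from terms journalists commonly type to curated asset-type terms.
    LOCAL_TO_CURATED = {
        "raw": "Raw Footage",
        "rushes": "Raw Footage",
        "pkg": "Package",
        "standup": "Stand-up",
    }

    def normalize_asset_type(local_term: str) -> str:
        """Return the curated term, or flag the value for manual review."""
        key = local_term.strip().lower()
        return LOCAL_TO_CURATED.get(key, f"REVIEW: {local_term}")

    print(normalize_asset_type("Rushes"))   # -> Raw Footage
    print(normalize_asset_type("vox pop"))  # -> REVIEW: vox pop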
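
Note 19 mentions using the Wikidata API to tag people with Q numbers. A minimal lookup sketch using the public wbsearchentities endpoint and the requests library; error handling, caching, and rate limiting are omitted, and the User-Agent string is a placeholder.

    import requests

    def wikidata_lookup(name: str, language: str = "en") -> list[dict]:
        """Search Wikidata for entities matching a name written in the given language."""
        resp = requests.get(
            "https://www.wikidata.org/w/api.php",
            params={
                "action": "wbsearchentities",
                "search": name,
                "language": language,  # language of the search string
                "uselang": "en",       # language of the returned labels
                "format": "json",
                "limit": 5,
            },
            headers={"User-Agent": "rferl-archive-demo/0.1 (example contact)"},
            timeout=10,
        )
        resp.raise_for_status()
        return [
            {"id": hit["id"], "label": hit.get("label"), "description": hit.get("description")}
            for hit in resp.json()["search"]
        ]

    # The Georgian, Armenian, and Perso-Arabic forms of the name should all resolve to the
    # same item ID, which can then be stored as a single language-neutral tag.
    print(wikidata_lookup("ვლადიმერ პუტინი", language="ka"))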
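
Note 25 introduces RDF triples. The sketch below states one triple with rdflib, reusing the Library of Congress "aut" relator that appears in the slide text as the predicate; the work URI is invented, and the object is the Wikidata item for Douglas Adams mentioned in note 19.

    from rdflib import Graph, Namespace, URIRef

    RELATORS = Namespace("http://id.loc.gov/vocabulary/relators/")

    g = Graph()
    work = URIRef("https://archive.example.org/work/hitchhikers-guide")  # invented URI
    adams = URIRef("http://www.wikidata.org/entity/Q42")                 # Douglas Adams

    # One statement = one triple: subject, predicate, object.
    # Read as: this work has the author (relator "aut") Douglas Adams.
    g.add((work, RELATORS.aut, adams))

    for subject, predicate, obj in g:
        print(subject, predicate, obj)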
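
Note 26 describes combining RDF from different systems into one "lake" and querying it. A toy sketch with rdflib and SPARQL: two small graphs stand in for exports from different systems, and a query finds every story that depicts a given Wikidata entity. The arc: property names and story URIs are invented; Q7747 is the Wikidata item we use here for Vladimir Putin and should be verified with a lookup like the one above.

    from rdflib import Graph, Literal, Namespace, URIRef

    ARC = Namespace("https://archive.example.org/ns/")   # invented local vocabulary
    WD = Namespace("http://www.wikidata.org/entity/")
    person = WD.Q7747  # Wikidata item used here for Vladimir Putin (verify before relying on it)

    # Two graphs standing in for metadata exported from different systems.
    newsroom, archive = Graph(), Graph()
    newsroom.add((URIRef("https://archive.example.org/story/123"), ARC.depicts, person))
    newsroom.add((URIRef("https://archive.example.org/story/123"), ARC.title, Literal("Press conference, Moscow")))
    archive.add((URIRef("https://archive.example.org/story/987"), ARC.depicts, person))
    archive.add((URIRef("https://archive.example.org/story/987"), ARC.title, Literal("Summit footage, 2007")))

    # Combine everything into one "lake" of statements.
    combined = Graph()
    for triple in newsroom:
        combined.add(triple)
    for triple in archive:
        combined.add(triple)

    query = """
        SELECT ?story ?title WHERE {
            ?story arc:depicts wd:Q7747 ;
                   arc:title ?title .
        }
    """
    for row in combined.query(query, initNs={"arc": ARC, "wd": WD}):
        print(row.story, row.title)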