Multilingualism ifla 2014 08


OCLC's 3 overlapping projects aim to generate true multi-lingual displays and to generate translation records for sharing via VIAF.

Published in: Internet

  • This project seeks to leverage the multilingual content of WorldCat. There are more than 300 million records in WorldCat today representing holdings of the world’s libraries. There are also another 200 million articles. Of the 300 million, more than half are in languages other than English.
  • The 300 million records are clustered into works. A record may use just one language of cataloguing, or it may mix languages of cataloguing, e.g. subject headings in more than one language. Parallel records may also exist, i.e. records in different languages of cataloguing describing the same resource.
  • The existing architecture is bibliographic record centric. There may be links to author and subject authority records. The work, manifestation and content clusters only contain identifiers, no metadata at present.
  • Three complementary initiatives are in progress concerning multi-lingualism and WorldCat.
  • The first project: generation of metadata at work level to make better presentations. Implementation is planned for Q1 2015.
  • The GLIMIR project creates clusters of parallel records for the same manifestation (manifestation clusters) and also clusters for the same content, though the form may be different (print, microform, digital). We are in the process of trawling through the database to make these clusters.

    Only the part in blue will be changed by the multi-lingual bibliographic structure approach
  • Image from the Connexion client showing the short list selection benefiting from GLIMIR. So far 103 million records have GLIMIR IDs.
  • Different records in different languages of cataloguing are clustered. The « D » in the last line indicates an LC record. The manifestation cluster includes different records in different languages of cataloguing for the same manifestation. It also includes apparent duplicates that the de-duplication program was not able to merge confidently.
  • With SRU it is possible to search on work ID, content ID (Reproduction1) and manifestation ID. During the development of GLIMIR it was determined that it would be possible in some cases to identify equal content and thus present originals, microform, reprints, and electronic editions clustered together.
  • Concerning correcting records coded as “und” = “undetermined”, we expect that we can correct about 7 million by searching for the same title string in other records.
  • Currently when the short list and full displays are created, the system selects the most appropriate record for display. The most appropriate record is determined by its size and the number of associated holdings and it was envisaged to extend this to include the language of cataloguing. But we think we can do much better by questioning the “most appropriate record” concept.
  • Here the first record is catalogued in German but has no significant German content – no subject headings and no notes. There are other richer records that could inform the user better.
  • Hybrid records also mean that significant linguistic information may be buried in other records.
  • This record has subject headings in 3 languages.
  • Build the information to display to users from all available records and show them all relevant holdings. Do not display just one selected record from the work set. Cataloguers too will benefit from this, being able to drill down to actual records where appropriate.
  • This is the theoretical bibliographic structure, which will no longer be bibliographic-record centric but work centric. All information (authors, notes, summaries, subject headings) will be flagged with the language of cataloguing.
  • The work level metadata will be tagged with language for each data element.
    Instead of always showing the data from the main title in the record, the alternative script fields may be chosen for display, depending on what the system can determine about the user, e.g. from IP address range or expressed preference. The ability to change the language of display already exists, but the number of fields that change will be increased and the tabulation of several displays will be improved.
    Holdings will be consolidated from all records that are applicable to a display (work, content, manifestation level).
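The idea of language-tagged elements chosen by user preference can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not OCLC's implementation: the `TaggedValue` type, the `pick_for_display` function and the fallback rule (show the first available value when nothing matches) are all hypothetical. The two titles are the ones used in the display examples in this talk.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class TaggedValue:
    """One work-level data element tagged with its language (hypothetical type)."""
    text: str
    lang: str  # language code of this value, e.g. "eng", "fre", "jpn"


def pick_for_display(values: Sequence[TaggedValue],
                     preferred: Sequence[str]) -> Optional[TaggedValue]:
    """Return the first value matching the user's language preferences, in order.

    Assumed fallback: when no preference matches, return the first
    available value; return None when there are no values at all.
    """
    for lang in preferred:
        for value in values:
            if value.lang == lang:
                return value
    return values[0] if values else None


# Illustrative data only: the original French title and its English translation.
titles = [
    TaggedValue("Pêcheur d'Islande", "fre"),
    TaggedValue("An Iceland Fisherman", "eng"),
]

print(pick_for_display(titles, ["eng"]).text)         # English preferred
print(pick_for_display(titles, ["ger", "fre"]).text)  # German missing, French used
print(pick_for_display(titles, ["jpn"]).text)         # no match, falls back to first value
```

A real display would presumably apply the same choice to every tagged element (summaries, subject headings, notes) and take the preference list from the interface language or the user's settings.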
  • An Iceland Fisherman, original in French displayed in English UI with Translations table.
  • Same record with Chinese interface
  • The Grand Design; English in Italian UI with Translations table.
  • Same in Japanese interface
  • Instead of working directly on lesser-quality records and on the long tail to improve quality in WorldCat, we are turning our attention to the most important works and working out ways to use the good records to improve quality.
  • And here’s one title written by each of those “cream” authors – in the original language and script. Likely you’ve heard of all or most of them? But probably in English (click).
    Acknowledgments: Karen Smith-Yoshimura
  • Acknowledgment: Karen Smith-Yoshimura
    We have been focusing on the content that is most likely to be of interest to the most people – translations.
    The cream of the world’s cultural and knowledge heritage is shared by being translated and WorldCat contains many rich cataloguing records for these translations. The Virtual International Authority File, an aggregation of authority records from over 30 agencies worldwide, identifies 15 million unique persons. When we datamine WorldCat, only 7% have written works that have been translated into at least one other language. Only 7,000 have had their works translated into 10 or more languages. This is the “short head” of works that have the most impact on readers worldwide.
  • Translated titles are not always consistent, causing work grouping to fail. Sometimes this is:
    caused by titles without subtitles,
    caused by different forms of uniform title, i.e. in Gujarati and in English (several forms),
    caused by inverting the titles,
    caused by placing the name Gandhi before “Autobiography”.

    Some figures: French – 15 records, 6 work sets; German – 9 records, 9 work sets; Italian – 7 records, 5 work sets; Spanish – 8 records, 4 work sets.
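To make the figures above easier to compare, the short sketch below computes the average number of records per work set for each language. It assumes that all records counted for a language describe translations of the same work, so that a perfectly clustered language would have a single work set; the helper name `records_per_work_set` is hypothetical.

```python
# Figures quoted in the notes above: language -> (records, work sets).
figures = {
    "French": (15, 6),
    "German": (9, 9),
    "Italian": (7, 5),
    "Spanish": (8, 4),
}


def records_per_work_set(records: int, work_sets: int) -> float:
    """Average number of records that ended up in each work set."""
    return records / work_sets


summary = {lang: records_per_work_set(r, w) for lang, (r, w) in figures.items()}
total_records = sum(r for r, _ in figures.values())
total_work_sets = sum(w for _, w in figures.values())

for lang, ratio in summary.items():
    print(f"{lang}: {ratio:.2f} records per work set")
print(f"Total: {total_records} records spread over {total_work_sets} work sets")
```

A ratio of 1.0, as for German, means that no two records were grouped together at all.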
  • We have been datamining WorldCat to generate work-translation ("expression level") records—including the translated title and translator with links to the original work and the author—and adding them to VIAF (Virtual International Authority File), flagged as "xR". Here is a concrete example of the impact – Jane Austen’s Pride and Prejudice originally had 13 translations associated with it; after we generated the xR records for other translations as datamined from WorldCat the number increased to 50.

    So far we have added 2 million xR records to VIAF. About the same as the UNESCO database after 25 years (!)
  • UNESCO has also been interested in identifying and aggregating translations world-wide for an “international bibliography of translations” – they have been at it for over 80 years! Their database represents entries contributed from UNESCO members since 1979 and contains 2 million entries.
  • Yet when we compare the results for a specific work, like Anatole France’s Crime de Silvestre Bonnard, we can see that the contributions OCLC member libraries have made, through WorldCat, are far greater than all the work invested so far by the UNESCO member states. VIAF, with the xR records datamined from libraries’ contributions to WorldCat supplementing those from VIAF contributors, lists 29 translations in 28 different languages.
  • We are now working on enhancing the VIAF display to make the contents of title records more visible. This is a mockup of a VIAF Consolidated display – shows work with expressions summary. It shows the title as it is translated in each language.
  • This is a mockup of the full expressions summary for the work Pêcheur d’Islande by Pierre Loti.
  • This is a mockup of a VIAF Full display – showing different translations of the title, plus the translator and the earliest determined publication date of each.
  • Acknowledgments: Karen Smith-Yoshimura & Dan Benson
    The relationship of a work (with an author) and its associated translations (with their respective translators) is relatively straightforward. This diagram captures the relationship in two reciprocal links: one from the original Chinese book (blue box) through its property HasTranslation, pointing (with blue arrows) to each translation (red boxes), and one from each translation back to the original Chinese work (red arrows) through the property IsTranslationOf. A work can have any number of translations, and there can be multiple translations into the same language, which is why identifying the translator is so important.
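The two reciprocal links can be modelled as a small data structure. The sketch below is illustrative only and is not the VIAF data model: the `Work` and `Translation` classes and the `link` helper are hypothetical names, and the sample values are taken from the Journey to the West slide in this deck.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Work:
    """An original work (hypothetical class for illustration)."""
    title: str
    language: str
    author: str
    # HasTranslation: links from the original to each translation.
    has_translation: List["Translation"] = field(default_factory=list)


@dataclass(eq=False)
class Translation:
    """One translation of a work, identified by language and translator."""
    title: str
    language: str
    translator: str
    # IsTranslationOf: link back to the original (kept out of repr to avoid noise).
    is_translation_of: Optional[Work] = field(default=None, repr=False)


def link(work: Work, translation: Translation) -> None:
    """Set both reciprocal links in one step so they cannot drift apart."""
    work.has_translation.append(translation)
    translation.is_translation_of = work


original = Work("西遊記", "Chinese", "吳承恩")
yu = Translation("Journey to the West", "English", "Anthony C. Yu")
jenner = Translation("Journey to the West", "English", "W. J. F. Jenner")
link(original, yu)
link(original, jenner)

# Two translations share title and language; only the translator tells them apart.
english = [t.translator for t in original.has_translation if t.language == "English"]
print(english)
```

Setting both directions in a single helper is a design choice made for this sketch; it keeps HasTranslation and IsTranslationOf consistent with each other.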

  • Acknowledgements: Karen Smith-Yoshimura
    Machines access VIAF far more than humans. To leverage all the work done by the OCLC cooperative we want to share the relationships we’ve established between original works and their associated translations with the semantic Web. Here is a sample markup of an original Chinese work written by Gao Xingjian, a Chinese Nobel Prize laureate for literature, and one of the translations of his work into English. We marked this up with the schema: vocabulary, but there are two new terms we are proposing, shown here: translator and translationOfWork.
  • Acknowledgements: Karen Smith-Yoshimura

    This work will help us better understand the extent to which information is shared across cultures. Here are some questions we should be able to answer soon!

    More information on this project is available on the OCLC Research website at the link shown.

    Just think: how many of the 150+ million records for non-English-language works are actually translations of English-language works? And how many of the English-language records are also translations? It could be as high as 25%.

    Once we have these records generated, it will open many new possibilities.

    Also, we know we have 300 million records, but how many real resources do we have? GLIMIR will produce these figures.

    We are just starting…
  • VIAF expression level displays will include GLIMIR IDs.
  • Multilingualism ifla 2014 08

    1. 1. IFLA - Lyon, France 19 August 2014 Multilingualism in WorldCat and VIAF Janifer Gatenby Working with Karen Smith-Yoshimura, Robert Bremer, Eric Childress, Jean Godby, Richard Greene, JD Shipengrover, Gail Thornburg, Jenny Toves, Diane Vizine Goetz, Shenghui Wang, Jay Weitz
    2. 2. WorldCat Today • Resources in nearly all languages • Contributed by more than 20,000 libraries worldwide • More than half the database is for works not in English Languages English German French Spanish Chinese Dutch Japanese Russian Arabic 469 others
    3. 3. WorldCat Today • Bibliographic Records – Hybrid records – Parallel records • Clustered at Work level (FRBR)
    4. 4. Existing Architecture [diagram: each bibliographic record links to Authors, Subj, Classif and Holdings; bibliographic records are grouped into Work, Content and Manifestation clusters]
    5. 5. Complementary Initiatives Work Level Record GLIMIR Manifestation & Content Clusters Multi-lingual Bibliographic Structure
    6. 6. Create a consolidated metadata summary for the content of a work Objective: Work Level Record
    7. 7. Work Level Record Coming Q1 2015
    8. 8. Create better work presentations GLIMIR: Objective
    9. 9. Users like C • The Content Cluster GLIMIR – Enables better work record displays by reducing the number of lines that display for large works – Enables a choice of format and presents the formats that could be acceptable substitutes – Consolidates holdings for identical content • The Manifestation Cluster is important – Consolidates holdings at manifestation level – In the short term allows the record catalogued in the language of the interface to be chosen for display – Reduces apparent duplication – Allows a more accurate count of the number of manifestations in WorldCat (as opposed to the number of records) Cataloguers & scholars like C
    10. 10. Manifestation Clustering So far 103 million records processed (about 30%)
    11. 11. Manifestation Cluster Opened
    12. 12. SRU Search: Loti Pêcheur d’islande (Work ID 21536567) • Work: 18 records, 148 holdings • Content: 14 records, 143 holdings • Manifestation: 7 records, 115 holdings
    13. 13. Multilingual Bibliographic Structure Project Objective: Improve displays; surface translations
    14. 14. Multilingual Bibliographic Structure Project • Creates true multi-lingual displays – At work and manifestation levels – Using all available data instead of “most appropriate record” – Generates data • Corrects many of the 28 million records coded “und” • Better control and linking of translations • Input to refinement of work clusters • Smarter data storage
    15. 15. “Most appropriate” questioned • selects the most appropriate record to show to a user as representative of the work in the short result list and beyond • The end result will not be very satisfactory from a multi-lingual viewpoint… here’s why
    16. 16. Which record is better to present to a German speaker?
    17. 17. Incomplete Swedish Record
    18. 18. Hybrid record
    19. 19. Most appropriate display Build the display from all available data
    20. 20. Multilingual Bibliographic Structure Project • Work level data, mined from all associated bibliographic records will be displayed supplemented with expression / manifestation level data as the user drills through the short to fuller versions of the metadata. End user interface will show works and manifestations not bibliographic records; the cataloguing client will also show bibliographic records
    21. 21. Proposed new architecture [diagram: Work, Manif (manifestation), Contents, Holding, Subj, Classif, Authors and Translations (Language of work) elements, tagged by language: eng, fre, ger, jpn]
    22. 22. Important principles • Language tagging of elements, particularly – Summaries (M21 520) – Subject headings • Display in script preferred by the user if data is available • Improve translated interfaces • Show consolidated holdings as appropriate
    23. 23. Translations Surfacing the “cream”
    24. 24. Great works are translated • The cream of the world’s cultural and knowledge heritage is shared by being translated • WorldCat contains many rich cataloguing records for these translations GOAL: Data mine the really good records to improve clustering, presentation, authority records and linked data
    25. 25. Ιλιάδα / The Iliad • 紅樓夢 / Dream of the Red Chamber • ঘরে বাইরে / The Home and the World • زقاق المدق / Midaq Alley • Война и миръ / War and Peace • 源氏物語 / The Tale of Genji • דער בעל-תשובה / The Penitent • સત્યના પ્રયોગો અથવા આત્મકથા / The Story of My Experiments with Truth [Gandhi autobiography]
    26. 26. Translations • Leo Tolstoy: 32 languages • Homer: 28 languages • Rabindranath Tagore: 21 languages • Isaac Bashevis Singer: 17 languages • Najīb Maḥfūẓ: 12 languages • Cao Xueqin: 9 languages • Mahatma Gandhi: 7 languages • Murasaki Shikibu: 7 languages
    27. 27. Improving work clustering • Inconsistencies cause work clusters to be incomplete resulting in less than optimal search results – Titles without subtitles – Missing or different forms of uniform title – Inverted title – Different coding of original and translated information Generated uniform title authority records will overcome most of these differences without needing to edit individual records
    28. 28. Addition of xR records to VIAF Before After
    29. 29. UNESCO Translation Database
    30. 30. XR VIAF Record VIAF ID for Author Translated title Translator
    31. 31. VIAF Linked Data New Information
    32. 32. Title: 西遊記; Language: Chinese; Author: 吳承恩; Created: 1592
        HasTranslation (each translation links back to the original through IsTranslationOf):
        – Title: Journey to the West; Language: English; Translator: Anthony C. Yu; Date: 1977
        – Title: Journey to the West; Language: English; Translator: W. J. F. Jenner; Date: 1982-1984
        – Title: Tây du ký bình khảo; Language: Vietnamese; Translator: Phan Quân; Date: 1980
        – Title: Monkeys Pilgerfahrt; Language: German; Translator: Georgette Boner; Date: 1983
        – Title: 西遊記; Language: Japanese; Translator: 中野美代子; Date: 1986
    33. 33. Markup for the Semantic Web
        # Original Work (in Chinese)
        <> a schema:CreativeWork;
            schema:creator <> ; # "Gao, Xingjian"
            schema:inLanguage "zh";
            schema:name "靈山"@zh;
            .
        # Translated Work (in English)
        <> a schema:CreativeWork;
            schema:creator <> ; # "Gao, Xingjian"
            [new]:translator <> ; # "Lee, Mabel"
            schema:inLanguage "en";
            schema:name "Soul Mountain"@en ;
            [new]:translationOfWork <>
    34. 34. Understanding information sharing across cultures • What percentage of non-English works are translations of English works, and vice-versa? • Which authors are translated the most? • Which works have been translated into the most languages? • Which countries translate the most English works, the most non-English works? • Which countries translate a new work the fastest? Etc.
    35. 35. Where are we now? Clustering • Work clusters done; ongoing refinement • GLIMIR clustering done for all [simple] text; – 103 million records have GLIMIR IDs • Working on collected works Displays • Working on VIAF expression displays • Work level displays in ++ Data Mining for translations
    36. 36. Janifer Gatenby EMEA Program Manager Metadata Explore. Share. Magnify.