This project seeks to leverage the multilingual content of WorldCat. There are more than 300 million records in WorldCat today representing holdings of the world’s libraries. There are also another 200 million articles. Of the 300 million, more than half are in languages other than English.
The 300 million records are clustered into works. The records may be in just one language of cataloguing or they may have a mixture of languages of cataloguing, e.g. subject headings in more than one language. Parallel records may exist, i.e. records in different languages of cataloguing describing the same resource.
The existing architecture is bibliographic record centric. There may be links to author and subject authority records. The work, manifestation and content clusters only contain identifiers, no metadata at present.
Three complementary initiatives are in progress concerning multi-lingualism and WorldCat.
The first project – generation of metadata at work level to make better presentations. Implementation Q1 2015.
The GLIMIR project creates clusters of parallel records for the same manifestation (manifestation cluster) and also clusters for the same content, though the form may be different (print, microform, digital). We are in the process of trawling through the database to make these clusters.
Only the part in blue will be changed by the multi-lingual bibliographic structure approach
Image from the Connexion client that shows the short list selection benefiting from GLIMIR. So far 103 million records have GLIMIR IDs.
Different records in different languages of cataloguing are clustered. The « D » in the last line indicates an LC record. The manifestation cluster includes different records in different languages of cataloguing for the same manifestation. It also includes apparent duplicates that the de-duplication program was not able to confidently merge.
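The two kinds of GLIMIR cluster can be pictured as two groupings over the same records. The following is only a minimal sketch: the field names (`ocn`, `cat_lang`, `manifestation_id`, `content_id`) and the toy values are assumptions made for illustration, not the actual WorldCat or GLIMIR schema.

```python
from collections import defaultdict

# Toy records. Field names and ID values are illustrative assumptions,
# not the real WorldCat / GLIMIR data model.
records = [
    {"ocn": 1, "cat_lang": "eng", "manifestation_id": "m1", "content_id": "c1"},
    {"ocn": 2, "cat_lang": "fre", "manifestation_id": "m1", "content_id": "c1"},  # parallel record
    {"ocn": 3, "cat_lang": "ger", "manifestation_id": "m2", "content_id": "c1"},  # same content, other form
    {"ocn": 4, "cat_lang": "eng", "manifestation_id": "m3", "content_id": "c2"},
]


def cluster_by(records, key):
    """Group record numbers by the given cluster identifier."""
    clusters = defaultdict(list)
    for rec in records:
        clusters[rec[key]].append(rec["ocn"])
    return dict(clusters)


manifestation_clusters = cluster_by(records, "manifestation_id")
content_clusters = cluster_by(records, "content_id")

print(len(records), "records,", len(manifestation_clusters), "manifestations,",
      len(content_clusters), "content clusters")
```

Counting the keys of the manifestation grouping rather than the records is what gives a count of manifestations as opposed to a count of records.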
Searching with SRU, it is possible to search on workID, Content ID (Reproduction1) and manifestation ID. During the development of GLIMIR it was determined that it would be possible in some cases to identify equal content and thus present originals, microform, reprints, and electronic editions clustered together.
Concerning correcting records coded as “und” = “undetermined”, we expect that we can correct about 7 million by searching for the same title string in other records.
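The idea of correcting "und" records from other records with the same title string could be sketched roughly as follows. This is an illustration under assumptions, not the production process: the record fields (`id`, `title`, `lang`), the light whitespace and case normalisation, and the rule of only suggesting a language when all matching records agree are all hypothetical choices.

```python
from collections import Counter, defaultdict


def normalise(title):
    """Assumed normalisation: case-fold and collapse whitespace."""
    return " ".join(title.casefold().split())


def infer_languages(records):
    """Suggest a language for 'und' records from records sharing the same title."""
    langs_by_title = defaultdict(Counter)
    for rec in records:
        if rec["lang"] != "und":
            langs_by_title[normalise(rec["title"])][rec["lang"]] += 1

    suggestions = {}
    for rec in records:
        if rec["lang"] == "und":
            seen = langs_by_title.get(normalise(rec["title"]))
            # Assumed rule: only suggest when every matching record agrees.
            if seen and len(seen) == 1:
                suggestions[rec["id"]] = next(iter(seen))
    return suggestions


# Toy data for illustration only.
records = [
    {"id": 1, "title": "Pêcheur d'Islande", "lang": "fre"},
    {"id": 2, "title": "Pêcheur  d'Islande", "lang": "und"},
    {"id": 3, "title": "Solaris", "lang": "pol"},
    {"id": 4, "title": "Solaris", "lang": "eng"},
    {"id": 5, "title": "Solaris", "lang": "und"},
]
suggestions = infer_languages(records)
print(suggestions)
```

In this toy data the second record gets a suggestion, while the title shared by records in two different languages is left alone because the evidence is ambiguous.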
Currently when the short list and full displays are created, the system selects the most appropriate record for display. The most appropriate record is determined by its size and the number of associated holdings and it was envisaged to extend this to include the language of cataloguing. But we think we can do much better by questioning the “most appropriate record” concept.
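To make the "most appropriate record" idea concrete, here is a small sketch. The notes only say that size and number of holdings are used and that language of cataloguing was to be added; how the criteria are weighted is not stated, so the ordering below (language match first, then size, then holdings) and the field names are assumptions.

```python
def most_appropriate(records, ui_lang=None):
    """Pick one record to represent the work.

    Assumed ranking: language-of-cataloguing match (if ui_lang is given),
    then record size, then number of holdings.
    """
    def score(rec):
        return (rec["cat_lang"] == ui_lang, rec["size"], rec["holdings"])

    return max(records, key=score)


# Toy work set for illustration only.
work_set = [
    {"id": 1, "cat_lang": "ger", "size": 400, "holdings": 12},    # thin record
    {"id": 2, "cat_lang": "eng", "size": 2100, "holdings": 350},  # rich record
]

chosen_default = most_appropriate(work_set)
chosen_german = most_appropriate(work_set, ui_lang="ger")
print(chosen_default["id"], chosen_german["id"])
```

With a German interface the thin German record wins over the richer one, which is the kind of outcome that makes selecting a single record unsatisfactory.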
Here the first record is catalogued in German but has no significant German content – no subject headings and no notes. There are other richer records that could inform the user better.
Hybrid records also mean that significant linguistic information may be buried in other records.
This record has subject headings in 3 languages.
Build the information to display to users from all available records and show them all relevant holdings. Do not display just one selected record from the work set. Cataloguers too will benefit from this, being able to drill down to actual records where appropriate.
This is the theoretical bibliographic structure that will no longer be bibliographic record centric but work centric. All information – authors, notes, summaries, subject headings will be flagged with the language of cataloguing.
The work level metadata will be tagged with language for each data element. Instead of always showing the data from the main title in the record, the alternative script fields may be chosen for display, depending on what the system can determine about the user, e.g. from IP address range or expressed preference. Worldcat.org already includes the ability to change the language of display, but the number of fields that change will be increased and the tabulation of several displays will be improved. Consolidate holdings from all records that are applicable to a display (work, content, manifestation level).
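A work-centric, language-tagged structure of this kind might look like the sketch below. Everything here is hypothetical: the record layout, the element names, the placeholder values and the fallback order (preferred language, then English, then whatever is available) are assumptions for illustration only.

```python
def build_work_summary(records):
    """Merge elements from all records of a work, tagged by language of cataloguing."""
    summary = {}
    for rec in records:
        for element, values in rec["elements"].items():
            tagged = summary.setdefault(element, {}).setdefault(rec["cat_lang"], [])
            for value in values:
                if value not in tagged:  # avoid repeating identical values
                    tagged.append(value)
    return summary


def for_display(summary, preferred, fallback="eng"):
    """Per element, use the preferred language, else the fallback, else any language."""
    view = {}
    for element, by_lang in summary.items():
        if preferred in by_lang:
            view[element] = by_lang[preferred]
        elif fallback in by_lang:
            view[element] = by_lang[fallback]
        else:
            view[element] = next(iter(by_lang.values()))
    return view


# Placeholder data, not real catalogue content.
records = [
    {"cat_lang": "ger", "elements": {"title": ["German title"]}},
    {"cat_lang": "eng", "elements": {"title": ["English title"],
                                     "subject": ["English subject heading"]}},
    {"cat_lang": "fre", "elements": {"title": ["French title"],
                                     "subject": ["French subject heading"]}},
]
summary = build_work_summary(records)
german_view = for_display(summary, preferred="ger")
print(german_view)
```

The German view takes its title from the German record but still shows a subject heading drawn from another record, rather than showing nothing because the German record has none.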
An Iceland Fisherman, original in French displayed in English UI with Translations table.
Same record with Chinese interface
The Grand Design; English in Italian UI with Translations table.
Same in Japanese interface
Instead of working directly on lesser-quality records, and instead of working on the long tail, we are turning our attention to the most important works and working out ways to use the good records to improve quality in WorldCat.
And here’s one title written by each of those “cream” authors – in the original language and script. Likely you’ve heard of all or most of them? But probably in English (click). Acknowledgments: Karen Smith-Yoshimura
Acknowledgment: Karen Smith-Yoshimura. We have been focusing on the content that is most likely to be of interest to the most people: translations. The cream of the world’s cultural and knowledge heritage is shared by being translated, and WorldCat contains many rich cataloguing records for these translations. The Virtual International Authority File, an aggregation of authority records from over 30 agencies worldwide, identifies 15 million unique persons. When we datamine WorldCat, only 7% of them have written works that have been translated into at least one other language. Only 7,000 have had their works translated into 10 or more languages. This is the “short head” of works that have the most impact on readers worldwide.
Translated titles are not always consistent, which causes work grouping to fail. Causes include titles without subtitles; different forms of uniform title, e.g. in Gujarati and in English (several forms); and inverted titles, e.g. placing the name Gandhi before “Autobiography”.
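One way to picture how a uniform title authority record could pull such variants into a single work set is a lookup from every known title form to one work identifier. This is only a sketch: the authority record layout, the `work_id` value, the variant list and the normalisation are assumptions for illustration, not the actual clustering algorithm.

```python
def normalise(title):
    """Assumed normalisation: case-fold, drop colons, collapse whitespace."""
    return " ".join(title.casefold().replace(":", " ").split())


def build_index(authority_records):
    """Map every preferred and variant title form to its work identifier."""
    index = {}
    for auth in authority_records:
        for form in [auth["preferred"], *auth["variants"]]:
            index[normalise(form)] = auth["work_id"]
    return index


def work_for(title, index):
    """Return the work identifier for a title, or None if no form matches."""
    return index.get(normalise(title))


# Hypothetical authority record; the identifier and variant list are illustrative.
authorities = [
    {
        "work_id": "w1",
        "preferred": "The Story of My Experiments with Truth",
        "variants": [
            "An autobiography: the story of my experiments with truth",
            "Gandhi's autobiography",
            "સત્યના પ્રયોગો અથવા આત્મકથા",
        ],
    }
]
index = build_index(authorities)
print(work_for("AN AUTOBIOGRAPHY : the story of my experiments with truth", index))
```

Because the variants live in the authority record, the individual bibliographic records do not need to be edited for them to land in the same work set.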
Some figures: French: 15 records, 6 work sets; German: 9 records, 9 work sets; Italian: 7 records, 5 work sets; Spanish: 8 records, 4 work sets.
We have been datamining WorldCat to generate work-translation ("expression level") records—including the translated title and translator with links to the original work and the author—and adding them to VIAF (Virtual International Authority File), flagged as "xR". Here is a concrete example of the impact – Jane Austen’s Pride and Prejudice originally had 13 translations associated with it; after we generated the xR records for other translations as datamined from WorldCat the number increased to 50.
So far we have added 2 million xR records to VIAF. About the same as the UNESCO database after 25 years (!)
UNESCO has also been interested in identifying and aggregating translations world-wide for an “international bibliography of translations” – they have been at it for over 80 years! Their database represents entries contributed from UNESCO members since 1979 and contains 2 million entries.
Yet when we compare the results for a specific work, like Anatole France’s Crime de Silvestre Bonnard, we can see that the contributions OCLC member libraries have made through WorldCat are far greater than all the work invested so far by the UNESCO member states. VIAF, with the xR records datamined from libraries’ contributions to WorldCat supplementing those from VIAF contributors, lists 29 translations in 28 different languages.
We are now working on enhancing the VIAF display to make the contents of title records more visible. This is a mockup of a VIAF Consolidated display – shows work with expressions summary. It shows the title as it is translated in each language.
This is a mockup of the full expressions summary for the work Pêcheur d’Islande by Pierre Loti.
This is a mockup of a VIAF Full display – showing different translations of the title, plus the translator and the earliest determined publication date of each.
Acknowledgments: Karen Smith-Yoshimura & Dan Benson. The relationship of a work (with an author) and its associated translations (with their respective translators) is relatively straightforward. This diagram captures the relationship in two reciprocal links, one from the original Chinese book (blue box) through its property HasTranslation pointing (with blue arrows) to each translation (red boxes), and one from each translation back to the original Chinese work (red arrows) through the property IsTranslationOf. A work can have any number of translations, and there can be multiple translations into the same language, which is why identifying the translator is so important.
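The reciprocal HasTranslation / IsTranslationOf links can be modelled very simply. The sketch below is an assumption-laden illustration: the `Work` class and `add_translation` helper are hypothetical names, the translated titles and translators are those shown on the VIAF Linked Data slide, and the original Chinese title and the language codes are filled in for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity comparison; avoids recursing through the cycle
class Work:
    title: str
    language: str
    translator: Optional[str] = None
    has_translation: List["Work"] = field(default_factory=list)
    is_translation_of: Optional["Work"] = None


def add_translation(original, translation):
    """Set both reciprocal links in one step so they cannot drift apart."""
    original.has_translation.append(translation)
    translation.is_translation_of = original


original = Work("西遊記", "zh")  # original title and language code assumed
for title, lang, translator in [
    ("Journey to the West", "en", "Anthony C. Yu"),
    ("Journey to the West", "en", "W. J. F. Jenner"),
    ("Tây du ký bình khảo", "vi", "Phan Quân"),
    ("Monkeys Pilgerfahrt", "de", "Georgette Boner"),
]:
    add_translation(original, Work(title, lang, translator))

# Two translations share a title and a language; only the translator tells them apart.
english = [t.translator for t in original.has_translation if t.language == "en"]
print(english)
```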
Acknowledgements: Karen Smith-Yoshimura. Machines access VIAF far more than humans. To leverage all the work done by the OCLC cooperative we want to share the relationships we’ve established between original works and their associated translations with the semantic Web. Here is a sample markup of an original Chinese work written by Gao Xingjian, a Chinese Nobel Prize laureate for literature, and one of the translations of his work into English. We marked this up with schema.org but there are two new terms we are proposing, shown here: translator and translationOfWork.
Acknowledgements: Karen Smith-Yoshimura
This work will help us better understand the extent to which information is shared across cultures. Here are some questions we should be able to answer soon!
More information on this project is available on the OCLC Research website at the link shown.
Just think: how many of the 150+ million records that are for non-English-language works are actually translations of English-language works? And how many of the English-language records are also translations? It could be as high as 25%.
Once we have these records generated, it will open many new possibilities.
Also, we know we have 300 million records, but how many real resources do we have? GLIMIR will produce these figures.
We are just starting…
VIAF expression level displays will include GLIMIR Ids.
Multilingualism ifla 2014 08
IFLA - Lyon, France 19 August 2014
Working with Karen Smith-Yoshimura, Robert Bremer, Eric Childress, Jean Godby,
Richard Greene, JD Shipengrover, Gail Thornburg, Jenny Toves, Diane Vizine-Goetz,
Shenghui Wang, Jay Weitz
• Resources in nearly all
• Contributed by more
than 20,000 libraries
• More than half the
database is for works
not in English
• Bibliographic Records
– Hybrid records
– Parallel records
• Clustered at Work
Create a consolidated
metadata summary for the
content of a work
Objective: Work Level Record
Work Level Record
Create better work
Users like C
• The Content Cluster
– Enables better work record displays by reducing the number of
lines that display for large works
– Enables a choice of format and presents the formats that could be
– Consolidates holdings for identical content
• The Manifestation Cluster is important
– Consolidates holdings at manifestation level
– In the short term allows the record catalogued in the language of
the interface to be chosen for display
– Reduces apparent duplication
– Allows a more accurate count of the number of manifestations in
WorldCat (as opposed to the number of records)
So far 103 million records processed (about 30%)
Multilingual Bibliographic Structure Project
Creates true multi-lingual displays
– At work and manifestation levels
– Using all available data instead of “most appropriate record”
– Generates data
Corrects many of the 28 million records coded
Better control and linking of translations
Input to refinement of work clusters
Smarter data storage
“Most appropriate” questioned
• Worldcat.org selects the most appropriate
record to show to a user as representative of
the work in the short result list and beyond
• The end result will not be very satisfactory from a
multi-lingual viewpoint… here’s why
Which record is better to present to a German speaker?
Build the display from all available data
Multilingual Bibliographic Structure Project
• Work level data, mined from all associated
bibliographic records will be displayed
supplemented with expression / manifestation
level data as the user drills through the short
to fuller versions of the metadata.
End user interface will show works and manifestations not bibliographic records; the
cataloguing client will also show bibliographic records
(Language of work)
• Language tagging of elements, particularly
– Summaries (M21 520)
– Subject headings
• Display in script preferred by the user if data is
• Improve translated interfaces
• Show consolidated holdings as appropriate
Great works are translated
• The cream of the world’s cultural and
knowledge heritage is shared by being
• WorldCat contains many rich cataloguing
records for these translations
GOAL: Data mine the really good records to
improve clustering, presentation, authority records
and linked data
The Iliad
紅樓夢
Dream of the Red Chamber
The Home and the World
Война и миръ
War and Peace
The Tale of Genji
સત્યના પ્રયોગો અથવા આત્મકથા
The Story of My Experiments with Truth [Gandhi autobiography]
Leo Tolstoy: 32 languages
Homer: 28 languages
Rabindranath Tagore: 21 languages
Isaac Bashevis Singer: 17 languages
Najīb Maḥfūẓ: 12 languages
Cao Xueqin: 9 languages
Mahatma Gandhi: 7 languages
Murasaki Shikibu: 7 languages
Improving work clustering
• Inconsistencies cause work clusters to be
incomplete resulting in less than optimal
– Titles without subtitles
– Missing or different forms of uniform title
– Inverted title
– Different coding of original and translated
Generated uniform title authority records will overcome most of these differences
without needing to edit individual records
XR VIAF Record
VIAF ID for Author
VIAF Linked Data
Title: Journey to the West
Translator: Anthony C. Yu
Title: Journey to the West
Translator: W. J. F. Jenner
Title: Tây du ký bình khảo
Translator: Phan Quân
Title: Monkeys Pilgerfahrt
Translator: Georgette Boner
Markup for the Semantic Web
# Original Work (in Chinese)
schema:creator <http://viaf.org/viaf/102266649> ; # "Gao, Xingjian"
# Translated Work (in English)
schema:creator <http://viaf.org/viaf/102266649> ; # "Gao, Xingjian"
[new]:translator <http://viaf.org/viaf/81663420> ; # "Lee, Mabel"
schema:name "Soul Mountain"@en ;
[new]:translationOfWork <http://worldcat.org/entity/work/id/1215997>
Understanding information sharing
• What percentage of non-English works are translations of
English works, and vice-versa?
• Which authors are translated the most?
• Which works have been translated into the most languages?
• Which countries translate the most English works, the most
• Which countries translate a new
work the fastest?
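Once translation records carry an author and a language, a question such as "which authors are translated the most?" becomes a simple count. The sketch below assumes a hypothetical flat record layout with `author` and `language` fields and uses placeholder data; it is not a description of how the analysis is actually run.

```python
from collections import defaultdict


def languages_per_author(translations):
    """Count distinct target languages per author, most translated first."""
    langs = defaultdict(set)
    for t in translations:
        langs[t["author"]].add(t["language"])
    return sorted(
        ((author, len(found)) for author, found in langs.items()),
        key=lambda pair: (-pair[1], pair[0]),
    )


# Placeholder records for illustration only.
translations = [
    {"author": "Author A", "language": "en"},
    {"author": "Author A", "language": "de"},
    {"author": "Author A", "language": "en"},  # second translation, same language
    {"author": "Author B", "language": "fr"},
]
ranking = languages_per_author(translations)
print(ranking)
```

Using a set per author means that several translations into the same language count once, which matches a "number of languages" figure rather than a "number of translations" figure.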
Where are we now?
• Work clusters done; ongoing refinement
• GLIMIR clustering done for all [simple] text;
– 103 million records have GLIMIR IDs
• Working on collected works
• Working on VIAF expression displays
• Work level displays in WorldCat.org ++
Data Mining for translations