Word Occurrence Based Extraction of Work Contributors from Statements of Responsibility


This paper addresses the identification of all contributors to an intellectual work when they are recorded in bibliographic data only in unstructured form. National bibliographies are very reliable in representing the first author of a work, but secondary contributors frequently appear only in the statements of responsibility that cataloguers transcribe from the book into the bibliographic record. Identifying the work contributors mentioned in statements of responsibility is a typical motivation for applying information extraction techniques. This paper presents an approach developed for the specific application scenario of the ARROW rights infrastructure, which is being deployed in several European countries to assist in determining the copyright status of works that may not be in the public domain. The approach performed reliably in most languages and bibliographic datasets of at least one million records, achieving precision and recall above 0.97 on five of the six evaluated datasets. We conclude that it can be reliably applied to other national bibliographies and languages.



  1. Word Occurrence Based Extraction of Work Contributors from Statements of Responsibility. Nuno Freire, The European Library. TPDL-2013, Valletta, September 2013
  2. Overview. Statements of responsibility from library bibliographic data: “French Canadian freely arranged by Katherine K. Davis”. “ed. by Peter Noever ; with a forew. by Frank O. Gehry; and contrib. by Coop Himmelblau.” “W. Lange, A.C. Zeven and N.G. Hogenboom, editors” “Vicente Aleixandre ; estudio previo, selección y notas de Leopoldo de Luis”. Extracting work contributors for use in a rights infrastructure: ARROW, http://arrow-net.eu
  3. Outline: The context (the ARROW rights infrastructure; the use of national bibliographies in ARROW) • The problem • The approach • Evaluation • Conclusion and future work
  4. The ARROW rights infrastructure. ARROW aims to support mass digitisation projects with automated ways to clear the rights of the books to be digitised. To identify and clear the rights associated with a book, a complex process needs to be undertaken: • Determine the work(s) contained within the book • Identify all the other expressions of the same work(s) • Identify the publisher(s) and contributor(s) involved • Determine the dates of publication at work level • Determine whether that work(s), and not the book itself, is still in commerce • If necessary, obtain any licenses from the rights holders or collective rights organizations
  5. What is ARROW. A rights infrastructure and system for the identification of: • Rights status (in or out of copyright; in or out of print / commercialised or not) • Rights (which rights are involved) • Right holders (authors; publishers) • How and where to clear the rights • Orphan Works and their registration
  6. Sources of Information in ARROW. ARROW makes information available from several sources: • The European Library: national bibliographies, to identify the book and to cluster it with all other books containing the same intellectual work; the Virtual International Authority File, to better identify the authors and support the identification of in-copyright works • Books in Print database, to know if any of the books concerned are actively commercialised by any publisher • Reproduction Rights Organisation, to see if they know or can trace the rightholders
  7. The ARROW Workflow
  8. The Role of Libraries • National Libraries as Metadata Providers • Provide the National Bibliographies to The European Library
  9. The Role of The European Library (TEL) • To match library requests with national bibliographies • To identify all other manifestations that potentially share intellectual work with a manifestation • To create a Work record: work metadata, manifestations, contributors, etc.
  10. The Role of Books-in-Print (BIP) • To provide data about in print/out of print status • To provide data about publishers • To add new manifestation records of the work
  11. The Role of Reproduction Rights Organisations (RRO) • RROs as Metadata Providers • To provide data about authors and publishers • To provide data about available licenses
  12. Statements of responsibility • These statements usually contain information about authorship, editors, photographers, translators, and others involved in creating the work • In printed books, the statement of responsibility is typically present on the title page • The statement of responsibility is transcribed by the cataloguer exactly as it appears in the book (according to the Anglo-American Cataloguing Rules)
  13. Examples of statements of responsibility: “French Canadian freely arranged by Katherine K. Davis”. “ed. by Peter Noever ; with a forew. by Frank O. Gehry; and contrib. by Coop Himmelblau.” “W. Lange, A.C. Zeven and N.G. Hogenboom, editors” “by Pamela and Neal Priestland” “Vicente Aleixandre ; estudio previo, selección y notas de Leopoldo de Luis”
  14. The problem • National bibliographies are reliable in representing the first author of a work in structured form • But secondary contributors are often not represented in structured form • Secondary contributors may reside only within the statements of responsibility
  15. The approach • Approach the problem as a Named Entity Recognition task in text that may not be grammatically correct and thus lacks lexical evidence • Requirements from the ARROW context: easily applicable to several languages; the outcomes of the recognition task must be explainable • Design decisions: exploit the structured data within national bibliographies, by analysing the frequency of word occurrences in person names and in other textual data • Using word occurrence frequency bypasses the need for building training sets and allows simpler explanations of the name recognition results
  16. The process – pre-processing. A pre-processing of each national bibliography is performed: • Word frequency is calculated • The frequency values are normalized so that they are independent of the size of the national bibliography • The pre-processing results in four dictionaries: words in titles; words in persons' surnames; words in other parts of persons' names than the surname; words that appear in lowercase in person names (such as “von” in German names, or “de” in Portuguese names) • The dictionaries associate each word with its normalized frequency (a sketch of this step follows below)
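A minimal sketch of this pre-processing step, assuming the bibliography is available as an iterable of records with a title string and a list of (surname, other name parts) pairs; the field names are hypothetical and not the actual record format used in ARROW or TEL.

    from collections import Counter

    def build_dictionaries(records):
        # Build the four normalized word-frequency dictionaries from a
        # national bibliography. `records` is assumed to be an iterable of
        # dicts with a 'title' string and a 'persons' list of
        # (surname, other_name_parts) pairs; this layout is hypothetical.
        titles, surnames, other_parts, lowercase_words = (
            Counter(), Counter(), Counter(), Counter())

        for record in records:
            for word in record["title"].split():
                titles[word.lower()] += 1
            for surname, other in record["persons"]:
                for word in surname.split():
                    surnames[word.lower()] += 1
                    if word.islower():
                        lowercase_words[word] += 1
                for word in other.split():
                    other_parts[word.lower()] += 1
                    if word.islower():
                        lowercase_words[word] += 1

        def normalize(counter):
            # Divide by the total token count so that values are comparable
            # across bibliographies of different sizes.
            total = sum(counter.values()) or 1
            return {word: count / total for word, count in counter.items()}

        return {
            "titles": normalize(titles),
            "surnames": normalize(surnames),
            "other_name_parts": normalize(other_parts),
            "lowercase_name_words": normalize(lowercase_words),
        }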
  17. The process – bibliographic record processing. The named entity recognition is performed for a record as follows: • The statement of responsibility is tokenized • The person names are recognized by comparing the tokens with the dictionaries • The recognized names are compared against the names of the contributors present in the structured fields of the record • If no similar name exists in the record, the contributor is added to the record in a structured data field (a sketch of this step follows below)
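A sketch of this per-record step, under the same assumptions as the pre-processing sketch above; recognize_names stands in for the token-sequence recognizer of the next slide, and the string-similarity test and threshold are illustrative placeholders rather than the matching actually used in ARROW.

    import re
    from difflib import SequenceMatcher

    def process_record(record, dictionaries, recognize_names, threshold=0.85):
        # Extract contributors from a record's statement of responsibility
        # and add those missing from the structured fields.

        # 1. Tokenize the statement of responsibility.
        statement = record.get("statement_of_responsibility", "")
        tokens = re.findall(r"[^\s,;:.()\[\]]+", statement)

        # 2. Recognize person names by comparing tokens with the dictionaries.
        candidates = recognize_names(tokens, dictionaries)

        # 3. Compare recognized names against the contributors already
        #    present in the structured fields of the record.
        def similar(a, b):
            return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

        contributors = record.setdefault("contributors", [])
        for name in candidates:
            if not any(similar(name, existing) for existing in contributors):
                # 4. No similar name exists: add the contributor in structured form.
                contributors.append(name)
        return record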
  18. The process – named entity recognition. Possible token sequences used to locate person names (in Augmented Backus–Naur Form): non-ambiguous-surname / ( initial / non-ambiguous-first-name / non-ambiguous-surname / non-ambiguous-non-capitalized-name ) *( initial / first-name / surname / non-capitalized-name ) surname (more details on the definition of these tokens are included in the paper; a sketch of a recognizer follows below)
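A simplified recognizer for these token sequences might look as follows. It assumes each token is classified (initial, surname, first name, non-capitalized name word) by comparing its normalized frequencies across the four dictionaries, and treated as non-ambiguous when its name frequency clearly dominates its title frequency; the labels, dominance ratio and matching loop are illustrative, and the paper defines the token classes and grammar in more detail.

    import re

    def classify(token, dictionaries, dominance=5.0):
        # Coarse token classification by comparing dictionary frequencies.
        if re.fullmatch(r"[A-Z]\.?", token):
            return "initial", True
        key = token.lower()
        freq_title = dictionaries["titles"].get(key, 0.0)
        candidates = [
            ("surname", dictionaries["surnames"].get(key, 0.0)),
            ("first-name", dictionaries["other_name_parts"].get(key, 0.0)),
            ("non-capitalized-name",
             dictionaries["lowercase_name_words"].get(token, 0.0)),
        ]
        label, freq = max(candidates, key=lambda item: item[1])
        if freq == 0.0:
            return None, False
        # Non-ambiguous: the token occurs far more often in names than in titles.
        return label, freq > dominance * freq_title

    def recognize_names(tokens, dictionaries):
        # A name is either a single non-ambiguous surname, or a sequence that
        # starts with a non-ambiguous name token (or an initial), continues
        # with name tokens, and ends with a surname.
        names, i = [], 0
        while i < len(tokens):
            label, unambiguous = classify(tokens[i], dictionaries)
            if label is None or not unambiguous:
                i += 1
                continue
            end_surname = i if label == "surname" else None
            j = i + 1
            while j < len(tokens):
                next_label, _ = classify(tokens[j], dictionaries)
                if next_label is None:
                    break
                if next_label == "surname":
                    end_surname = j
                j += 1
            if end_surname is not None:
                names.append(" ".join(tokens[i:end_surname + 1]))
                i = end_surname + 1
            else:
                i = j
        return names

With suitable dictionaries, running this over the tokens of “freely arranged by Katherine K. Davis” would skip the common words and return “Katherine K. Davis”.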
  19. Evaluation data set (size of bibliographies and evaluation samples):
      National Bibliography | Total records | Main language | Statements of responsibility (sample) | Referred persons
      British Library | 13.4 million | English | 205 | 328
      German National Library | 9.4 million | German | 200 | 378
      National Library of the Netherlands | 3.2 million | Dutch | 200 | 335
      National Library of Greece | 0.4 million | Greek | 297 | 379
      Central Institute for the Union Catalogue of Italian Libraries | 12.4 million | Italian | 224 | 297
      Royal Library of Belgium | 1 million | French and Dutch | 203 | 387
      Total | | | 1329 | 2104
  20. Evaluation results:
      Dataset | Exact match precision | Exact match recall | Partial match precision | Partial match recall
      British Library | 0.981 | 0.975 | 0.979 | 0.934
      German National Library | 0.991 | 0.992 | 0.991 | 0.992
      National Library of the Netherlands | 0.973 | 0.875 | 0.977 | 0.979
      National Library of Greece | 0.656 | 0.414 | 0.758 | 0.868
      Central Institute for the Union Catalogue of Italian Libraries | 0.97 | 0.896 | 0.971 | 0.973
      Royal Library of Belgium | 0.981 | 0.948 | 0.959 | 0.837
      Overall | 0.981 | 0.958 | 0.982 | 0.963
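For reference, a minimal sketch of how exact and partial match precision and recall can be computed over such an evaluation sample; this is a common formulation of the two metrics and not necessarily the exact scoring used in the paper.

    def evaluate(predicted, gold):
        # `predicted` and `gold` are parallel lists, one entry per statement
        # of responsibility, each a list of person-name strings. Exact match
        # requires identical strings; partial match accepts a name that shares
        # at least one word with a counterpart. Illustrative only.
        def overlaps(a, b):
            return bool(set(a.lower().split()) & set(b.lower().split()))

        exact_tp = partial_pred_tp = partial_gold_tp = 0
        n_pred = sum(len(p) for p in predicted) or 1
        n_gold = sum(len(g) for g in gold) or 1

        for pred, ref in zip(predicted, gold):
            exact_tp += len({p.lower() for p in pred} & {g.lower() for g in ref})
            partial_pred_tp += sum(any(overlaps(p, g) for g in ref) for p in pred)
            partial_gold_tp += sum(any(overlaps(g, p) for p in pred) for g in ref)

        return {
            "exact": {"precision": exact_tp / n_pred, "recall": exact_tp / n_gold},
            "partial": {"precision": partial_pred_tp / n_pred,
                        "recall": partial_gold_tp / n_gold},
        }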
  21. Evaluation results analysis. The main causes of recognition errors: • Foreign person names negatively affected recall • Names of persons used in names of organizations negatively affected precision • Two persons with the same surname mentioned together negatively affected recall, for example: “hrsg. von Volker und Michael Kriegeskorte”; “by Pamela and Neal Priestland”
  22. Conclusions • The approach performed reliably in most languages and bibliographic datasets (datasets of at least one million records; precision and recall above 0.97 on all but one dataset) • The results obtained on the Greek national bibliography were not satisfactory: this dataset has distinct characteristics from the others (smaller size, a different alphabet, a different language) • Further investigation of the Greek national bibliography is necessary
  23. Future work • Evaluation of the impact of this solution on the final results of the rights clearance process of ARROW • Building the dictionaries from comprehensive sources of person names: Virtual International Authority File (VIAF), International Standard Name Identifier (ISNI) • Further functionality: recognition of organization names; recognition of the role of the recognized contributors (illustrator, editor, etc.) • Other application scenarios: Functional Requirements for Bibliographic Records; Resource Description and Access
  24. Acknowledgments • The European Library: Marcela Strelcova, Chiara Latronico and Eva Kralt-Yap • Associazione Italiana Editori • University of Innsbruck • This work was partially supported by the ARROWplus project, co-funded by the European Commission programme eContentplus
  25. Thank you. Questions or comments? Contact: Nuno Freire – nuno.freire@kb.nl
