Transcript of "IMPACT Final Event 26-06-2012 - Automated metadata extraction from title pages. Report on the FEP pilot at the German National Library by Christa Schöning-Walter (DNB)"
IMPACT final event – The Hague – 26 June 2012
Metadata extraction from title pages
Evaluation of the FEP pilot at the German National Library
Christa Schöning-Walter

Outline
1. Institutional background
2. IMPACT test case
3. Strategic goals
4. Preliminary work
5. Results
6. Perspective

The German National Library (DNB) – some facts and figures (I)
− Legal deposit: Collecting, cataloguing, archiving and making available to the general public all German and German-language publications, publications about Germany etc from 1913
− DNBG = Law regarding the German National Library
− PflAV = Legal Deposit Regulation
− Since 2009: Considerations on and implementation of automated cataloguing processes

The German National Library (DNB) – some facts and figures (II)
− Collection size (January 2012): 27 million media units
− Daily input: 1,500 physical units (each with 2 copies)
− Since 2006: Collection mandate includes non-physical media (online publications)
− Bibliographic services:
  − National Bibliography
  − Authority files
  − Bibliographic standards
− 2 sites: Leipzig, Frankfurt am Main
Target of the IMPACT scenario
Opening questions (summer 2011):
− Can metadata extraction from title pages be done successfully by a rule engine in the case of simply structured monographic publications?
− Is this useful for accelerating the cataloguing processes if no machine-readable metadata from other sources is available?
Test case: Theses
− 14,000 print units annually
− simple structure !?

Starting point
Since January 2012:
− Experimental application studies in collaboration with the University of Innsbruck
− Using the rule-based exploitation features of FEP (Functional Extension Parser)
What is FEP?
− Software platform for the purpose of analysing the logical structure of documents
− Developed within IMPACT work package EE4 (Goal: enrichment of OCR output with structure information)

Strategic goals
In particular:
− Making descriptive cataloguing less time-consuming and processing of printed media faster by
  − Partial digitisation
  − Automated metadata extraction
  − Result transfer into the bibliographic record
  − Quality check and completion of cataloguing by the staff
Generally:
− Gaining experience in the area of automated metadata extraction / automated cataloguing

Conceptual design of the workflow
Example: http://d-nb.info/1017138931
[Workflow diagram. Legible labels: Accession (printed media units), literature, Service partner, Scan service (title page + ToC), OCR output / Indexing, Repository, FEP results, OAI-Harvester, Data Provider, Cataloguing, Bibliographic record, Quality check, Statistics, Stack]
The Objective: Automated exploitation of descriptive bibliographic data
− Specification, implementation, evaluation and gradual improvement of
  − Appropriate structure types
  − Dictionaries (controlled vocabulary, indicating keywords, abbreviations etc)
  − Expert rules
  − Etc

The idea: Taking bibliographic data over from metadata mining tools.
Illustration: University of Innsbruck

Preliminary work (I)
− Specification of the bibliographic statements to be mined from the title page

Attribute          Value
Publication year   2010
Language code      /1ger, /1eng
Creator            <last name>,<first name>
Title              <full title>:<additional title information>/<author statement>
Size               30 cm, 21 cm
Theses statement   <city name>, <corporate body name>, <type of publication>,<year of graduation>

Preliminary work (II)
− Going over some hundreds of title pages of theses (scans from 2009-2011 + documents from daily business)
− Exploring typical structural patterns / regularities etc, such as
  − Prefixes
  − Phrases
  − Notation
  − Position
[Diagram label: Expert rules]

Examples of indicating phrases to find out the creator:
  von (by)
  von <Verfasser> vorgelegte Dissertation (dissertation submitted by <author>)
  von Herrn/Frau: (by Mr/Ms:)
  vorgelegt von(:) (submitted by)
  vorgelegt JJJJ von (submitted YYYY by)
  vorgelegt dem Fachbereich ... von (submitted to the department ... by)
  Name: (name:)
  Name des Verfassers: (name of the author, male form)
  Name der Verfasserin: (name of the author, female form)
  verfasst von(:) (written by)
  eingereicht von (submitted by)
  ...
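The indicating phrases listed above can be turned into a small pattern matcher. The following Python sketch is only an illustration of the idea and is not the FEP rule engine: the function name `find_creator`, the regular expressions, the phrase ordering and the assumption that a name consists of two or more capitalised words on one line are all hypothetical. The output follows the `<last name>,<first name>` notation from the attribute list.

```python
import re

# Indicating phrases from the slide, most specific first so that
# "von Herrn/Frau" and "vorgelegt von" win over the bare prefix "von".
# The ordering and the exact regexes are assumptions of this sketch.
INDICATING_PHRASES = [
    r"von\s+(?:Herrn|Frau)\s*:?",
    r"Name\s+des\s+Verfassers\s*:",
    r"Name\s+der\s+Verfasserin\s*:",
    r"vorgelegt\s+\d{4}\s+von",      # "vorgelegt JJJJ von"
    r"vorgelegt\s+von\s*:?",
    r"verfasst\s+von\s*:?",
    r"eingereicht\s+von",
    r"Name\s*:",
    r"von",
]

# Assumption: a personal name is two or more capitalised words on one line;
# hyphenated parts such as "Anna-Lena" count as one word.
NAME_TOKEN = r"[A-ZÄÖÜ][^\W\d_]+(?:-[A-ZÄÖÜ][^\W\d_]+)*"
NAME = rf"(?P<name>{NAME_TOKEN}(?:[ \t]+{NAME_TOKEN})+)"


def find_creator(ocr_text):
    """Return '<last name>,<first name>' for the first phrase that matches, else None."""
    for phrase in INDICATING_PHRASES:
        # The phrase is matched case-insensitively, the name case-sensitively.
        pattern = rf"(?<!\w)(?i:{phrase})\s*{NAME}"
        match = re.search(pattern, ocr_text)
        if match:
            parts = match.group("name").split()
            return f"{parts[-1]},{' '.join(parts[:-1])}"
    return None


title_page = "Inaugural-Dissertation\nvorgelegt von\nErika Mustermann\naus Leipzig"
print(find_creator(title_page))
```

A real title page would need more than this, for example checking the candidate against a name dictionary such as the Name Authority File subset mentioned in the preliminary work.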
Preliminary work (III)
Choosing / preparing dictionaries for tagging, matching and mapping purposes:
− List of universities entitled to award degrees (identifying the corporate bodies)
− Name Authority File subset (identifying personal names)
− List of academic degrees

Theses statement items (examples):
  …
  Berlin, ESCP Europe Wirtschaftshochschule
  Berlin, Freie Univ.
  Berlin, Humboldt-Univ.
  Berlin, Steinbeis-Hochsch.
  Berlin, Techn. Univ.
  Berlin, Univ. der Künste
  …

Academic degrees (examples):
  …
  M.A.    Master of Arts / Magister Artium
  M.Sc.   Master of Science
  M.Eng.  Master of Engineering
  LL.M.   Master of Laws / Legum Magister
  M.F.A.  Master of Fine Arts
  M.Mus.  Master of Music
  M.Ed.   Master of Education
  …

Preliminary work (IV)
− Setting up a sample of documents for evaluation purposes:
  − 1,000 theses from several universities
  − Publication year: 2010 – 2011
  − Different dimensions (A- and B-size)
  − Scans: 300 dpi, bitonal
  − Transfer format: PDF (in future: XML files)
− Ground truth determination:
  − Manual region tagging on image files (done in Vietnam with the Aletheia tool)

Document processing in brief
− Database: Storage of all available information (OCR output, automatically or manually produced annotations, dictionaries, facts etc)
− Input of expert rules
− Rule engine: Stepwise proceeding taking intermediary results into account
Illustration: University of Innsbruck

Results
Second test phase with a revised list of universities (June 2012):
(1) total conformity
(2) complete title + noise (just to be deleted by the staff)
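The two result categories named above can be illustrated with a short comparison against the ground truth. This is a minimal Python sketch under stated assumptions, not the evaluation procedure used in the pilot: the function names, the whitespace and case normalisation, and the third category "other" are hypothetical additions.

```python
import re
from collections import Counter


def normalise(text):
    """Collapse whitespace and ignore case so layout differences do not count as errors."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def classify_title(extracted, ground_truth):
    """Sort one extracted title into a result category.

    "total conformity" and "complete title + noise" come from the slide;
    "other" is an assumption of this sketch for everything else.
    """
    e, g = normalise(extracted), normalise(ground_truth)
    if e == g:
        return "total conformity"
    if g and g in e:
        # The full title is present, but surrounded by extra text
        # that cataloguing staff would have to delete.
        return "complete title + noise"
    return "other"


def tally(pairs):
    """Count categories over (extracted, ground_truth) pairs."""
    return Counter(classify_title(e, g) for e, g in pairs)
```

Calling `tally` on a list of (extracted, ground truth) pairs gives the number of documents per category, which is the kind of figure a results chart for a test phase would be built from.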
Forecast: Feasibility study
− Technical and organisational requirements: Operational aspects, technical workflow, interfaces etc
− Further functional enhancements needed:
  − Dictionary maintenance: Expanding controlled vocabulary, sorting out unsuitable items etc
  − Taking additional facts into account: Ground truth etc
  − Additional expert rules (?)
  − Additional functions: Language guesser, document size etc
− Customising FEP (?)

New ideas
− Extraction of defined structures from the body of monographic publications, such as table of contents, abstracts, pure text (without any introductory remarks, footers, references etc)
Target:
− Improvement of the results of current automated subject cataloguing projects, such as
  − Thematic classification by machine learning techniques
  − Subject headings obtainment by text analysis techniques
Reducing the noise via preceding structure analysis processes

Thank you for your attention.
Christa Schöning-Walter, Staff position 'Automated Cataloguing', firstname.lastname@example.org
Sandra Hamm, Project leader, email@example.com
German National Library, Digital Services
Frankfurt am Main, Germany