BL Demo Day - July2011 - (8) IMPACT Functional Extension Parser
Katrien Depuydt gives a presentation on the 'Structural analysis of documents and the Functional Extension Parser (FEP)' at the IMPACT Demo Day at the British Library on 12 July 2011.

  • Down the Islands. A voyage to the Caribbees ... With illustrations. (1888), William Agnew Paton. BL Demonstrator Set [prima ids 465024-465278]
    • The images "TOC1 input.png" and "TOC2 input.png" show the embedded full text (OCR output) within the PDF output of ABBYY FineReader.
    • In "TOC1 input.png" there are three OCR errors which strongly affect the quality of the FEP analysis results:
      • a) The links to the page numbers of the first two TOC entries, Introduction and Chapter I, are not detected by the OCR.
      • b) According to the OCR, the third TOC entry (Chapter II) links to the page labelled with page number 2 (instead of 22).
    • These errors have the following impact on the analysis (visible in "TOC1 output.png"):
      • a) The entry Introduction is missed completely.
      • b) The second TOC entry ends after the two centred lines and has no link to the book content.
      • c) The second part of the second TOC entry is grouped together with the third TOC entry and has a wrong link to page number 2 instead of 22.
      • d) The fourth TOC entry contains no OCR errors and is therefore grouped and linked correctly.
    • The second TOC page ("TOC2 input.png") does not contain any OCR errors, and the FEP analysis results for it are correct.
    • For the TOC reconstruction the FEP performs as follows (25 TOC entries in total):
      • 1 TOC entry was missed
      • 2 TOC entries are grouped incorrectly
      • 1 TOC entry has no link
      • 1 TOC entry has a wrong link
      • 22 TOC entries are completely correct
    • The images "Example1.png" and "Example2.png" show the results of the logical structure analysis of the FEP. Correct labels are marked with a green border, wrong labels with a red border.
  • Transcript

    • 1. Structural analysis of documents Functional Extension Parser (FEP) Günter Mühlberger University Innsbruck Library (UIBK)
    • 2. Agenda
      • Introduction
      • Features
        • What do we recognise with the structural analysis?
      • Benefits
        • Why is structural analysis useful?
      • Architecture
        • How does it work?
      • Results
        • How good are we?
      • Roadmap
        • When will it come into being?
      • Business
        • Which offers will be available?
    • 3. Introduction
      • Document understanding platform
      • Try to enhance and exploit the logical structure of documents for
        • Display
        • Navigation
        • Retrieval
      • Enhance OCR output with structural metadata
        • Fully automated processing
        • Interactive correction
      IMPACT EVA/MINERVA 12 th Nov. 2008
    • 4. Features
      • General
        • We are able to recognise all structural elements which have some layout representation: e.g. region, size, typeface, distance to other elements, etc.
        • Focus in IMPACT: Basic features which are typical for all documents
        • Rules set can be extended or specified according to other datasets
          • E.g. journals, dissertations, index cards, yearbooks, newspapers, etc.
        • The better the OCR, the better our structural analysis
      • Basic features for books
        • Page numbers
        • Running titles (headers)
        • Print space
        • Footnotes
        • Signature marks
        • Headings (within the running text)
        • Table of contents entries (additional to headings)
        • Front/Body/Back
        • Paragraphs
    • 5.
      • Print space
      • Headings
      • Footnotes
    • 6.
      • Running title (header)
      • Page number
      • Signature mark
    • 7.
      • Table of contents
        • (linked with headings in the running text, respectively page numbers)
    • 8. Benefits (1)
      • Display
        • A correct print space allows page images to be displayed centred (no shifting between pages)
      • Search & retrieval
        • Scoring of results
          • Could take into account structural data (headings, footnotes)
        • Noise reduction
          • Front, body and back are separated; text from the front is often misleading
          • Running titles repeat the same words
          • Footnotes can be included or excluded
        • Faceted search
          • Results can be displayed for running text, footnotes, headings
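As a rough illustration of the noise-reduction and facet ideas on this slide, the following Python sketch searches only lines that carry chosen structural labels and groups the hits by label. The `Line` record, the label names, the `search` function and the sample lines are all hypothetical; they are not taken from FEP.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Line:
    page: int
    label: str  # hypothetical structural label, e.g. "running_text"
    text: str


# Labels excluded from search by default (an assumption based on the slide).
NOISE_LABELS = {"running_title", "signature_mark", "page_number"}


def search(lines, term, include=None):
    """Return hits grouped by structural label (the facets).

    If `include` is given, only those labels are searched; otherwise every
    label except the noise labels is searched.
    """
    term = term.lower()
    facets = defaultdict(list)
    for line in lines:
        if include is not None:
            if line.label not in include:
                continue
        elif line.label in NOISE_LABELS:
            continue
        if term in line.text.lower():
            facets[line.label].append(line)
    return dict(facets)


# Made-up sample lines, for illustration only.
lines = [
    Line(12, "running_title", "DOWN THE ISLANDS"),
    Line(12, "running_text", "We sailed down the islands at dawn."),
    Line(12, "footnote", "See the chart of the islands, p. 4."),
    Line(13, "running_title", "DOWN THE ISLANDS"),
    Line(13, "heading", "CHAPTER II"),
]

hits = search(lines, "islands")
footnote_hits = search(lines, "islands", include={"footnote"})
```

Here `hits` contains one facet for running text and one for footnotes, while the repeated running titles are left out unless they are requested explicitly through `include`.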
    • 9. Benefits (2)
      • Navigation
        • Page numbers allow usage of original table of contents
        • Original table of contents can be linked with headings/page numbers in the book
      • Document editing
        • Further mark up (e.g. TEI) is supported
        • Manual preparation for Print-on-Demand is eased (print space)
        • Selective OCR correction can be applied:
        • E.g. only headings, running text, footnotes could be fed to CONCERT
      • Document matching
        • Contributions or footnotes can be matched with existing bibliographical databases
    • 10.
      • Improved display in the Internet and PDF
    • 11.
      • Refinement of full-text search
      • Facets for e.g.
        • Running text
        • Footnotes
        • Headings
      • Less noise
        • Running titles, signature marks excluded from search
    • 12.
      • Clickable table of contents entries
        • Google style
      • Selective OCR correction
        • Correct only ToC, headings, footnotes, etc.
    • 13.
      • Matching of documents with external sources
        • Match footnotes with library catalogues (bibliographies)
        • Match table of contents entries and headings with bibliographies
    • 14.
      • Improved editing
        • Alternating print spaces for Print on Demand
        • Further processing for TEI editions etc.
    • 15. Architecture
      • Input
        • Results from OCR processing on word level (coordinates)
        • E.g. ALTO file, ABBYY XML file or Google HTML
      • Output
        • Structural annotations for recognized text features, e.g. page numbers, running titles, headings, etc.
        • E.g. XML, ALTO, METS, TEI, etc.
      • General workflow
        • OCR result files are parsed (FEP general XML format)
        • Rules set is applied to the dataset (rules are managed by rules engine)
        • Results are stored in a database
        • Export on various levels is provided
      • Optional
        • Online or offline correction (GUI)
        • Adaptation of rules set
        • Quality assurance on basis of ground truth
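The input step above (word-level OCR results with coordinates) might look roughly like the following Python sketch, which reads words from an ALTO file. The `Word` record, the helper names and the tiny sample document are assumptions made for illustration; the FEP general XML format itself is not shown in the slides and is not modelled here.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Tiny hand-written ALTO sample; real files are larger and may use a
# different namespace version.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="B1">
          <TextLine ID="L1">
            <String CONTENT="22" HPOS="960" VPOS="120" WIDTH="80" HEIGHT="40"/>
          </TextLine>
          <TextLine ID="L2">
            <String CONTENT="CHAPTER" HPOS="700" VPOS="400" WIDTH="380" HEIGHT="60"/>
            <String CONTENT="II" HPOS="1120" VPOS="400" WIDTH="90" HEIGHT="60"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


@dataclass
class Word:
    text: str
    x: float
    y: float
    width: float
    height: float
    line_id: str


def local_name(tag):
    """Reduce a namespaced tag such as '{uri}String' to 'String'."""
    return tag.rsplit("}", 1)[-1]


def parse_words(alto_xml):
    """Collect every word with its coordinates, whatever the namespace version."""
    root = ET.fromstring(alto_xml)
    words = []
    for line in root.iter():
        if local_name(line.tag) != "TextLine":
            continue
        for el in line:
            if local_name(el.tag) != "String":
                continue
            words.append(Word(
                text=el.get("CONTENT", ""),
                x=float(el.get("HPOS", 0)),
                y=float(el.get("VPOS", 0)),
                width=float(el.get("WIDTH", 0)),
                height=float(el.get("HEIGHT", 0)),
                line_id=line.get("ID", ""),
            ))
    return words


words = parse_words(SAMPLE_ALTO)
```

A list of such word records is the kind of input a rule set could then be applied to.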
    • 16.
    • 17. The FEP Core
      • Based on an expert-system-like rule engine for Java (Jess)
      • Both manually crafted rules and rules obtained by machine learning
      • Uses fuzzy logic to deal with uncertainty
      • Typical rules:
      • IF there is a numeral in the first line of the page AND this numeral is centred THEN this numeral may be the page number
      • IF there is a numeral in the first line of the page AND this numeral is at the right hand side of the page AND this numeral is an odd number THEN this numeral may be the page number
      • IF there is a numeral in the first line of the page AND this numeral is at the left hand side of the page AND this numeral is an even number THEN this numeral may be the page number.
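The three page-number rules can be sketched in Python as follows. This is not Jess syntax and not the FEP rule set: the `Token` record, the membership functions, the `tolerance` value and the use of `max` as a fuzzy OR are all assumptions chosen to show how "may be the page number" can become a graded confidence rather than a yes/no answer.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    x_center: float  # horizontal centre of the token, in pixels
    line_index: int  # 0 = first line of the page


def centred_degree(x_center, page_width, tolerance=0.15):
    """Fuzzy membership in 'centred': 1.0 at the page middle, falling
    linearly to 0.0 at tolerance * page_width away from it."""
    offset = abs(x_center - page_width / 2) / page_width
    return max(0.0, 1.0 - offset / tolerance)


def side_degree(x_center, page_width, side):
    """Fuzzy membership in the 'left' or 'right' hand side of the page:
    0.0 at the page middle, 1.0 at the outer edge."""
    rel = x_center / page_width  # 0.0 = left edge, 1.0 = right edge
    value = rel if side == "right" else 1.0 - rel
    return min(1.0, max(0.0, (value - 0.5) * 2))


def page_number_score(token, page_width):
    """Return a confidence in [0, 1] that `token` is the page number."""
    if token.line_index != 0 or not token.text.isdecimal():
        return 0.0
    number = int(token.text)
    # Rule 1: numeral in the first line, centred.
    scores = [centred_degree(token.x_center, page_width)]
    if number % 2 == 1:
        # Rule 2: odd numeral on the right hand side.
        scores.append(side_degree(token.x_center, page_width, "right"))
    else:
        # Rule 3: even numeral on the left hand side.
        scores.append(side_degree(token.x_center, page_width, "left"))
    return max(scores)  # fuzzy OR over the three rules


PAGE_WIDTH = 2000
centred = page_number_score(Token("22", 1000, 0), PAGE_WIDTH)
odd_right = page_number_score(Token("23", 1900, 0), PAGE_WIDTH)
odd_left = page_number_score(Token("23", 100, 0), PAGE_WIDTH)
```

With these assumed membership functions, a centred numeral scores 1.0, an odd numeral near the right edge scores about 0.9, and an odd numeral near the left edge scores 0.0.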
    • 18. Results
      • Basic rules set
        • General features for books from 1700 to 2000
        • Dataset of 155 books, 30.673 pages (141 training set, 41 evaluation set)
        • All books were manually annotated (ground truth)
      • Recall, Precision, F-Measure
        • E.g. 10 lines with headings in a book. We find 12 lines, 8 of them are correct, 4 are false.
        • Recall = 8 of 10 = 0,80
        • Precision = 8 of 12 ≈ 0,67
        • F-Measure = 2 × 0,80 × 0,67 / (0,80 + 0,67) ≈ 0,73
      • More explanations
        • Important: We are counting lines, not structural items!
          • E.g. if a heading consists of two lines (often set in different type sizes), we have to find both to succeed
        • Differences between the training and evaluation sets are marginal
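The line-based scoring on this slide can be reproduced with a few lines of Python. The function name and the line identifiers are hypothetical; the counts are those of the heading example (10 heading lines in the book, 12 lines found, 8 of them correct).

```python
def line_scores(predicted, ground_truth):
    """Recall, precision and F-measure over sets of line identifiers."""
    correct = len(predicted & ground_truth)
    recall = correct / len(ground_truth) if ground_truth else 0.0
    precision = correct / len(predicted) if predicted else 0.0
    if precision + recall == 0:
        return recall, precision, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure


# 10 true heading lines (ids 1..10); 12 lines found, 8 of them correct.
ground_truth = set(range(1, 11))
predicted = set(range(1, 9)) | {101, 102, 103, 104}

recall, precision, f_measure = line_scores(predicted, ground_truth)
```

Working from the unrounded counts gives a recall of 0.8, a precision of 8/12 and an F-measure of 16/22 (about 0.727); rounding the intermediate values first shifts the last digit slightly.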
    • 19. Results on Evaluation Set
          Feature          Recall  Precision  F-measure
          Running text     0,99    0,98       0,98
          Footnotes        0,83    0,89       0,86
          Page numbers     0,97    1          0,98
          Running titles   0,97    1          0,98
          Heading          0,85    0,80       0,82
          Signature marks  0,68    0,89       0,77
    • 20. Roadmap
      • Summer 2011: Beta version
        • Integration into IMPACT Interoperability Platform
        • Basic rules set: books from 1700 to 1900
      • End of the year: Version 1.0
        • Full featured version
        • Enhanced online correction interface
        • FEP as a service, not as a product for local installation
    • 21. Business offers
      • Web-service for processing single volumes and correction
        • Will be integrated into the eBooks-on-Demand (EOD) network
        • 30 libraries are already uploading their images to the OCR server in Innsbruck
        • FEP will be an additional service for general material
        • Similar offers can be made to other libraries or networks as well
      • Adaptation of rules set
        • For specific datasets much more can be detected than just the basic features
        • E.g. journals with a fixed structure over many years or parliamentary papers, dissertations, research papers, etc.
      • Onsite installations
        • Not our focus, but could be done for very large datasets or due to legal requirements (e.g. Google images)
    • 22.
    • 23.
    • 24.
    • 25.
    • 26. Results: TOC
      • 25 TOC entries in total
      • 22 TOC entries are completely correct
      • 1 TOC entry was missed
      • 2 TOC entries are grouped incorrectly
      • 1 TOC entry has no link
      • 1 TOC entry has a wrong link
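The "no link" and "wrong link" cases can be illustrated with a small Python sketch of linking TOC entries to page images through the printed page number read by the OCR. The `TocEntry` record, the `link_toc` function, the page-to-image mapping and the page number for Chapter III are hypothetical; only the Chapter II misreading (2 instead of 22) comes from the slide notes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TocEntry:
    title: str
    printed_page: Optional[int]  # page number as read by the OCR, if any


def link_toc(entries, page_index):
    """Map each TOC title to an image index, or None when no link can be made.

    `page_index` maps printed page numbers to image indices.
    """
    links = {}
    for entry in entries:
        if entry.printed_page is None:
            links[entry.title] = None
        else:
            links[entry.title] = page_index.get(entry.printed_page)
    return links


# Hypothetical book: printed page n is found on image n + 10.
page_index = {n: n + 10 for n in range(1, 300)}

entries = [
    TocEntry("Chapter I", None),  # page number not detected by the OCR
    TocEntry("Chapter II", 2),    # OCR read "2" instead of "22"
    TocEntry("Chapter III", 41),  # read correctly (made-up number)
]
links = link_toc(entries, page_index)
```

In this sketch Chapter I ends up without a link, Chapter II is linked to image 12 instead of image 32, and Chapter III is linked correctly, which shows how directly OCR errors in the page numbers carry over into the reconstructed table of contents.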
    • 27. Thank you for your attention!
