Hildelies Balk, IMPACT Project Director, KB National Library of the Netherlands IMPACT: Challenges and solutions

IMPACT Final Conference - Hildelies Balk-Pennington de Jongh

Digitisation challenges and achievements so far on the IMPACT project, by Hildelies Balk-Pennington de Jongh

Published in: Technology

  • Quality of full text for historical documents is mostly poor: for the period 1600-1900, (much) less than half of the words are found in search
  • Damaged pages, bleed through, difficult layout, historic fonts …
  • Spelling variants, orthographical variants, inflected forms …and more
Transcript of "IMPACT Final Conference - Hildelies Balk-Pennington de Jongh"

    1. IMPACT: Challenges and solutions. Hildelies Balk, IMPACT Project Director, KB National Library of the Netherlands
    2. Overview of this presentation
       • Challenges in digitisation of historical full text
       • IMPACT objectives
       • Approach
       • Achievements
       • Better, Faster, Cheaper
    3. The Content
       • Shared vision in Europe: all cultural heritage available in digital form in this decade
       • Billions of pages of historical (pre-1900) text in libraries in Europe
       • Users expect full text to search, tag and re-use
       • Just image and metadata not enough
    4. The full text
       • VVt Venetien den 1.Junij, Anno 1618.
       • DJgn i f paffato te S' aö'Jifeert mo?üen/bah .)etgi'uotbciraetail)i.r/JtmelchontDecht te /
       • sbnbe bele btr felbrr geiufttceert baer bnber eeniglje jprant o^fen/bie ftcb .met
       • beSpaenfcbeu enbeeemgljen bifet Cbeiiupcen berbonbru befe
    5. Challenges to OCR:
    6. Language Challenges: historical variants of the Dutch word ‘wereld’ (world):
       werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald we ë led
    7. Answering the challenges – IMPACT
       • IMPACT – Improving Access to Text (2008-2011)
       • Large-scale integrating research project
       • Consortium of 26 partners
       • Coordinated by the National Library of the Netherlands (KB)
       • Co-funded by EU (FP7 ICT Work Programme)
       • Objectives:
       • Significantly improve mass digitisation of historical printed text by:
       • Innovating OCR software and language technology
       • Sharing expertise and building capacity across Europe
       • Providing facilities for future research and development
    8. IMPACT - Approach
       • Content holders, researchers and industry work together to find solutions
       • Based on real life problems in digitisation
       • Tackle each step in the digitisation workflow from scan to full text
       Workflow steps and partners involved:
       • Image enhancement (binarisation, noise removal, geometrical defects correction): NCSR, USAL, ABBYY
       • OCR: ABBYY FR; IBM Adaptive; dictionaries/interface (LMU, INL); experimental engines (USAL, NCSR, UIBK)
       • Segmentation and document analysis: USAL, NCSR, ABBYY
       • Post correction and enrichment: CONCERT (IBM); Error Profiler (LMU); language resources (9 partners); Document Understanding Platform (UIBK)
       • Preparation and scanning (guidelines and case studies): all partners
    9. IMPACT – Approach continued
       • Tools to be coupled in Interoperability Framework
       • Tested with Evaluation tools and metrics
       • Against representative set of test data with Ground Truth
       • Basis for further research and development
    10. IMPACT Achievements: summary
        • On market: Improved commercial OCR
        • Ready for testing in productive environment:
            • Adaptive OCR engine
            • Tools for OCR correction with volunteer involvement
            • Computer lexica for nine languages
            • Digitisation Framework with evaluation tools and dataset
            • Knowledge bank with guidelines and learning resources
            • Service for print space recognition
        • For future development:
            • Novel approaches to preprocessing, OCR and post correction
            • New language resources with tools for lexicon building
        • Centre of Competence for digitisation to start 1 January 2012
        • Added value: Unique network bringing together experts from different communities
    11.
    12. Results: Better and Faster
        • All tools evaluated in different test scenarios on the IMPACT dataset
        • All individual tools show improvement on the state of the art (SOA)
        • Some examples of results – there is more!
        Example results per workflow step (same workflow and partners as on slide 8):
        • Better: hybrid line segmentation on 2,700 text lines: SOA 90.9% -> IMPACT 98.8%
        • Better: recognition of old fonts, FR9 -> FR10, improved 25%
        • Better, faster: Adaptive OCR on small test set halves FOM (post processing level required)
        • Faster: CONCERT increases correction speed up to 40%
        • Faster: post correction with Error Profiler up to 2.7 times faster than without
        • Better: page split detected on 3,000 images from dataset: SOA 73% -> IMPACT 94%
        • Better: language resources show improvement for all 9 languages
    13. Results: Cheaper
        • Industry in IMPACT:
        • ABBYY FR Historic Fonts Module more than 10 times cheaper; more flexible rates overall
        • IBM Adaptive OCR and CONCERT: flexible rates
        • Research in IMPACT:
        • Key language resources free
        • All tools by research partners free for research, and free or at low rates for non-commercial use (individual licensing required), subject to volume, kind of use and material, support etc.
        • Framework:
        • Digitisation Framework free and open source
        • Open source wrapper to plug in other (free) tools
        • Fruitful contacts with new open source tool providers
        • Increasing number of IMPACT tools open source
    14. Benefits for the Digital Library
        • Reminder: IMPACT is a RESEARCH project
        • Fitness for productive use of tools already exceeds expectations
        • Rough average of all tests by developers on IMPACT dataset indicates consistent improvement of up to 20%
        • What does this mean for the Library objectives:
        • Better access, faster and cheaper production -> measured by: retrieval, time and money spent
        • Q1-3 2011: pilots carried out in house -> focus on user feedback and implementation issues
        • Q4 and beyond: pilots planned to measure all aspects
        • First test in productive environment: ABBYY FR 10 with Dutch lexicon on Dutch 17th-century newspapers
        • 20% increase in Word Accuracy (LD)
        • 15% improvement in word retrieval
    15. Benefits for the End User
        • Users of the Libraries:
            • Researchers in the humanities
            • General public
        • The end user's only interest: retrieval = words searched and found correctly
        • Preliminary results of OCR combined with dictionary on difficult material (17th-century newspapers) already indicate a 15% increase in words found
        • -> For 1 million words this means 150,000 more words found
        • And this is just the beginning
    16. Enjoy!
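
Slide 6 lists dozens of historical spellings of a single Dutch word, which is why a plain keyword search misses most occurrences. One common way to use a lexicon for this is to map every variant to its modern lemma at indexing time. The sketch below illustrates that idea only; it is not the IMPACT lexicon or tooling. The names `VARIANT_TO_LEMMA`, `build_lemma_index` and `search` are hypothetical, the sample pages are invented, and the mapping holds just five of the variants from the slide.

```python
from collections import defaultdict

# Hypothetical mini-lexicon: historical spelling -> modern lemma.
# Only a handful of the variants from slide 6; a real lexicon is far larger.
VARIANT_TO_LEMMA = {
    "werelt": "wereld",
    "weerelt": "wereld",
    "waereld": "wereld",
    "vvereld": "wereld",
    "werlt": "wereld",
}


def build_lemma_index(pages):
    """Map each modern lemma to the set of page ids containing it or a variant."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for token in text.lower().split():
            lemma = VARIANT_TO_LEMMA.get(token, token)
            index[lemma].add(page_id)
    return index


def search(index, query):
    """Look up a query by its modern lemma so historical variants match too."""
    lemma = VARIANT_TO_LEMMA.get(query.lower(), query.lower())
    return sorted(index.get(lemma, set()))


# Invented sample pages, for illustration only.
pages = {
    "p1": "de gantsche werelt",
    "p2": "in de waereld",
    "p3": "een ander woord",
}
index = build_lemma_index(pages)
print(search(index, "wereld"))  # ['p1', 'p2']
```

Normalising at indexing time keeps the query path simple: a user who types the modern form, or any variant known to the lexicon, gets the same pages.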
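
Slides 12 and 14 report results as accuracy figures, including "Word Accuracy (LD)". The slides do not define the metric. The sketch below assumes LD stands for Levenshtein distance and that word accuracy is one minus the word-level edit distance divided by the number of ground-truth words. It also shows how an accuracy gain translates into a relative reduction of the remaining errors. The function names and the sample strings are hypothetical and are not taken from the IMPACT evaluation tools or dataset.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute cost 1)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def word_accuracy(ground_truth, ocr_text):
    """Assumed definition: 1 - word-level edit distance / ground-truth word count."""
    truth_words = ground_truth.split()
    ocr_words = ocr_text.split()
    if not truth_words:
        raise ValueError("ground truth must contain at least one word")
    return 1 - levenshtein(truth_words, ocr_words) / len(truth_words)


def relative_error_reduction(accuracy_before, accuracy_after):
    """Share of the errors removed when accuracy rises (both given in percent)."""
    error_before = 100 - accuracy_before
    error_after = 100 - accuracy_after
    return (error_before - error_after) / error_before


# Invented example: one of five words is misrecognised.
print(word_accuracy("uit venetien den eersten junij",
                    "uit venetien ben eersten junij"))

# Accuracy figures from slide 12: line segmentation and page split detection.
print(round(relative_error_reduction(90.9, 98.8), 3))
print(round(relative_error_reduction(73, 94), 3))
```

Read this way, the line segmentation result on slide 12 (90.9% to 98.8%) removes roughly 87% of the remaining errors, and the page split result (73% to 94%) roughly 78%.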