IMPACT Final Event 26-06-2012 - Summary of IMPACT project & results by Hildelies Balk (KB, IMPACT Project Director)
IMPACT: Challenges and solutions
Hildelies Balk, IMPACT Project Director, KB National Library of the Netherlands
Overview of this presentation
• Challenges in digitisation of historical full text
• IMPACT objectives
• Approach
• Achievements
• Better, Faster, Cheaper
The content
• Shared vision in Europe: all cultural heritage available in digital form in this decade
• Billions of pages of historical (pre-1900) text in libraries in Europe
• Users expect full text to search, tag and re-use
• Just image and metadata not enough
The full text
VVt Venetien den 1.Junij, Anno 1618. DJgn i f paffato te S aöJifeert mo?üen/bah .)etgiuotbciraetail)i.r/JtmelchontDecht te / sbnbe bele btr felbrr geiufttceert baer bnber eeniglje jprant o^fen/bie ftcb .met beSpaenfcbeu enbeeemgljen bifet Cbeiiupcen berbonbru befe
Challenges to OCR
Answering the challenges: IMPACT
IMPACT – Improving Access to Text (2008-2011)
• Large-scale integrating research project
• Consortium of 26 partners
• Coordinated by the National Library of the Netherlands (KB)
• Co-funded by EU (FP7 ICT Work Programme)
Objectives:
Significantly improve mass digitisation of historical printed text by:
• Innovating OCR software and language technology
• Sharing expertise and building capacity across Europe
• Providing facilities for future research and development
Making text digitisation better, faster, cheaper!
IMPACT Approach
• Content holders, researchers and industry work together to find solutions
• Based on real life problems in digitisation
• Tackle each step in the digitisation workflow from scan to full text
Workflow steps and contributing partners:
• Preparation and scanning: guidelines and case studies (all partners)
• Image enhancement: binarisation, noise removal, geometrical defects correction (NCSR, USAL, ABBYY)
• Segmentation and document analysis (USAL, NCSR, ABBYY)
• OCR: ABBYY FR; IBM Adaptive; dictionaries/interface (LMU, INL); experimental engines (USAL, NCSR, UIBK)
• Post correction and enrichment: CONCERT (IBM); Error Profiler (LMU); language resources (9 partners); platform for document understanding based on OCR (UIBK)
IMPACT Approach - continued
• Tools to be coupled in Interoperability Framework
• Tested with evaluation tools and metrics
• Against representative set of test data with Ground Truth
• Basis for further research and development
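Testing OCR output against Ground Truth can be illustrated with a character error rate. The following is a minimal sketch, not the IMPACT evaluation tooling: the slides do not name the metrics used, so the choice of character error rate, the function names and the sample strings are all assumptions made for illustration.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning one string into the other."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a hypothesis character
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth: str, ocr_output: str) -> float:
    """Edit distance divided by the length of the ground truth text."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ground_truth, ocr_output) / len(ground_truth)


# Hypothetical example: one substituted character in an eight-character word.
print(character_error_rate("Venetien", "Venetlen"))
```

A lower value means the OCR output is closer to the Ground Truth; comparing the value before and after applying a tool gives one simple measure of improvement.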
Achievements: summary
On market: Improved ABBYY FR Engine 10, Recognition Server 3, Cloud OCR
In use in production environment:
• Service for document structure recognition
• Dutch and Slovene dictionary
• Aletheia
Ready for testing in production environment:
• Adaptive OCR engine
• Tools for OCR correction with volunteer involvement
• Computer lexica for nine languages
• Digitisation Framework with evaluation tools and dataset
• Knowledge bank with guidelines and learning resources
For future development:
• Novel approaches to preprocessing, OCR and post correction
• New language resources with tools for lexicon building
IMPACT Centre of Competence for digitisation
• Added value: Unique network bringing together experts from different communities
Results: better & faster
• All tools evaluated in different test scenarios on IMPACT dataset
• All individual tools show improvement on state of the art
• Some examples of results – there is more!
Examples:
• Better: tested on 38718 randomly selected historical text images and achieved a success of 98.93% (SoA up to 97.3%)
• Better: hybrid line segmentation on 2,700 text lines: SoA 90.9% → IMPACT 98.8%
• Better: recognition of old fonts FR9 → FR10: 25% reduction of errors
• Better, faster: Adaptive OCR on small test set halves FOM (post processing level required)
• Better: language resources show improvement for all 9 languages
• Better: rule set for extracting table of content entries from historical books outperforms best results of the INEX competition 2011
• Faster: post correction with Error Profiler up to 2.7 times faster than without
• Faster: CONCERT increases correction speed up to 40%
Workflow steps covered: Image enhancement → Segmentation and document analysis → OCR → Post correction and enrichment
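The success rates quoted above can also be read as relative error reductions. The sketch below assumes that the error rate is simply 100% minus the reported success rate. The slides do not define the metric, so this reading and the function name are assumptions made for illustration.

```python
def relative_error_reduction(baseline_success_pct: float,
                             improved_success_pct: float) -> float:
    """Fraction of the baseline errors removed by the improved tool.

    Assumes error rate = 100 - success rate (an interpretation,
    not something stated in the slides).
    """
    baseline_error = 100.0 - baseline_success_pct
    improved_error = 100.0 - improved_success_pct
    if baseline_error <= 0:
        raise ValueError("baseline has no errors left to reduce")
    return (baseline_error - improved_error) / baseline_error


# Line segmentation figures from the slide: state of the art 90.9%, IMPACT 98.8%.
reduction = relative_error_reduction(90.9, 98.8)
print(f"{reduction:.0%} of the baseline errors removed")
```

Under that assumption, the line segmentation figures (90.9% to 98.8%) correspond to removing roughly 87% of the errors that remained with the state of the art.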
Results: cheaper
Industry in IMPACT:
• ABBYY FR Historic Fonts Module more than 10 times cheaper; more flexible rates overall
• IBM Adaptive OCR and CONCERT: flexible rates
Research in IMPACT:
• Key language resources free
• All tools by research partners free for research and free / low rates for non-commercial use (individual licensing required), subject to volume, kind of use and material, support etc.
Framework:
• Digitisation Framework free and open source
• Open source wrapper to plug in other tools
• Fruitful contacts with new tool providers
Benefits
For the digital library
• Rough average of all tests by developers on IMPACT dataset indicates consistent improvement of up to 20%
• Better access, faster and cheaper production
For the end user
• Main interest: retrieval = words searched and found correctly
• Preliminary results of ABBYY FR 10 with Dutch lexicon on difficult material (Dutch 17th century newspaper): 15% increase of words found
• For 1 M words this means 150 K more words found
...and this is just the beginning!
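The "150 K more words" figure follows from applying the 15% gain to the total word count. A minimal sketch of that arithmetic, assuming (as the slide does) that the percentage applies to all words in the collection; the function name is illustrative.

```python
def additional_words_found(total_words: int, gain_percent: int) -> int:
    """Extra words retrieved when the gain is a percentage of all words."""
    return total_words * gain_percent // 100


# Figures from the slide: 1 M words, 15% more words found.
print(additional_words_found(1_000_000, 15))  # 150000
```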