



IMPACT: Challenges
and solutions

Hildelies Balk, IMPACT Project Director
KB, National Library of the Netherlands




Overview of this presentation



•   Challenges in digitisation of historical full text
•   IMPACT objectives
•   Approach
•   Achievements
•   Better, Faster, Cheaper




The content

• Shared vision in Europe: all cultural heritage available in digital form in this decade
• Billions of pages of historical (pre-1900) text in libraries in Europe
• Users expect full text to search, tag and re-use
• Just image and metadata not enough





The full text




     VVt Venetien den 1.Junij, Anno 1618.
     DJgn i f paffato te S' aö'Jifeert mo?üen/bah
     .)etgi'uotbciraetail)i.r/JtmelchontDecht
     te / sbnbe bele btr felbrr geiufttceert baer bnber
     eeniglje jprant o^fen/bie ftcb .met beSpaenfcbeu
     enbeeemgljen bifet Cbeiiupcen berbonbru befe




Challenges to OCR








Language challenges




 Historical variants of the Dutch word ‘wereld’ (world):

    werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds
    weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt
    werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts
    tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt
    weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts
    weereldt wereldje waereldje weurlt wald weëled
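
One way language resources can address this in retrieval is to map each historical spelling to a modern lemma, so that a search for the modern form also matches the variants. The following is a minimal Python sketch under that assumption; the `VARIANT_TO_LEMMA` table holds only a handful of the variants listed above, and all names are illustrative rather than part of any IMPACT tool.

```python
from collections import defaultdict

# Hypothetical mini-lexicon: a few of the historical variants listed above,
# all mapped to the modern lemma "wereld". A real lexicon would be far larger.
VARIANT_TO_LEMMA = {
    "werelt": "wereld",
    "weerelt": "wereld",
    "waereld": "wereld",
    "vvereld": "wereld",
    "werlt": "wereld",
}


def build_lemma_index(variant_to_lemma):
    """Invert the variant -> lemma mapping into lemma -> set of spellings."""
    index = defaultdict(set)
    for variant, lemma in variant_to_lemma.items():
        index[lemma].add(variant)
        index[lemma].add(lemma)  # the modern form also matches itself
    return index


def expand_query(term, variant_to_lemma, lemma_index):
    """Return every known spelling that should match a search for `term`."""
    lemma = variant_to_lemma.get(term.lower(), term.lower())
    return sorted(lemma_index.get(lemma, {lemma}))


LEMMA_INDEX = build_lemma_index(VARIANT_TO_LEMMA)
print(expand_query("wereld", VARIANT_TO_LEMMA, LEMMA_INDEX))
```

Searching for any listed variant returns the same expanded set, while a word missing from the lexicon simply falls back to itself.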




Answering the challenges: IMPACT

IMPACT – Improving Access to Text (2008-2011)
• Large-scale integrating research project
• Consortium of 26 partners
• Coordinated by the National Library of the Netherlands (KB)
• Co-funded by EU (FP7 ICT Work Programme)

Objectives:
Significantly improve mass digitisation of historical printed text by:
• Innovating OCR software and language technology
• Sharing expertise and building capacity across Europe
• Providing facilities for future research and development

                     Making text digitisation better, faster, cheaper!




  IMPACT Approach

• Content holders, researchers and industry work together to find solutions
• Based on real life problems in digitisation
• Tackle each step in the digitisation workflow from scan to full text

Workflow steps and contributing partners:
• Preparation and scanning: guidelines and case studies (all partners)
• Image enhancement: binarisation, noise removal, geometrical defects correction (NCSR, USAL, ABBYY)
• Segmentation and document analysis (USAL, NCSR, ABBYY)
• OCR: ABBYY FR; IBM Adaptive; Dictionaries/interface (LMU, INL); Experimental engines (USAL, NCSR, UIBK)
• Post correction and enrichment: CONCERT (IBM); Error Profiler (LMU); Language resources (9 partners); Platform for document understanding based on OCR (UIBK)
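
Binarisation, listed under image enhancement in the workflow above, turns a greyscale scan into ink and paper pixels. The sketch below uses Otsu's method, a standard global thresholding technique chosen here purely for illustration; the slides do not say which binarisation methods the partners used, and the function names are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximises between-class variance."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += histogram[level]
        if weight_bg == 0:
            continue  # no pixels at or below this level yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already in the dark class
        sum_bg += level * histogram[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarise(pixels, threshold):
    """Map pixels at or below the threshold to 0 (ink), the rest to 255 (paper)."""
    return [0 if value <= threshold else 255 for value in pixels]


# Toy "image": three dark ink pixels and three light paper pixels.
scan = [10, 12, 11, 200, 210, 205]
print(binarise(scan, otsu_threshold(scan)))
```

A single global threshold tends to struggle with the bleed-through and uneven lighting typical of historical pages, which is one reason binarisation is treated as a research topic rather than a solved step.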





IMPACT Approach - continued

 •   Tools to be coupled in Interoperability Framework
 •   Tested with Evaluation tools and metrics
 •   Against representative set of test data with Ground Truth
 •   Basis for further research and development
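
Evaluating against Ground Truth typically means comparing OCR output with a manually verified transcription. As an illustration only (this is not the IMPACT evaluation tooling, and the function names are assumptions), character accuracy can be sketched with an edit distance:

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(ground_truth, ocr_output):
    """1 - (edit distance / ground-truth length); may drop below 0 for very poor OCR."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return 1.0 - edit_distance(ground_truth, ocr_output) / len(ground_truth)


# Hypothetical example: the OCR engine read "w" as "vv".
print(character_accuracy("de wereld", "de vvereld"))
```

Character accuracy is only one possible metric; word-level and layout-level measures would need their own comparison logic.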








Achievements: summary
On market: Improved ABBYY FR Engine 10, Recognition Server 3, Cloud OCR
In use in productive environment:
• Service for document structure recognition
• Dutch and Slovene dictionary
• Aletheia
Ready for testing in productive environment:
• Adaptive OCR engine
• Tools for OCR correction with volunteer involvement
• Computer lexica for nine languages
• Digitisation Framework with evaluation tools and dataset
• Knowledge bank with guidelines and learning resources
For future development:
• Novel Approaches to preprocessing, OCR and post correction
• New language resources with Tools for lexicon building
IMPACT Centre of Competence for digitisation
• Added value: Unique network bringing together experts from different communities


Results: better & faster

• All tools evaluated in different test scenarios on IMPACT dataset
• All individual tools show improvement on state of the art
• Some examples of results – there is more!

Examples:
• Better: tested on 38718 randomly selected historical images and achieved a success of 98.93% (SoA up to 97.3%)
• Better: hybrid line segmentation on 2,700 text lines: SoA 90.9% → IMPACT 98.8%
• Better: recognition of old fonts FR9→FR10: 25% reduction of errors
• Better, faster: Adaptive OCR on small testset halves FOM (post processing level required)
• Better: rule set for extracting table of content entries from historical books outperforms best results of the INEX competition 2011
• Better: language resources show improvement for all 9 languages
• Faster: postcorrection with Error Profiler up to 2.7 times faster than without
• Faster: CONCERT increases correction speed up to 40%

Workflow steps and contributing partners:
• Image enhancement: binarisation, noise removal, geometrical defects correction (NCSR, USAL, ABBYY)
• Segmentation and document analysis (USAL, NCSR, ABBYY)
• OCR: ABBYY FR; IBM Adaptive; Dictionaries / interface (LMU, INL); Experimental OCR engines (USAL, NCSR, UIBK)
• Post correction and enrichment: CONCERT (IBM); Error Profiler (LMU); Language resources (9 partners); Document Understanding Platform (UIBK)
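
Accuracy figures such as those above can be restated as the share of errors removed, which makes gains near 100% easier to compare. A small sketch, assuming the quoted percentages are plain accuracies (the helper name is hypothetical):

```python
def relative_error_reduction(accuracy_before, accuracy_after):
    """Share of the original errors removed, given two accuracies in percent."""
    error_before = 100.0 - accuracy_before
    error_after = 100.0 - accuracy_after
    if error_before <= 0:
        raise ValueError("baseline has no errors to reduce")
    return (error_before - error_after) / error_before


# Hybrid line segmentation: state of the art 90.9% vs IMPACT 98.8%.
print(f"{relative_error_reduction(90.9, 98.8):.1%}")
# Test on 38718 images: state of the art up to 97.3% vs 98.93%.
print(f"{relative_error_reduction(97.3, 98.93):.1%}")
```

Read this way, the line segmentation result removes roughly 87% of the remaining errors (9.1% down to 1.2%), and the 38718-image test roughly 60% (2.7% down to 1.07%).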





Results: cheaper

Industry in IMPACT:
• ABBYY FR Historic Fonts Module more than 10 times cheaper; more flexible rates overall
• IBM Adaptive OCR and CONCERT: flexible rates

Research in IMPACT:
• Key Language resources free
• All tools by research partners free for research and free / low rates for
   non-commercial use (individual licensing required), subject
   to volume, kind of use and material, support etc.

Framework:
• Digitisation Framework free and open source
• Open source wrapper to plug in other tools
• Fruitful contacts with new tool providers




Benefits

For the digital library
• Rough average of all tests by developers on IMPACT dataset
  indicates consistent improvement of up to 20%
• Better access, faster and cheaper production

For the end user
• Main interest: retrieval = words searched and found correctly
• Preliminary results of ABBYY FR 10 with Dutch lexicon on difficult
  material (Dutch 17th-century newspaper): 15% increase of words
  found
  → For 1 M words this means 150 K more words found

                                        ...and this is just the beginning!
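
The retrieval arithmetic above can be written out explicitly. This sketch assumes, as the slide's own figures imply, that the 15% applies to the total number of words in the collection; the function name is illustrative.

```python
def additional_words_found(total_words, increase):
    """Extra words retrievable when `increase` (a fraction) more words are found."""
    return round(total_words * increase)


# 15% of 1 M words, as on the slide.
print(additional_words_found(1_000_000, 0.15))
```

If the 15% were instead relative to the words already found, the absolute gain would be smaller and would depend on the baseline recall, which the slides do not state.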

IMPACT Final Event 26-06-2012 - Summary of IMPACT project & results by Hildelies Balk (KB, IMPACT Project Director)


Editor's Notes

  1. Damaged pages, bleed through, difficult layout, historic fonts …