IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National
Library of the Netherlands.




IMPACT Centre of Competence
Text digitisation: Faster, Better, Cheaper




 Hildelies Balk and Clemens Neudecker, KB
 Workshop Recent Developments in OCR for Digital
 Libraries, Rouen 31 March 2011




Overview of this presentation

 Challenges to be addressed

 The IMPACT Project and Outcomes

 The IMPACT Centre of Competence

 Get involved now!


 Twitter: @impactocr, #impactproject




KB Digital Library Programme
 Goal: Offer everyone access to everything published in and about
  the Netherlands through the internet

 2013: 10% of the publications published in and about the
  Netherlands available in digital form

 Example projects:
  Historical Newspapers – http://kranten.kb.nl
  Dutch Parliamentary Papers – http://www.statengeneraaldigitaal.nl/
  Dutch Print Online – http://www.dutchprintonline.nl/

 Timeframe covered: 1618 - 1995


Historical text: typical OCR results




VVt Venetien den 1.Junij, Anno 1618.
DJgn i f paffato te S' aö'Jifeert mo?
  üen/bah .)etgi'uotbciraetail)i.r/JtmelchontDecht te /
sbnbe bele btr felbrr geiufttceert baer bnber eeniglje jprant o^fen/bie ftcb
  .met
beSpaenfcbeu enbeeemgljen bifet Cbeiiupcen berbonbru befe

OCR Challenges: damaged pages, bleed-through, difficult layout, historic fonts … and many more

Language Challenges: spelling variants, orthographical variants, inflected forms … and more




Historical variants of the Dutch word ‘wereld’ (world):
werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt
wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels
zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts
werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts
werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled
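
The variant list above suggests why plain string matching fails on historical text. A minimal sketch of lexicon-based normalisation, assuming a hand-made mini-lexicon built from a few of the spellings on this slide and a fuzzy fallback from Python's standard `difflib`; the names `VARIANT_LEXICON`, `MODERN_LEMMAS` and `normalise` are illustrative and are not part of any IMPACT tool:

```python
from difflib import get_close_matches

# Hypothetical mini-lexicon: historical spellings taken from the slide,
# each mapped to the modern lemma "wereld".
VARIANT_LEXICON = {
    "werelt": "wereld",
    "weerelt": "wereld",
    "waereld": "wereld",
    "vvereld": "wereld",
    "werlt": "wereld",
}

# Hypothetical list of modern lemmas used by the fuzzy fallback.
MODERN_LEMMAS = ["wereld"]


def normalise(token, cutoff=0.7):
    """Return the modern lemma for a historical token, or None if unknown."""
    key = token.lower()
    # 1. Exact lookup in the variant lexicon.
    if key in VARIANT_LEXICON:
        return VARIANT_LEXICON[key]
    # 2. Fallback: fuzzy match against the list of modern lemmas.
    matches = get_close_matches(key, MODERN_LEMMAS, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

A real historical lexicon would cover inflected forms and many more lemmas; the fuzzy cutoff here is an arbitrary illustrative value.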

Institutional Challenge: lack of knowledge and expertise → inefficiency


Addressing these challenges: The IMPACT project
 Consortium of 26 partners
    Good mix of public and private partners
    Users, researchers and industry work together to find solutions
    Each established in a large international network
 Coordinated by the National Library of the Netherlands (KB)

   Large-scale Integrating Project
   EU funding: € 12 100 000 (FP7 ICT Work Programme)
   Start date: 1 January 2008 (extended Feb-April 2010)
   Duration: 4 years
   From 2012: sustainable Centre of Competence with alternative resources

 Currently, over 125 people across Europe, Israel and Russia involved in
  the project




    The IMPACT Vision

•   We make digitisation of historical printed
    text in Europe better, faster, cheaper




•   We provide tools, services and facilities for
    further advancement of the State of the
    Art in this field




     Five key outcomes of the project will lead to fulfilling this vision 




IMPACT Key outcomes
Diagram labels, grouped as they appear:

 IMPACT Centre of Competence in Text Digitisation
   Licensing, terms and conditions
   Tool take-up: IMPACT libraries
   Commitment: IMPACT partners
 Interoperability Framework
   Framework components
   Demonstration results
   Datasets and Ground Truth
   Evaluation tools and metrics
   Library expectations
   Requirements
 Means to build digitisation capacity
   Web Community
   Decision Support Tools
   Knowledge Base
   Helpdesk
   Learning Resources
   Training Programme
   Documentation
 Modular suite of tools for improved text recognition (mass)
   ABBYY FR 10
   IBM Adaptive OCR
   CONCERT
   Functional Extension Parser
   Lexical Resources
   Post correction modules
 Tools that significantly enhance the state of the art (SOA) on different aspects of the Text Recognition Workflow
   Image enhancement toolkit
   Segmentation toolkit
   Experimental prototypes
   Tools for lexicon building

1. A modular suite of tools and resources to improve text
recognition, ready for implementation in a mass digitisation
workflow




CONCERT: OCR correction with volunteer involvement
Functional Extension Parser: structural metadata


 2. Research prototypes that significantly advance the
 state of the art of research in text recognition




Platform
Enhancement and segmentation

3. A free and Open Source Interoperability Framework with tools and resources for evaluating and demonstrating results




Facilities for
 wrapping all tools as web services,
 creating workflows with both IMPACT- and external
   tools
 instruments and resources for demonstrating and
   evaluating results
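
As a rough illustration of the first facility, the sketch below wraps an arbitrary command-line tool behind one uniform function, which is the step that comes before exposing it as a web service. It is a minimal sketch under stated assumptions: `run_tool` and `DEMO_TOOL` are hypothetical names, the stand-in tool is a Python one-liner rather than a real OCR engine, and the actual framework exposes tools as SOAP services rather than local function calls.

```python
import subprocess
import sys


def run_tool(command, input_text, timeout=60):
    """Run a command-line tool, feed it text on stdin, return a result record."""
    completed = subprocess.run(
        command,
        input=input_text,
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # exchange str rather than bytes
        timeout=timeout,
    )
    return {
        "ok": completed.returncode == 0,
        "output": completed.stdout,
        "log": completed.stderr,
    }


# Stand-in for a real tool (assumption): a Python one-liner that upper-cases stdin.
DEMO_TOOL = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]
```

Because every tool returns the same record shape, a caller does not need to know whether the underlying program is written in C++, Java or Perl.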




Tools & Applications

   OCR (C++, C#),
   Image Processing & Lexica (DLL),
   Command Line Tools (Win/Linux),
   Java, PHP, Perl, etc.
    + 3rd party software!

    “One ring to rule them all...”


 IMPACT Interoperability Framework (IIF)




Technical Framework

    Java 6
    Apache Axis2
    Apache Tomcat
    Apache httpd (optional)

 Focus on interoperability
 Web based, platform independent
 Highly standardised (SOAP/WSDL)
 Easy deployment (e.g. hot deployment & update)
 Open source (Apache License 2.0, LGPL)
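
To make the SOAP part concrete, here is a minimal sketch of what a request to such a service could look like, built with Python's standard `xml.etree.ElementTree`. The SOAP 1.1 envelope namespace is the standard one; the service namespace `http://example.org/ocr`, the operation name and the parameter name are assumptions made up for illustration and do not describe any actual IMPACT service.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "http://example.org/ocr"  # hypothetical service namespace


def build_request(operation, params):
    """Build a SOAP 1.1 request envelope and return it as an XML string."""
    ET.register_namespace("soapenv", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    # The operation element carries one child element per parameter.
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")
```

In practice the element names and types would be read from the service's WSDL rather than written by hand.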




Infrastructure
 Clustering
  (load balancing &
  fail over)

 Monitoring (soapUI)

 Central Results
  Repository (WebDAV)

 HTTPS encryption &
  authentication




Integration: Workflows

                                    OCR workflow =
                                     data pipeline

                                    Building blocks =
                                     processing steps (nodes)

                                    Integration =
                                     interaction between nodes

                                    Collaboration with myGrid
                                        (paper@TPDL2011)
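
The three definitions above can be sketched in a few lines: each node is a function, and the workflow passes the output of one node to the next. This is a minimal sketch, assuming placeholder nodes (`binarise`, `recognise`, `correct`) that only mimic processing steps; it is not how Taverna or the project's own workflows are implemented.

```python
from functools import reduce


def binarise(page):
    """Hypothetical image-enhancement node: marks the page as binarised."""
    return {**page, "binarised": True}


def recognise(page):
    """Hypothetical OCR node: attaches placeholder text to the page."""
    return {**page, "text": "VVt Venetien den 1.Junij"}


def correct(page):
    """Hypothetical post-correction node: rewrites 'VV' as 'W'."""
    return {**page, "text": page["text"].replace("VV", "W")}


def run_workflow(nodes, page):
    """Pass the page through each node in order; each output feeds the next node."""
    return reduce(lambda data, node: node(data), nodes, page)
```

Swapping one node for another, for example a different OCR engine, changes the workflow without touching the other steps.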




Workflow management
 Web 2.0 style registry: myExperiment

 Local client: Taverna Workbench

 Web client: project website



Evaluation Framework
Diagram labels:
 Evaluation Scenarios
 Evaluation Metrics
 Evaluation Tools
 Evaluation Results
 Results (from DIA / OCR Tools)
 Ground Truth (from GT Tools)
 Image Repository
 Compatibility through one common format




Evaluation
    Tool A vs Tool B
    Tool A(v1) vs Tool A(v2)
    Workflow X (Tool A + Tool B) vs Workflow Y (Tool A + Tool C)
    Workflow X vs previously digitised material




 Users can identify optimal workflow for source material
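
A comparison such as "Tool A vs Tool B" needs a metric. The slides do not name one, so the sketch below assumes character error rate (edit distance divided by ground-truth length), a common choice for OCR evaluation; the function names are illustrative and are not the project's evaluation tools.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text, ground_truth):
    """Edit distance between OCR output and ground truth, per ground-truth character."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ocr_text, ground_truth) / len(ground_truth)


def best_tool(outputs, ground_truth):
    """Return the name of the tool whose output has the lowest error rate."""
    return min(
        outputs,
        key=lambda name: character_error_rate(outputs[name], ground_truth),
    )
```

The same comparison applies to two versions of one tool or to two complete workflows, as long as all outputs are scored against the same ground truth.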




Benefits

• Modular

• Flexible

• Transparent

• Expandable



 Potential use: Production platform, evaluation framework,
  long-term preservation system - and many more!


4. The means for building digitisation
capacity in Europe


    5. A Centre of Competence in text digitisation with a
    business model that can sustain itself for 3 years

Three Customer Segments

 Content Holders: institutions holding historical text-based content that they wish to digitise; both private and public
 Researchers: institutions and individuals engaged in research in all areas within the scope of the IMPACT project; both private and public
 Those who have products and services that they wish to make available to the Content Holders and Researchers; primarily private sector




   To be launched Q3 of 2011
   One stop shop for all content holders in Europe
   Main objective: delivering faster, better and cheaper digitisation of text
   Multi-sided platform, delivering different products and services to different customer
    segments
   Will employ a freemium business model
      offering basic products and services for free,
      charging for premium or special features
      facilitating income generation to aid sustainability

       GET INVOLVED NOW


LinkedIn group: IMPACT Improving Access to Text
 Building an online community
 Collecting feedback on IMPACT deliverables (to be incorporated in later versions)
 Discussions on topics related to digitisation, OCR & language technology


IMPACT final conference: 24-25 October 2011
Digitisation & OCR: Better, faster, cheaper
  Solutions of the IMPACT Centre of Competence and future
  challenges

 Presentation of final results of IMPACT & related research in
  the area of OCR, digitisation and language technology

 Location: The British Library, London, UK
 Registration and more news available through the IMPACT
  website
 Early bird fee until June 2011



                      www.impact-project.eu



                     Twitter: @impactocr,
                     #impactproject

More Related Content

What's hot (7)

Resume
ResumeResume
Resume
 
UAB 2011- Combining human and computational intelligence
UAB 2011- Combining human and computational intelligenceUAB 2011- Combining human and computational intelligence
UAB 2011- Combining human and computational intelligence
 
Image semantic coding using OTB
Image semantic coding using OTBImage semantic coding using OTB
Image semantic coding using OTB
 
Semantic Search Trend
Semantic Search TrendSemantic Search Trend
Semantic Search Trend
 
Interactive Knowledge Stack - A Software Architecture for Semantic Content Ma...
Interactive Knowledge Stack - A Software Architecture for Semantic Content Ma...Interactive Knowledge Stack - A Software Architecture for Semantic Content Ma...
Interactive Knowledge Stack - A Software Architecture for Semantic Content Ma...
 
A probabilistic model for recursive factorized image features
A probabilistic model for recursive factorized image featuresA probabilistic model for recursive factorized image features
A probabilistic model for recursive factorized image features
 
Ontology Integration and Interoperability (OntoIOp) – Part 1: The Distributed...
Ontology Integration and Interoperability (OntoIOp) – Part 1: The Distributed...Ontology Integration and Interoperability (OntoIOp) – Part 1: The Distributed...
Ontology Integration and Interoperability (OntoIOp) – Part 1: The Distributed...
 

Viewers also liked

Cisco academy presentation 2011-12-15
Cisco academy presentation 2011-12-15Cisco academy presentation 2011-12-15
Cisco academy presentation 2011-12-15
Nösnäsgymnasiet
 

Viewers also liked (8)

Pes grundläggande program sve hela
Pes grundläggande program sve helaPes grundläggande program sve hela
Pes grundläggande program sve hela
 
Cisco academy presentation 2011-12-15
Cisco academy presentation 2011-12-15Cisco academy presentation 2011-12-15
Cisco academy presentation 2011-12-15
 
Kompetenspartner - Alpha Competence
Kompetenspartner - Alpha CompetenceKompetenspartner - Alpha Competence
Kompetenspartner - Alpha Competence
 
Morgendagens helsearbeidere i nord, Martin Burman
Morgendagens helsearbeidere i nord, Martin BurmanMorgendagens helsearbeidere i nord, Martin Burman
Morgendagens helsearbeidere i nord, Martin Burman
 
Kompetenshöjning - Vad är det
Kompetenshöjning - Vad är detKompetenshöjning - Vad är det
Kompetenshöjning - Vad är det
 
The E-Learning Competence Center
The E-Learning Competence CenterThe E-Learning Competence Center
The E-Learning Competence Center
 
CONCERT IMPACT by Lotte Wilms
CONCERT IMPACT by Lotte WilmsCONCERT IMPACT by Lotte Wilms
CONCERT IMPACT by Lotte Wilms
 
7. Technical development at the Meertens Institute. Marc Kemps Snijders.
7. Technical development at the Meertens Institute. Marc Kemps Snijders.7. Technical development at the Meertens Institute. Marc Kemps Snijders.
7. Technical development at the Meertens Institute. Marc Kemps Snijders.
 

Similar to Impact centre of_competence_for_workshop_ocr_rouen_march_2011[1]

Livas Eurobank
Livas EurobankLivas Eurobank
Livas Eurobank
knowhowgr
 
Ow2 Open World Forum09 Trustie Project
Ow2 Open World Forum09 Trustie ProjectOw2 Open World Forum09 Trustie Project
Ow2 Open World Forum09 Trustie Project
OW2
 
Trustie Forge Solutions Linux Ow2
Trustie Forge Solutions Linux Ow2Trustie Forge Solutions Linux Ow2
Trustie Forge Solutions Linux Ow2
OW2
 
Desktop publishing units of study
Desktop publishing    units of studyDesktop publishing    units of study
Desktop publishing units of study
lc0810
 
The SENSORIA Development Environment
The SENSORIA Development EnvironmentThe SENSORIA Development Environment
The SENSORIA Development Environment
Istvan Rath
 
Scct2013 topic5-introto applicationdevelopment
Scct2013 topic5-introto applicationdevelopmentScct2013 topic5-introto applicationdevelopment
Scct2013 topic5-introto applicationdevelopment
Anies Syahieda
 

Similar to Impact centre of_competence_for_workshop_ocr_rouen_march_2011[1] (20)

Impact centre of competence hildelies balk
Impact centre of competence hildelies balkImpact centre of competence hildelies balk
Impact centre of competence hildelies balk
 
OER Rapid Innovation
OER Rapid InnovationOER Rapid Innovation
OER Rapid Innovation
 
Livas Eurobank
Livas EurobankLivas Eurobank
Livas Eurobank
 
Improve Foundations (EN)
Improve Foundations (EN)Improve Foundations (EN)
Improve Foundations (EN)
 
DashMash: a Mashup Environment for End User Development
DashMash: a Mashup Environment for End User DevelopmentDashMash: a Mashup Environment for End User Development
DashMash: a Mashup Environment for End User Development
 
Timeliner poster on CSCW 2012 conference
Timeliner poster on CSCW 2012 conferenceTimeliner poster on CSCW 2012 conference
Timeliner poster on CSCW 2012 conference
 
ECLAP Tutorial first part, ECLAP 2012 conference. the general overview
ECLAP Tutorial first part, ECLAP 2012 conference. the general overviewECLAP Tutorial first part, ECLAP 2012 conference. the general overview
ECLAP Tutorial first part, ECLAP 2012 conference. the general overview
 
2011-11-07 Open PHACTS Poster
2011-11-07 Open PHACTS Poster2011-11-07 Open PHACTS Poster
2011-11-07 Open PHACTS Poster
 
TR5 Prolifer and Post-Correction System. Ludwig Maximilians
TR5 Prolifer and Post-Correction System. Ludwig MaximiliansTR5 Prolifer and Post-Correction System. Ludwig Maximilians
TR5 Prolifer and Post-Correction System. Ludwig Maximilians
 
Ow2 Open World Forum09 Trustie Project
Ow2 Open World Forum09 Trustie ProjectOw2 Open World Forum09 Trustie Project
Ow2 Open World Forum09 Trustie Project
 
Leadership Symposium on Digital Media in Healthcare
Leadership Symposium on Digital Media in HealthcareLeadership Symposium on Digital Media in Healthcare
Leadership Symposium on Digital Media in Healthcare
 
Trustie Forge Solutions Linux Ow2
Trustie Forge Solutions Linux Ow2Trustie Forge Solutions Linux Ow2
Trustie Forge Solutions Linux Ow2
 
LEAD - Learning Design – Design For Learning -project presentation
LEAD - Learning Design – Design For Learning -project presentationLEAD - Learning Design – Design For Learning -project presentation
LEAD - Learning Design – Design For Learning -project presentation
 
Junos Space SDK - Imagination, Ideas, Innovation
Junos Space SDK - Imagination, Ideas, InnovationJunos Space SDK - Imagination, Ideas, Innovation
Junos Space SDK - Imagination, Ideas, Innovation
 
51 etna
51 etna51 etna
51 etna
 
Enterprise 2.0 Musings
Enterprise 2.0 MusingsEnterprise 2.0 Musings
Enterprise 2.0 Musings
 
Desktop publishing units of study
Desktop publishing    units of studyDesktop publishing    units of study
Desktop publishing units of study
 
IMPACT Final Conference - Hildelies Balk-Pennington de Jongh
IMPACT Final Conference - Hildelies Balk-Pennington de JonghIMPACT Final Conference - Hildelies Balk-Pennington de Jongh
IMPACT Final Conference - Hildelies Balk-Pennington de Jongh
 
The SENSORIA Development Environment
The SENSORIA Development EnvironmentThe SENSORIA Development Environment
The SENSORIA Development Environment
 
Scct2013 topic5-introto applicationdevelopment
Scct2013 topic5-introto applicationdevelopmentScct2013 topic5-introto applicationdevelopment
Scct2013 topic5-introto applicationdevelopment
 

More from IMPACT Centre of Competence

More from IMPACT Centre of Competence (20)

Session6 01.helmut schmid
Session6 01.helmut schmidSession6 01.helmut schmid
Session6 01.helmut schmid
 
Session1 03.hsian-an wang
Session1 03.hsian-an wangSession1 03.hsian-an wang
Session1 03.hsian-an wang
 
Session7 03.katrien depuydt
Session7 03.katrien depuydtSession7 03.katrien depuydt
Session7 03.katrien depuydt
 
Session7 02.peter kiraly
Session7 02.peter kiralySession7 02.peter kiraly
Session7 02.peter kiraly
 
Session6 04.giuseppe celano
Session6 04.giuseppe celanoSession6 04.giuseppe celano
Session6 04.giuseppe celano
 
Session6 03.sandra young
Session6 03.sandra youngSession6 03.sandra young
Session6 03.sandra young
 
Session6 02.jeremi ochab
Session6 02.jeremi ochabSession6 02.jeremi ochab
Session6 02.jeremi ochab
 
Session5 04.evangelos varthis
Session5 04.evangelos varthisSession5 04.evangelos varthis
Session5 04.evangelos varthis
 
Session5 03.george rehm
Session5 03.george rehmSession5 03.george rehm
Session5 03.george rehm
 
Session5 02.tom derrick
Session5 02.tom derrickSession5 02.tom derrick
Session5 02.tom derrick
 
Session5 01.rutger vankoert
Session5 01.rutger vankoertSession5 01.rutger vankoert
Session5 01.rutger vankoert
 
Session4 04.senka drobac
Session4 04.senka drobacSession4 04.senka drobac
Session4 04.senka drobac
 
Session3 04.arnau baro
Session3 04.arnau baroSession3 04.arnau baro
Session3 04.arnau baro
 
Session3 03.christian clausner
Session3 03.christian clausnerSession3 03.christian clausner
Session3 03.christian clausner
 
Session3 02.kimmo ketunnen
Session3 02.kimmo ketunnenSession3 02.kimmo ketunnen
Session3 02.kimmo ketunnen
 
Session3 01.clemens neudecker
Session3 01.clemens neudeckerSession3 01.clemens neudecker
Session3 01.clemens neudecker
 
Session2 04.ashkan ashkpour
Session2 04.ashkan ashkpourSession2 04.ashkan ashkpour
Session2 04.ashkan ashkpour
 
Session2 03.juri opitz
Session2 03.juri opitzSession2 03.juri opitz
Session2 03.juri opitz
 
Session2 02.christian reul
Session2 02.christian reulSession2 02.christian reul
Session2 02.christian reul
 
Session2 01.emad mohamed
Session2 01.emad mohamedSession2 01.emad mohamed
Session2 01.emad mohamed
 

Recently uploaded

Recently uploaded (20)

Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
 
UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
 
In-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT ProfessionalsIn-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT Professionals
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
 
ODC, Data Fabric and Architecture User Group

IMPACT Centre of Competence: workshop on OCR, Rouen, March 2011

  • 1. IMPACT Centre of Competence. Text digitisation: Faster, Better, Cheaper
      - Hildelies Balk and Clemens Neudecker, KB
      - Workshop "Recent Developments in OCR for Digital Libraries", Rouen, 31 March 2011
      - IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
  • 2. Overview of this presentation
      - Challenges to be addressed
      - The IMPACT Project and Outcomes
      - The IMPACT Centre of Competence
      - Get involved now!
      - Twitter: @impactocr, #impactproject
  • 3. KB Digital Library Programme
      - Goal: Offer everyone access to everything published in and about the Netherlands through the internet
      - 2013: 10% of the publications published in and about the Netherlands available in digital form
      - Example projects:
          Historical Newspapers – http://kranten.kb.nl
          Dutch Parliamentary Papers – http://www.statengeneraaldigitaal.nl/
          Dutch Print Online – http://www.dutchprintonline.nl/
      - Timeframe covered: 1618 - 1995
  • 4. Historical text: typical OCR results
      VVt Venetien den 1.Junij, Anno 1618. DJgn i f paffato te S' aö'Jifeert mo? üen/bah .)etgi'uotbciraetail)i.r/JtmelchontDecht te / sbnbe bele btr felbrr geiufttceert baer bnber eeniglje jprant o^fen/bie ftcb .met beSpaenfcbeu enbeeemgljen bifet Cbeiiupcen berbonbru befe
  • 5. OCR Challenges: damaged pages, bleed-through, difficult layout, historic fonts … and many more
  • 6. Language Challenges: Spelling variants, orthographical variants, inflected forms … and more
      - Historical variants of the Dutch word ‘wereld’ (world): werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled
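One way to picture how a historical lexicon helps with the spelling variants above is a lookup from attested variants to a modern lemma. The sketch below is illustrative only: the dictionary holds a handful of the variants listed on the slide, and the names `HISTORICAL_LEXICON`, `to_modern_lemma` and `expand_query` are hypothetical, not part of any IMPACT tool.

```python
# Hypothetical mini-lexicon: a few historical spellings from the slide,
# each mapped to the modern lemma "wereld".
HISTORICAL_LEXICON = {
    variant: "wereld"
    for variant in (
        "werelt", "weerelt", "weereld", "waereld",
        "vvereld", "werlt", "wareld", "werreld",
    )
}


def to_modern_lemma(token):
    """Return the modern lemma for a known variant, else the token unchanged."""
    return HISTORICAL_LEXICON.get(token.lower(), token)


def expand_query(lemma):
    """Return every spelling to search for when a user queries a modern lemma."""
    variants = {v for v, target in HISTORICAL_LEXICON.items() if target == lemma}
    return variants | {lemma}
```

A search interface could call `expand_query("wereld")` so that a query in modern spelling also retrieves pages containing the older forms.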
  • 7. Institutional Challenge: lack of knowledge and expertise → inefficiency
  • 8. Addressing these challenges: The IMPACT project
      - Consortium of 26 partners
      - Good mix of public and private partners
      - Users, researchers and industry work together to find solutions
      - Each established in a large international network
      - Coordinated by the National Library of the Netherlands (KB)
      - Large-scale Integrating Project
      - EU funding: € 12 100 000 (FP7 ICT Work Programme)
      - Start date: 1 January 2008 (extended Feb-April 2010)
      - Duration: 4 years
      - From 2012: sustainable Centre of Competence with alternative resources
      - Currently, over 125 people across Europe, Israel and Russia involved in the project
  • 10. The IMPACT Vision
      - We make digitisation of historical printed text in Europe better, faster, cheaper
      - We provide tools, services and facilities for further advancement of the State of the Art in this field
      - Five key outcomes of the project will lead to fulfilling this vision
  • 11. IMPACT Key outcomes (diagram): IMPACT Centre of Competence in Text Digitisation
      - Block headings: Licensing, terms and conditions; Tool take-up commitment; IMPACT libraries; IMPACT partners; Interoperability Framework; Means to build digitisation capacity; Tools that significantly enhance SOA on different aspects of Text Recognition; Modular suite of tools for improved (mass) text recognition
      - Component labels: Framework components; Demonstration results; Library expectations; Requirements; Datasets and Ground Truth; Evaluation tools and metrics; Web Community; Workflow Decision Support Tools; Documentation; Knowledge Base; Helpdesk; Learning Resources; Training Programme; ABBYY FR 10; Functional Extension Parser; Image enhancement toolkit; Segmentation toolkit; Lexical Resources; IBM Adaptive OCR; Experimental prototypes; CONCERT; Post correction modules; Tools for lexicon building
  • 12. 1. A modular suite of tools and resources to improve text recognition, ready for implementation in a mass digitisation workflow
      - CONCERT: OCR correction with volunteer involvement
      - Functional Extension Parser: structural metadata
  • 13. 2. Research prototypes that significantly advance the state of the art of research in text recognition
      - Platform Enhancement and segmentation
  • 14. 3. A free and Open Source Interoperability Framework with tools and resources for evaluating and demonstrating results
      Facilities for
      - wrapping all tools as web services,
      - creating workflows with both IMPACT and external tools
      - instruments and resources for demonstrating and evaluating results
  • 15. Tools & Applications
      - OCR (C++, C#),
      - Image Processing & Lexica (DLL),
      - Command Line Tools (Win/Linux),
      - Java, PHP, Perl, etc. + 3rd party software!
      “One ring to rule them all...”
      - IMPACT Interoperability Framework (IIF)
  • 16. Technical Framework
      - Java 6
      - Apache Axis2
      - Apache Tomcat
      - Apache httpd (optional)
      - Focus on interoperability
      - Web based, platform independent
      - Highly standardised (SOAP/WSDL)
      - Easy deployment (e.g. hot deployment & update)
      - Open source (Apache License 2.0, LGPL)
  • 17. Infrastructure
      - Clustering (load balancing & fail over)
      - Monitoring (soapUI)
      - Central Results Repository (WebDAV)
      - HTTPS encryption & authentication
  • 18. Integration: Workflows
      - OCR workflow = data pipeline
      - Building blocks = processing steps (nodes)
      - Integration = interaction between nodes
      - Collaboration with myGrid (paper@TPDL2011)
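The idea of an OCR workflow as a data pipeline built from processing nodes can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the IMPACT framework itself (which, per the slides, wraps tools as SOAP web services and composes them in Taverna): the node functions `enhance_image`, `segment` and `recognise` are hypothetical placeholders that only annotate a dictionary.

```python
from functools import reduce


def enhance_image(doc):
    """Hypothetical node: mark the page image as enhanced."""
    return {**doc, "enhanced": True}


def segment(doc):
    """Hypothetical node: attach placeholder layout regions."""
    return {**doc, "regions": ["region-1"]}


def recognise(doc):
    """Hypothetical node: attach placeholder recognised text."""
    return {**doc, "text": "placeholder text"}


def run_workflow(nodes, doc):
    """Pass the document through each node in order; each output feeds the next."""
    return reduce(lambda current, node: node(current), nodes, doc)
```

Calling `run_workflow([enhance_image, segment, recognise], {"image": "page.tif"})` returns the original record extended by every node. Swapping one node for another changes the workflow without touching the rest, which is the modularity the slides describe.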
  • 19. Workflow management
      - Web 2.0 style registry: myExperiment
      - Local client: Taverna Workbench
      - Web client: project website
  • 20. Evaluation Framework (diagram)
      - Labels: Evaluation Scenarios; Evaluation Metrics; Evaluation Tools; Evaluation Results; Results; Ground Truth; GT Tools; DIA / OCR Tools; Image Repository
      - Compatibility through one common format
  • 21. Evaluation
      - Tool A vs Tool B
      - Tool A(v1) vs Tool A(v2)
      - Workflow X (Tool A + Tool B) vs Workflow Y (Tool A + Tool C)
      - Workflow X vs previously digitised material
      - Users can identify optimal workflow for source material
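A comparison such as "Tool A vs Tool B" needs a metric computed against ground truth. The slides do not say which metrics the IMPACT evaluation tools use, so the sketch below assumes one common choice, character error rate based on Levenshtein edit distance. The function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text, ground_truth):
    """Edit distance divided by the ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ocr_text, ground_truth) / len(ground_truth)


def better_tool(outputs, ground_truth):
    """Return the name of the tool whose output has the lowest error rate."""
    return min(outputs, key=lambda name: character_error_rate(outputs[name], ground_truth))
```

The same function compares two versions of one tool, or two whole workflows, as long as each produces text for the same ground-truth page.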
  • 22. Benefits
      - Modular
      - Flexible
      - Transparent
      - Expandable
      - Potential use: Production platform, evaluation framework, long-term preservation system - and many more!
  • 23. 4. The means for building digitisation capacity in Europe
  • 24. 5. A Centre of Competence in text digitisation with a business model that can sustain itself for 3 years
      Three Customer Segments
      - Content Holders: institutions holding historical text-based content that they wish to digitise; both private and public
      - Researchers: institutions and individuals engaged in research in all areas within the scope of the IMPACT project; both private and public
      - A third segment: those who have products and services that they wish to make available to the Content Holders and Researchers; primarily private sector
  • 25.
      - To be launched Q3 of 2011
      - One stop shop for all content holders in Europe
      - Main objective: delivering faster, better and cheaper digitisation of text
      - Multi-sided platform, delivering different products and services to different customer segments
      - will employ a Freemium business model
      - offering basic products and services for free,
      - charging for premium or special features
      - facilitating income generation to aid sustainability
      GET INVOLVED NOW
  • 26. LinkedIn group: IMPACT Improving Access to Text
      - Building an online community
      - Collecting feedback on IMPACT deliverables (to be incorporated in later versions)
      - Discussions on topics related to digitisation, OCR & language technology
  • 27. IMPACT final conference: 24-25 October 2011
      Digitisation & OCR: Better, faster, cheaper. Solutions of the IMPACT Centre of Competence and future challenges
      - Presentation of final results of IMPACT & related research in the area of OCR, digitisation and language technology
      - Location: The British Library, London, UK
      - Registration and more news available through the IMPACT website
      - Early bird fee until June 2011
  • 28. www.impact-project.eu
      - Twitter: @impactocr, #impactproject