Wroclaw University Library presentation at "Succeed in Digitisation. Spreading Excellence" Conference. Validation and take-up of text digitisation tools.
2. Wroclaw University Library:
1. is one of the biggest academic libraries in Poland. Its collection comprises ca. 2.4 million volumes, including 0.5 million special collections items (e.g. manuscripts, old printed books, incunabula, maps, the graphic collection, the music collection);
2. is a member of IFLA, CERL, IAML and Technical Committee No. 242 (for Information and Documentation) at the Polish Committee for Standardization;
3. has participated in many research projects (European, international, national, etc.);
4. has a staff team with long-standing experience in the digitisation of printed items and in the processing and presentation of digital objects;
5. has been digitising its own physical holdings since 2000, initiated the Digital Library of University of Wroclaw (DLUW) in 2005 and, in 2013/2014, the university repository (Repository of University of Wroclaw, RUW).
Owing to an appropriate human resources development policy, purchases of optical and electronic equipment and computers (hardware and software), and participation in many projects, the Wroclaw University Library has at its disposal the experienced staff and technological base that enable it to cooperate within the framework of the Impact Centre of Competence in Digitisation.
3. Use Case and Tools
To improve the digitisation workflow in the DLUW, it was necessary to implement tools that could speed up and optimise the processes.
For that purpose two tools were tested. First, Scan Tailor was chosen as the post-processing tool for scanned pages. It performs operations such as page splitting, deskewing, adding/removing borders, etc. It was applied to raw scans and produced pages ready to be printed or assembled into PDF or DjVu files.
The second tool was Tesseract OCR, an open source OCR engine that, combined with the Leptonica Image Processing Library, can read a wide variety of image formats and convert them to text in over 60 languages.
Both tools were tested while preparing presentation versions of 12 selected old printed books (from the 16th to the 18th century), all with a single-column text layout, printed in different languages (e.g. Latin, Italian, German, Romance) and in different font types (e.g. Gothic, Roman). The aim of the tests was to work out the technological line and workflow for the digitisation, processing and presentation of good-quality delivery files in the DLUW. For the evaluation, ground truth in plain text format was used (5 pages from every selected document).
The evaluation was performed by: 1. comparing the OCR output with the ground truth and measuring the character error rate; 2. comparing the OCR output with the ground truth and measuring the word error rate; 3. comparing the OCR output from different engines.
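The two error rates named above can be sketched as follows. This is a minimal illustration based on the common definition (edit distance divided by the length of the ground truth); the slides do not say which evaluation tool or text normalisation was used, so the function names and the absence of any normalisation are assumptions.

```python
def levenshtein(ref, hyp):
    """Edit distance (insertions, deletions, substitutions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ground_truth, ocr_text):
    """Character error rate: character edits divided by ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_text) / len(ground_truth)


def wer(ground_truth, ocr_text):
    """Word error rate: word edits divided by the number of ground-truth words."""
    ref_words = ground_truth.split()
    if not ref_words:
        raise ValueError("ground truth must contain at least one word")
    return levenshtein(ref_words, ocr_text.split()) / len(ref_words)


if __name__ == "__main__":
    truth = "the quick brown fox"
    ocr = "the quick brovvn fox"
    print(f"CER: {cer(truth, ocr):.2%}  WER: {wer(truth, ocr):.2%}")
```

Because a single wrong character makes the whole word count as an error, WER is usually higher than CER for the same page, which matches the pattern in the figures reported later in these slides.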
4. Use Case and Tools
The research process was carried out on a server in the following 3 steps:
1st step – running the Scan Tailor program with default settings.
After Scan Tailor had finished processing, the operator had to inspect the results visually and manually correct the wrongly processed files.
This operation made it possible to improve the parameters of the later processing to a satisfactory level. We wanted to obtain the best-quality "post master" files for subsequent OCR processing and for aesthetic digital presentations of the originals in the DLUW.
2nd step – saving the manual corrections on the server. Only the files that had to be corrected by the operator were saved on the server. The remaining results of Scan Tailor's automatic operations were left unchanged. A dedicated web site on the server was used to support the 2nd step.
3rd step – running the Tesseract program. Beforehand, the appropriate dictionaries were selected. We used only the dictionaries that were available with the Tesseract software, and no additional training tools were applied. It turned out that small font sizes were a major problem for Tesseract. Additionally, it lacks tools for marking out the text layout precisely and separating it from graphic areas. As a result, it attempts to apply text recognition to graphical objects such as frames, floral ornaments, seals, etc.
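The 3rd step can be sketched as a small helper that assembles a Tesseract command line for one page. It assumes the standard `tesseract <image> <outputbase> -l <lang> [hocr]` invocation; the file paths, the helper name and the language code `lat` are hypothetical, and the language data actually used in the project is not stated in the slides.

```python
from pathlib import PurePosixPath


def tesseract_command(image_path, output_dir, language="lat", hocr=True):
    """Build the argument list for running Tesseract on a single page image.

    Tesseract appends the file extension itself, so the output base is given
    without one. The language code and paths are illustrative assumptions.
    """
    image = PurePosixPath(image_path)
    output_base = PurePosixPath(output_dir) / image.stem
    cmd = ["tesseract", str(image), str(output_base), "-l", language]
    if hocr:
        cmd.append("hocr")  # write hOCR (text plus coordinates) instead of plain text
    return cmd


if __name__ == "__main__":
    cmd = tesseract_command("scans/page_001.tif", "ocr_out", language="lat")
    print(" ".join(cmd))
    # To actually run it (requires Tesseract and the language data to be installed):
    # import subprocess
    # subprocess.run(cmd, check=True)
```

Keeping command construction separate from execution makes it easy to log or review what will be run on the server before a whole batch is processed.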
5. Evaluation Results
Implementing a new solution that integrates the dispersed digitisation and data processing steps can significantly decrease the costs and increase the efficiency of creating digital resources in the DLUW. The tests carried out on the Scan Tailor and Tesseract programs are of great importance for preparing and organising the technological line for data processing in the cloud. Procedures and interfaces still need to be worked out that allow our staff to support the remote processes.
Scan Tailor can carry out the following tasks automatically and efficiently: splitting master files into single pages, rotating the split pages to level the text, removing margins and artefacts, and generating files prepared for the OCR process. The only problem is the correct recognition of the text area, which means that this task cannot be completed automatically without a control step. This imperfection does not disqualify Scan Tailor, and it will be used in the WUL as an important tool in data processing.
The Tesseract program seems to be a very promising tool, and trials will certainly be made to implement it in support of the digitisation of selected types of library materials. It is essential, however, to refine and improve the quality of the document layout analysis as well as the recognition of graphical elements and small fonts.
6. Evaluation Results
The results of text recognition can be saved as "txt" or "hocr" files. An "hocr" file contains the following data: the recognised text, its location relative to the original image, and style information. These data are stored as XML-style markup in the form of an HTML or XHTML file.
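To illustrate what an "hocr" file holds, the sketch below reads the recognised words and their bounding boxes using only Python's standard library. It assumes the usual hOCR convention of `ocrx_word` spans whose `title` attribute carries a `bbox x0 y0 x1 y1` property; the sample markup and function names are made up for the example, and word spans containing nested spans are not handled.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from an hOCR title attribute, or None if absent."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collects (text, bbox) pairs for every ocrx_word span that has a bbox."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def extract_words(hocr_text):
    parser = HocrWordParser()
    parser.feed(hocr_text)
    return parser.words


# Hypothetical two-word sample in hOCR style.
SAMPLE = """<div class='ocr_page' title='bbox 0 0 2000 3000'>
<span class='ocr_line' title='bbox 100 200 900 260'>
<span class='ocrx_word' title='bbox 100 200 300 260; x_wconf 91'>Liber</span>
<span class='ocrx_word' title='bbox 320 200 560 260; x_wconf 88'>primus</span>
</span></div>"""

if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE):
        print(text, bbox)
```

The coordinates are what make it possible to place a searchable text layer over the page image when hybrid PDF or DjVu publications are generated.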
Taking into account the needs of the archiving process, "hocr" files seem to be a good storage format. Each "hocr" file is assigned to a specific image file. This makes it possible to correct individual pages of a document, so the correction process can be organised more flexibly. Hybrid publications (PDF, DjVu) can be created automatically by the server. "hocr" files can serve as a basis for the further preparation of electronic publications. We see the potential of this solution and intend to use the tools created during the project in the near future.
Additionally, while carrying out another project on the processing of 19th-century newspapers printed in Gothic fonts, we obtained very satisfactory OCR results with Tesseract on objects converted to a 1-bit (black/white) version (see http://www.bibliotekacyfrowa.pl/publication/59368).
We also repeated the recognition of samples of object 319708 using the prepared monochrome (1-bit) files. The Tesseract results were CER 7.80% and WER 19.67%, compared with the Tesseract results from our final report of CER 20.58% and WER 35.56%.
So it turned out that creating a good black-and-white image is an essential step that has a very positive effect on the OCR results.
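The slides do not say how the 1-bit images were produced. As one common way to do it, the sketch below uses Otsu's method, which is an assumption here and not the project's documented procedure: it picks the grey level that best separates dark ink from light paper and then maps every pixel to black or white. Pixels are represented as a plain list of 0-255 grey values to keep the example free of dependencies.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximises between-class variance.

    Pixels at or below the returned level are treated as dark (ink).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue      # no dark pixels yet
        w_light = total - w_dark
        if w_light == 0:
            break         # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Map grey values to 1-bit: 0 = black (ink), 1 = white (paper)."""
    return [1 if p > threshold else 0 for p in pixels]


if __name__ == "__main__":
    # Hypothetical grey values: three ink pixels and three paper pixels.
    page = [10, 12, 15, 200, 210, 220]
    t = otsu_threshold(page)
    print(t, binarize(page, t))
```

In practice an image library would supply the pixel data and a global threshold may not suit unevenly lit or stained pages, so this is only meant to show the principle behind producing a clean black-and-white input for OCR.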