An Integrated System For Generating And Correcting Polytonic Greek OCR
Federico Boschetti (CNR, Pisa) and Bruce Robertson (Mount Allison University, Canada)
Digital Classicist London & Institute of Classical Studies seminar 2013
Friday July 19th at 16:30, in Room S264, Senate House, Malet Street, London WC1E 7HU
In many fields, the digital books revolution provides wide and highly detailed access to pertinent texts, but this revolution has left behind scholars working with ancient Greek. While it is true that Hellenists have had digitized canonical texts for many years, these collections' relatively limited scope and restrictive licenses are increasingly at odds with recent currents in computer-based humanities research: linked data, large-scale text mining, and syntactic treebanking, to name a few. Perhaps the most important impediments to digitizing polytonic Greek have been the lack of high-quality optical character recognition for this script, especially under open-source licenses, and of an assisted editor for polytonic Greek OCR output. In this seminar, we present an integrated system that fills these critical gaps, making it possible for polytonic Greek texts to be digitized en masse.
Rigaudon OCR is a complete suite of the scripts, Python code, and data required for producing polytonic Greek OCR. It comprises an OCR engine based on Gamera, with many features specific to the recognition of polytonic Greek and dedicated classifiers to identify the characters of Teubner, Teubner sans-serif, OCT/Loeb, and Didot editions. It includes an automatic spellchecker designed to correct Greek OCR errors, and it provides a process for combining existing, high-quality Latin-script OCR output with parallel Greek output, as illustrated by a papyrological text. Finally, it coordinates these processes through the Sun Grid Engine scripts required to queue and parallelize them.
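To give a flavour of the spellchecking step, the following is a minimal, hypothetical sketch (not Rigaudon's actual code) of dictionary-based OCR post-correction: each OCR token is replaced by the nearest entry in a Greek wordlist, measured by Levenshtein edit distance, if it falls within a small edit-distance threshold. The function names and the toy lexicon are illustrative assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def correct(token: str, lexicon: list[str], max_dist: int = 2) -> str:
    """Return the closest lexicon entry within max_dist edits,
    or the token unchanged if nothing is close enough."""
    best, best_d = token, max_dist + 1
    for word in lexicon:
        d = levenshtein(token, word)
        if d < best_d:
            best, best_d = word, d
    return best

# Toy polytonic-Greek lexicon; a real system would load a full wordlist.
lexicon = ["λόγος", "λόγου", "ἄνθρωπος", "θεός"]
print(correct("λόγοξ", lexicon))  # ξ for ς is a plausible OCR confusion
```

A production corrector would also weight substitutions by known confusion pairs of the OCR engine (e.g. breathing and accent marks), rather than treating all edits as equally likely.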
CapTechTalks Webinar Slides June 2024 Donovan Wright.pptxCapitolTechU
Slides from a Capitol Technology University webinar held June 20, 2024. The webinar featured Dr. Donovan Wright, presenting on the Department of Defense Digital Transformation.
Digital Classicist London Seminars 2013 - Seminar 7 - Federico Boschetti & Bruce Robertson
1. An Integrated System for Polytonic Greek OCR
I. Generating the Data
Bruce Robertson, Dept. of Classics, Mount Allison University, New Brunswick, Canada
Digital Classicist Seminar, Institute of Classical Studies, London, UK, July 19, 2013
3. Why Ancient Greek OCR?
1. Rapid digitization of Greek texts not yet in digital libraries
2. Study of textual variants and app. crit.
3. Text reuse analysis
4. General-purpose OCR search, like Google Books
7. Use Manual Editing? Automatic Spell-checking?

Task              Manual Editing?   Automatic Spell-checking?
Digitization      ✓                 ✓
Textual Variants  ✗                 ✗
Text Reuse        ✗                 ✓
OCR Search        ✗                 ✗ or ✓
21. Contextless 'Greekness' Index
● Devised by Dr. Boschetti
● Based on dictionary and likely sequences of letters, etc.
● Named 'B-score' in these slides
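The slides do not give the B-score formula; as a rough illustration only, a contextless 'Greekness' score might combine dictionary membership with the plausibility of letter sequences. Everything below (the tiny LEXICON and BIGRAMS sets, the scoring rule itself) is hypothetical:

```python
# Hypothetical sketch of a contextless 'Greekness' score (the "B-score").
# The real formula is not given in the slides; this version simply
# combines dictionary membership with letter-pair plausibility.

LEXICON = {"λόγος", "καί", "ἀνήρ"}              # known word forms (toy data)
BIGRAMS = {"λό", "όγ", "γο", "ος", "κα", "αί"}  # frequent letter pairs (toy data)

def b_score(word: str) -> float:
    """Return a score in [0, 1]: 1.0 for dictionary words, otherwise
    the fraction of the word's letter pairs that look like Greek."""
    if word in LEXICON:
        return 1.0
    pairs = [word[i:i + 2] for i in range(len(word) - 1)]
    if not pairs:
        return 0.0
    return sum(p in BIGRAMS for p in pairs) / len(pairs)

print(b_score("λόγος"))  # dictionary word -> 1.0
print(b_score("xq"))     # implausible sequence -> 0.0
```

A real score would be trained on corpus frequencies rather than hand-picked sets, but the shape of the computation is the same: reward strings that look like attested Greek.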
22. Archive.org
● Provides:
○ Thousands of volumes rendered in high-resolution (400 ppi+) colour images
○ OCR results from ABBYY Finereader
■ Excellent Latin-script recognition
■ Poor Greek results
■ Top-quality line-segmentation
23. Open-source OCR Engines
● Gamera
○ Current focus of my team
● Tesseract
○ Nick White has worked extensively on this to generate good results
● OCRopus
○ Dr. Boschetti has recently been able to use Tesseract training sets for this engine
26. The Rigaudon Greek OCR Process
[Flowchart; recoverable stages:]
● Input: JP2 page images from Archive.org, together with the Archive's ABBYY OCR information file; an ABBYY-to-HOCR conversion supplies the page segmentation
● OCR: Gamera 3.3.3 image recognition with the Greek OCR classifiers for Gamera (Teubner Sans, Teubner Serif, Oxford/Loeb, etc.), run as a parallel process on 35 cores, producing HOCR results at a range of binarization thresholds
● 'Blending': Boschetti scoring builds a score table for the binarization thresholds; the highest-ranking binarization page is selected, and its non-dictionary words are replaced with dictionary words from the other binarization pages
● Automatic OCR spellcheck (on 14 cores): pages are reduced to unique Greek strings and compared, by weighted Levenshtein distance with a weighted edit table for the classifier, against a 410,000-word dictionary from open Perseus Greek texts; the resulting per-volume spellcheck table drives the replacement of spellchecked words
● Combining: if the ABBYY OCR file contains Latin-script output, Latin-script output words are replaced with the words in the same position from the Archive's ABBYY output
● Output: final HOCR
27. Raw HOCR Production Using Gamera
● A plugin for Gamera OCR allows it to import high-quality line-segmentation information, compensating for Gamera's poor results in this critical function
● A plugin outputs HOCR
● A wrapper function generates a range of output pages based on binarization threshold (typically 10 - 20 per page)
28. HOCR 'Blending'
● This step aims to gather, word by word, the 'best' results from the range of result pages for each image
● It selects the highest-scoring result page overall
● Where a Greek word in this page is not in the dictionary and another page has a dictionary word in the exact same physical location, it replaces the word with the dictionary word
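The word-by-word replacement above can be sketched as follows, assuming (hypothetically) that each page is a list of (bounding box, text) word entries and that the highest-scoring page has already been chosen. Real HOCR parsing and page scoring are omitted; the dictionary is toy data:

```python
# Sketch of the HOCR 'blending' replacement: starting from the
# best-scoring binarization page, swap in a dictionary word found at
# the exact same physical location on another page.

DICTIONARY = {"λόγος", "καί"}

def blend(best_page, other_pages, dictionary=DICTIONARY):
    blended = []
    for bbox, text in best_page:
        if text not in dictionary:
            # look for a dictionary word at the same physical location
            for page in other_pages:
                replacement = next((t for b, t in page
                                    if b == bbox and t in dictionary), None)
                if replacement is not None:
                    text = replacement
                    break
        blended.append((bbox, text))
    return blended

best = [((0, 0, 50, 20), "λόγoς")]       # OCR error: Latin 'o'
others = [[((0, 0, 50, 20), "λόγος")]]   # another threshold got it right
print(blend(best, others))               # -> [((0, 0, 50, 20), 'λόγος')]
```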
29. Automatic Spellcheck
● All pages in the volume are reduced to a set of unique, decomposed Greek strings
● These are compared to the dictionary using Levenshtein distances
● A 'weighting table', suitable for a given font, indicates which edits are preferable or allowed
● The result is a 'light' correction, esp. of diacritics
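The weighted comparison can be sketched as a standard Levenshtein dynamic program whose substitution cost comes from a per-font weighting table; the table entries below are illustrative, not Rigaudon's actual weights:

```python
# Sketch of the weighted-Levenshtein comparison: a per-font weighting
# table makes typical OCR confusions cheap, so 'light' corrections
# (especially of diacritics) rank first.

WEIGHTS = {("ό", "ο"): 0.1, ("ο", "ό"): 0.1}  # accent confusions are cheap

def weighted_levenshtein(a: str, b: str) -> float:
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = float(i)
    for j in range(n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else WEIGHTS.get((a[i - 1], b[j - 1]), 1.0)
            d[i][j] = min(d[i - 1][j] + 1.0,      # deletion
                          d[i][j - 1] + 1.0,      # insertion
                          d[i - 1][j - 1] + sub)  # (weighted) substitution
    return d[m][n]

print(weighted_levenshtein("χόρος", "χορος"))  # only an accent differs -> 0.1
```

With cheap accent edits, a dictionary word one diacritic away outranks candidates requiring 'heavier' character changes.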
31. Optionally injecting Greek into the Original Latin HOCR
● We don't want to try to get excellent Greek and Latin results, esp. when ABBYY and others do a better job with Latin
● In the case that archive.org provides Latin OCR:
○ If Rigaudon's output word is Greek, replace archive.org's ABBYY output word with Rigaudon's
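The injection rule can be sketched as follows; the word lists are assumed to be already position-aligned, and detecting 'Greek' by Unicode range is an illustrative simplification:

```python
# Sketch of the optional Greek-injection step: keep the Archive's
# ABBYY word unless Rigaudon produced a Greek word at that position.

def is_greek(word: str) -> bool:
    # Greek and Coptic block plus Greek Extended (polytonic) block
    return any('\u0370' <= ch <= '\u03ff' or '\u1f00' <= ch <= '\u1fff'
               for ch in word)

def inject_greek(abbyy_words, rigaudon_words):
    return [r if is_greek(r) else a
            for a, r in zip(abbyy_words, rigaudon_words)]

print(inject_greek(["edition", "text"], ["edltion", "λόγος"]))
# -> ['edition', 'λόγος']
```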
37. Multiple OCR Engines
● Take ABBYY data out of the process
○ With 'cleaning', Tesseract's line-segmentation is often as good
● Use Nick White's general-purpose polytonic classifier and ones specifically designed for a font
[Flowchart: Tesseract, Gamera or OCRopus provide line segmentation and OCR; HOCR results at a range of binarization thresholds are scored (Boschetti scoring) into a score table, the highest-ranking binarization page is selected, non-dictionary words are replaced with dictionary words from other binarization pages, and the results are 'blended' into HOCR.]
39. An Integrated System for Generating and Correcting Polytonic Greek OCR
Bruce Robertson and Federico Boschetti
Part II: The Proof-reading Process
Federico Boschetti (ILC-CNR, Pisa), federico.boschetti@ilc.cnr.it
Digital Classicist Seminars – London, 19 July 2013
Federico Boschetti Generating and Correcting Polytonic Greek OCR 1/ 20
40. Information Aggregation
Proof-reader Web Application
False positives
Introduction
Manual corrections on OCR output may be performed by:
Experts: classicists devoted to proof-reading for a long-term project
Data entry firms: professional proof-readers not skilled in the target language(s)
Crowd sourcing: students who are learning the target language(s)
Random volunteers: people with heterogeneous education and skills
41. Introduction
For this reason, proof-reading tools focused on ancient languages should be:
centralized
easy to use
based on line-by-line image / text comparison
optimized to draw attention to possible errors, distinguished by category
able to provide the most probable correction efficiently
42. Overview
1 Information Aggregation
Enriched hocr files
Alignment with other editions
False negatives
2 Proof-reader Web Application
3 False positives
43. Enriched hocr files
OCR output is formatted in the hocr microformat
The hocr output produced by Rigaudon is post-processed in order to add information managed by the Proof-reading Web Application
Multiple sources:
Dictionaries with and without diacritics
Multiple editions of the same work (if available)
Syllabic repertory
44. Dictionaries
In order to identify possible errors and provide good suggestions to correct them, the OCR output is spell-checked and the potential errors are processed step by step
The spell-checker is based on dictionaries generated from Perseus' text collection. An upper-case dictionary is used to evaluate whether a character sequence is a word with a wrong accent or breathing mark
45. Alignment with other editions
When another edition of the same work is available, the two editions are aligned word by word by applying the Needleman-Wunsch algorithm:
ὁ Γαδαρεὺς ἐν ταῖς Χάρισιν ἐπιγραφομέναις ἔφη τὸν ῞Ομηρον Σύρον ὄντα τὸ
| | | | | | | | | | | | |
ὁ Γαδαρεὺς ἐν τ αχς Σάρισιν ἐπιγραφομέναις ἔφη τὸν ῞Ομeρον Σύρον ὄντα τὸ
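The word-by-word alignment can be sketched with a standard Needleman-Wunsch implementation over word lists; the match / mismatch / gap scores are illustrative, since the slides do not specify them:

```python
# Sketch of word-by-word alignment of two editions with the standard
# Needleman-Wunsch global-alignment algorithm.

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    m, n = len(a), len(b)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        score[i][0] = i * gap
    for j in range(n + 1):
        score[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,  # align two words
                              score[i - 1][j] + gap,    # gap in b
                              score[i][j - 1] + gap)    # gap in a
    # traceback: recover the aligned word pairs (None marks a gap)
    pairs, i, j = [], m, n
    while i > 0 and j > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            pairs.append((a[i - 1], b[j - 1])); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None)); i -= 1
        else:
            pairs.append((None, b[j - 1])); j -= 1
    while i > 0:
        pairs.append((a[i - 1], None)); i -= 1
    while j > 0:
        pairs.append((None, b[j - 1])); j -= 1
    return pairs[::-1]

edition_a = "ἔφη τὸν ῞Ομηρον Σύρον".split()
edition_b = "ἔφη τὸν ῞Ομeρον Σύρον".split()  # OCR error in one word
for pair in needleman_wunsch(edition_a, edition_b):
    print(pair)
```

Aligning at the word level pairs each OCR reading with the corresponding word of the other edition, so a mismatched pair like (῞Ομηρον, ῞Ομeρον) can be offered as a correction candidate.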
46. False negatives and the risk of digital contaminatio
An example:
Running on the Anecdota Graeca edited by Cramer, Rigaudon recognizes the word χόρος, which is rejected by the current spellchecker
The spell-checker suggests χορός as a correction
The alignment with Koster's edition of the Prolegomena de comoedia also suggests χορός
But the page image contains χόρος, a late form attested from Athenaeus to the Byzantine period
47. Syllabication
In order to prevent false negatives due to (rare) variants ignored by the dictionaries, the system distinguishes between random character sequences and well-formed syllabic sequences
Each potential error is divided into syllables and each syllable is evaluated according to its position
For example, χό-ρος is a well-formed syllabic sequence: χό- is a valid initial Greek syllable and -ρος is a valid final Greek syllable
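The positional syllable check can be sketched as follows; the tiny INITIAL and FINAL repertories are hypothetical stand-ins for the full syllabic repertory, and medial positions are omitted for brevity:

```python
# Sketch of the positional syllable check: a candidate error is kept
# as a possible rare variant if its syllables are attested in their
# positions, rather than being a random character sequence.

INITIAL = {"χό", "λό", "κα"}   # attested word-initial syllables (toy data)
FINAL = {"ρος", "γος", "τον"}  # attested word-final syllables (toy data)

def well_formed(syllables) -> bool:
    """A sequence is well formed if its first syllable is a valid
    initial syllable and its last a valid final one."""
    return (len(syllables) >= 2
            and syllables[0] in INITIAL
            and syllables[-1] in FINAL)

print(well_formed(["χό", "ρος"]))  # χό-ρος -> True
print(well_formed(["ρς", "χό"]))   # random sequence -> False
```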
48. Overview
1 Information Aggregation
2 Proof-reader Web Application
The web interface
Cues
Self-corrections
3 False positives
49. Centralization
The proof-reader is a web application inspired by the Mozilla hocr Editor interface but employing the WikiSource collaborative philosophy
Texts are stored in a central native XML database
50. The Control Panel
51. Image / Text Pairs
52. Cues
Wrong accents and breathing marks: attention is focused on diacritics
Self-corrections: special care is necessary to avoid the risk of contaminatio
Errors: random errors
53. Example
54. Self-corrections and suggestions generated by alignment
In a self-correction, the reading has been substituted by the aligned word of another edition. Self-corrections require three conditions:
the character sequence is rejected by the spell-checker
the edit distance between the character sequence and the aligned edition is very small
the character sequence is not a well-formed syllabic sequence
55. Example
56. Dynamic Dictionaries
Dictionaries used by the spell-checker are dynamically rebuilt when a milestone in proof-reading is reached
By enlarging the dictionaries, rare variants are acquired and used to spell-check the next works
57. Overview
1 Information Aggregation
2 Proof-reader Web Application
3 False positives
58. False positives are deceitful
By definition, false positives pass the spell-checking
Especially when they are graphically similar to the correct word, such as δ and ὁ in Greek or m and ni in Latin, they are difficult to spot, in particular for proof-readers not skilled in the target language(s)
61. Semantic Distance
Semantic distance is calculated along the nodes of WordNet's hierarchy, i.e. along the chain of hyponyms / hypernyms, in order to reach co-hyponyms
Different translations of the same concept (e.g. vis in Latin and efficacia in Italian or efficacy in English) have a semantic distance equal to zero
Semantically unrelated words (e.g. vinum in Latin and efficacia in Italian) have a large semantic distance
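A minimal sketch of path-based semantic distance over a hypernym hierarchy; the toy hierarchy below is hypothetical, and a real implementation would walk synsets of the Ancient Greek and Latin WordNets rather than bare strings:

```python
# Sketch of path-based semantic distance: count the edges between two
# synsets through their lowest common hypernym; translations mapped to
# the same synset are at distance zero.

HYPERNYM = {                 # child synset -> parent (hypernym), toy data
    "wine": "beverage",
    "beverage": "substance",
    "efficacy": "quality",
}

def path_to_root(synset):
    path = [synset]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path

def semantic_distance(a, b):
    """0 for the same synset, edge count through the lowest common
    hypernym otherwise, infinity if no hypernym is shared."""
    pa, pb = path_to_root(a), path_to_root(b)
    for i, node in enumerate(pa):
        if node in pb:
            return i + pb.index(node)
    return float("inf")

print(semantic_distance("efficacy", "efficacy"))  # same synset -> 0
print(semantic_distance("wine", "efficacy"))      # unrelated -> inf
```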
62. AncientWordNet
Synsets of AncientGreekWordNet and LatinWordNet have been extracted from bilingual dictionaries
They are aligned to modern languages such as English, Italian, etc.
63. Conclusion
The proof-reading Web Application brings together the main features of the individual and collaborative proof-reading tools currently available
The entire work-flow is circular: training OCR - performing OCR - spell-checking OCR - correcting OCR - enlarging dictionaries - retraining OCR
64. Thank you for your attention