This document summarizes research on analyzing and extracting information from scanned historical documents. It discusses developing techniques for layout analysis, handwriting recognition, and information retrieval on datasets of manuscripts from the 8th to 18th centuries. The techniques aim to recognize page elements, extract text from images, and retrieve relevant text despite errors from the recognition systems. Evaluation shows the approaches can analyze layout with 8% error and recognize text with up to 18% word error for certain datasets. The methods aim to support applications like computer-assisted paleography and retrieval on historical collections.
MyManuskrip is a digital library of Malay manuscripts created using the open source Greenstone software. It aims to provide collaborative access to manuscripts from various repositories in Malaysia and abroad. The digital library contains over 166 titles across 5 collections and uses Dublin Core metadata. While it does not have a controlled vocabulary, manuscripts are categorized based on general subject areas like religion, language, and history. The digital library meets definitions of being an electronic set of resources for creating, searching and using digital information as well as an organized collection of digitized materials accessible over a network.
Knowledge from manuscript to virtual reality-its processing-a journey | Sarika Sawant
Poster Presentation: NAAC sponsored National conference on “Strategies for Implementing Best Practices in Teaching –Learning and Evaluation” on 2nd and 3rd March 2016
The document discusses the documentation and digitization of woodcarving artifacts from the collection of Nik Rashiddin Nik Hussein, a renowned Malaysian woodcarver. The collection includes over 350 woodcarving artifacts, 300 keris, and thousands of drawings, photographs, and books documenting Malay woodcarving traditions. The ATMA project aims to document the collection's artifacts, create a digital database, and provide online access to preserve Malay cultural heritage and facilitate research on woodcarving.
The Library as a Digital Research infrastructure: Digital Initiatives and Dig... | lorna_hughes
Memory institutions have built up expertise and taken the lead in all aspects of digital humanities, especially the development and implementation of digital methods for the capture, analysis and dissemination of archives and special collections, including manuscripts. In recent years, these initiatives have become embedded into Digital Humanities Initiatives, Centres and Programmes within research libraries, adding value to the existing relationships between libraries and scholarly initiatives. These activities have fostered the development of new projects that bring into collaboration the skills and expertise of academics, librarians, and digital humanists, making the Library increasingly a “digital research infrastructure”. This presentation will discuss these developments based on the experience of the Research Programme in Digital Collections at the National Library of Wales, specifically discussing some recent experimentation with new methods for manuscript digitization and dissemination, including hyperspectral digitization of the Library’s Chaucer manuscripts. The presentation will also discuss the wider embedding of this work within the European Digital Humanities Context, through collaborations with the ESF Research Network Programme NeDiMAH (Network for Digital Methods in the Arts and Humanities).
From OBO to OWL and back - building scalable ontologies | dosumis
This document provides an introduction to converting ontologies between the OBO format and the OWL format. It discusses the benefits of using OWL, including taking advantage of reasoning and automated classification. It also introduces Oort, a tool for generating OBO files that do not require reasoning from ontologies that do. The document then provides a tutorial on building ontologies, including maintaining multiple classification schemes, using relationships to specify necessary and sufficient conditions for class membership, and using error messages to identify issues.
This document discusses using semantic web technologies to enhance digital libraries. It describes how ontologies like MarcOnt can lift legacy metadata into a semantic format to improve search and interoperability. The JeromeDL project is presented as a case study that uses MarcOnt and other ontologies to power semantic search and sharing features for bibliographic descriptions. Semantic technologies allow digital libraries to better integrate information and provide more robust, user-friendly search interfaces.
Slides from the Introduction and Theoretical Foundations of New Media course of the Interactive Media and Knowledge Environments master program (Tallinn University).
The document discusses the preservation and conservation of manuscripts in India. It outlines various initiatives taken by the National Mission for Manuscripts and National Archives of India to locate, catalog, conserve, and provide access to manuscripts. Digitization is highlighted as a key process to preserve manuscripts. Various preservation challenges for libraries like adverse environmental conditions and biological pests are also mentioned. The document emphasizes the importance of preserving manuscripts to protect India's cultural heritage.
The company TOTAL DOCUMENT SOLUTION offers comprehensive document management services with advanced technology to 15 potential clients, generating benefits such as better control, administration, and disposition of physical and digital documents. The company has experience working with large companies in Colombia.
This document presents a catalog of free educational software for Panama. It briefly explains what free software is and why its use in education is important, and then describes several free operating systems and educational programs for different levels, including preschool, primary, lower secondary (premedia), and upper secondary (media). Finally, it provides details on 43 free educational programs categorized by level and subject area.
1. The professor taught several architecture and urban planning courses at the Universidade de Passo Fundo, such as architectural design, urban design, and landscape design.
2. She also supervised on-site construction internships and advised TCC (final undergraduate project) research.
3. Additionally, she taught a training course in civil construction at the Colégio Agrícola de Frederico Westphalen.
▪ Patients undergoing prolonged sedation for gastrointestinal and cardiac issues were studied.
▪ Morphine and midazolam were the most commonly used and highest dose drugs for sedation.
▪ Higher average daily doses of morphine and midazolam during sedation were strongly correlated with more neuroradiological abnormalities on brain MRIs in infants under 12 months old. Days of sedation and anesthesia events did not correlate with MRI findings.
▪ Prolonged sedation with opioids and benzodiazepines may negatively impact brain development in infants, warranting further research.
This document discusses personal development and health (PDH) education for children and how it teaches them about safe and appropriate behaviors, building important decision making abilities. As a result, students learn to act responsibly and contribute positively to society. The document is repeated with the same text and number "Van Nong 17221304".
SlideShare is a website that lets users upload and share PowerPoint slide presentations, Word documents, or other formats publicly or privately. It offers the advantage of being able to give talks without carrying the presentation along, since it can be viewed from any computer simply by opening a web page, and it makes sharing work with others easier than by email. However, PowerPoint presentations are a limited format without additional explanations and do not allow combin
This document summarizes articles from an HR e-bulletin published by ImaginativeHR in December 2015.
The first article discusses managing culture clashes during mergers and acquisitions, noting that differences in corporate cultures often lead to integration challenges. Honesty about cultural fit is important early on. Successful integrations require defining a future culture and implementing plans to encourage behaviors that support the new culture.
The second article discusses ensuring consistency in international outplacement services. While support has expanded globally at different rates, drawing on local expertise leads to the most effective outcomes. ImaginativeHR delivers career transition support internationally by working with local experts and providing centralized support.
The third article discusses developing coaching cultures in corporations to
Tata Destination 150 Value Homes Noida Expressway Floor Plan Price List Locat... | Aliva Kar
Tata Value Homes is developing a new smart community project called Destination 150 in Noida, India. The 20-acre project will include over 6,000 homes across 6 towers with modern amenities like a 25,000 square foot clubhouse, swimming pools, gardens, and retail space. Destination 150 is located near key transportation routes with access to schools, hospitals, and entertainment in Noida. The homes range from 2-3 bedrooms and will incorporate smart home technologies for security, parking, and other conveniences.
This document discusses various aspects of subject cataloguing including:
1. It defines subject cataloguing as showing documents on specific subjects possessed by a library and bringing together entries on a subject.
2. It outlines different types of subject catalogues and the objectives of subject entries/cataloguing.
3. It discusses principles of subject entries, problems in deriving subject entries, and methods of subject analysis.
To work with a company that will assign challenging projects and help advance his career. Currently working as an SEO Executive since 2015; responsibilities include monitoring and increasing web traffic through keyword research, on-page and off-page optimization, content analysis, social media marketing, link building, and reporting on search engine rankings. Previously worked as a Process Associate from 2014 to 2015, which involved answering calls and scheduling sales appointments. Holds an MCA and seeks to leverage technical skills in C, C++, Java, Oracle, and web technologies.
Digibury: Martin Jewiss - Colour, Creativity and Running Away | LizzieHodgson
In his talk Colour, Creativity, and Running Away, Designer Martin Jewiss explores the impact colour has on psychological function. Based on recent research, Martin presents his development of a code environment colour palette to help designers and developers improve their creativity and productivity.
Babak Rasolzadeh: The importance of entities | Zoltan Varju
Meltwater is a Business Intelligence company of +1000 individuals spread across ~60 offices in ~30 countries with over 26,000 clients. At Meltwater we see ourselves as an Outside Insights company, meaning we seek to deliver a similar type of business analytics & insights as traditional CRM dashboards and ERP systems used to, except that by leveraging data outside the firewall (social media, news, blogs etc.) we believe the insights can be much more decisive and predictive for our clients' business. Part of the challenge with this is of course structuring the unstructured data out there. This is why the Data Science team at Meltwater has the mission to ingest, categorize, label, classify, and apply a whole range of other enrichments to the content that we crawl in order to index it properly in our big data architecture and make it available for our insights dashboard. We do these enrichments in +17 languages.
Babak Rasolzadeh is the Director of Data Science & NLP at Meltwater and leads a team of 24 engineers. Prior to Meltwater, Babak was the co-founder of OculusAI, a computer vision start-up in Sweden that was sold to Meltwater in 2013. He holds a PhD in Computer Vision from KTH in Sweden, and has worked on things ranging from self-driving cars to humanoid robots and mobile object recognition. He is an advisor for several startups in the US and Sweden.
Understanding natural language processing | jbene mourad
Natural language processing (NLP) is a field that uses computer science techniques to understand and work with human languages. NLP involves preprocessing text through steps like normalization, tokenization, removing stop words, stemming and lemmatization. It represents text numerically using methods like bag-of-words and word embeddings. Sequence modeling with RNNs, GRUs and LSTMs is used for tasks like machine translation and conversational AI. NLP has many applications including machine translation, conversational assistants and analyzing large amounts of text data.
This document discusses using machine learning techniques like neural networks to help decipher ancient scripts and languages. It describes how character-level sequence-to-sequence models can be used to identify cognates between related languages. Additional techniques like network flows and dynamic programming are used to model monotonic character alignments and jointly segment and match tokens between known and unknown languages. The approaches are able to identify cognates between languages like Ugaritic and Hebrew as well as segment and match the unknown Iberian language. Neural models that incorporate linguistic features like phonological embeddings are shown to improve decipherment performance.
Adnan: Introduction to Natural Language Processing Mustafa Jarrar
This document provides an introduction to natural language processing (NLP). It discusses key topics in NLP including languages and intelligence, the goals of NLP, applications of NLP, and general themes in NLP like ambiguity in language and statistical vs rule-based methods. The document also previews specific NLP techniques that will be covered like part-of-speech tagging, parsing, grammar induction, and finite state analysis. Empirical approaches to NLP are discussed including analyzing word frequencies in corpora and addressing data sparseness issues.
Transformers to Learn Hierarchical Contexts in Multiparty Dialogue for Span-b... | Jinho Choi
This document summarizes a research paper that proposes a new transformer model for span-based question answering on dialogue transcripts. The model is pretrained on tasks like masked language modeling at the token and utterance level, as well as utterance order prediction, using the Friends TV show transcript corpus. It is then fine-tuned jointly on two tasks: utterance ID prediction and token span prediction. Evaluation on the FriendsQA dataset shows the proposed model outperforms BERT and RoBERTa baselines. However, analysis finds the model still struggles with inference in dialogues and representing speakers.
NOVA Data Science Meetup 1/19/2017 - Presentation 2 | NOVA DATASCIENCE
This document provides an overview of statistical natural language processing (NLP). It begins with introducing the speaker, Mona Diab, and their research interests in NLP. It then discusses the growing amount of digital data being produced and the potential for machines to process and understand human language. However, language is complex with ambiguity, and good NLP solutions require both linguistic and machine learning knowledge. The document outlines some of the goals and challenges of NLP, including resolving ambiguity, and provides examples of NLP applications and techniques like probabilistic models built from language data.
Aibdconference chat bot for every product Maksym Volchenko | Olga Zinkevych
This document discusses conversational interfaces and chatbots. It begins with an introduction to the author's background and experience in artificial intelligence and as an Android developer. It then discusses why conversational interfaces are becoming more popular as people prefer interacting with products and services through human conversation versus many separate apps. Chatbots provide a cross-platform solution for this using natural language processing. The document defines key terms like bots, chatbots, virtual assistants and describes common NLP techniques. It provides examples of chatbot architectures and development tools like API.ai and discusses analytics. It concludes that while conversing with bots is interesting, human interaction is more meaningful.
A text extraction workshop delivered by Cameron Buckner on Friday, October 18th, 2012 as part of the University of Houston Digital Humanities Initiative.
6. For adjusting and testing the approach, a dataset was created
First comprehensible, publicly available research database for CS
Three databases based on extracts of three manuscripts
1. Saint Gall DB: Abbey Library of St. Gall, Cod. Sang. 562, Carolingian script, Latin, 9th century (60 pages, 30 for learning).
2. Parzival DB: Abbey Library of St. Gall, Cod. Sang. 857, Gothic script, Middle High German, 13th century (47 pages, 23 for learning).
3. George Washington DB: Library of Congress, G. W. Papers, longhand, English, 18th century (20 pages, 10 for learning).
15. Modern English
Scanned printed text
5% & 20% char error rates
In withdrawing the riskless
principal mark‐up
disclosure proposal in the
1978 Release, the
Commission stated that it
would ''maintain close
scrutiny to prevent
excessive mark‐ups and
take enforcement action
where appropriate.''
ln withdrawlng the risyless
principal mary‐up
disclosure proposal in the
191W helease1 the
Commission stated that it
would 44maintain close
scrutiny to prevent
excessive mary‐ups and
taye enforcement action
where appropriate.:: 20
fa ‐thtlrawing the WfUefqs
priucipA mary‐up
dRclosure proposA in the
191@ M,lease, the
ComMssioa stated that it
would amUntdn close
scrutAy to preveat
excessive m=y‐upqe at nd
tttes eaforcemebt actioa
where approphate.. 2e 0
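The character error rates quoted on this slide can be illustrated with a small sketch. It assumes the usual definition (edit distance between reference and recognized text, divided by the reference length); the slides do not state which definition was used, and the function names `edit_distance` and `error_rate` are hypothetical.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum number of insertions, deletions,
    and substitutions needed to turn ref into hyp."""
    prev = list(range(len(hyp) + 1))  # distances from the empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference items into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence, hyp: Sequence) -> float:
    """Edit distance normalized by reference length (assumed definition).
    Pass strings for a character error rate, word lists for a word error rate."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Character level: one substituted character out of four.
cer = error_rate("mark", "mary")
# Word level: one wrong word out of three.
wer = error_rate("take enforcement action".split(),
                 "taye enforcement action".split())
```

The same function serves both levels because it only compares sequence items, so the word error rates reported later in the deck could be computed the same way under this assumption.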
25. Manual transcription
dem man dirre aventivre giht
Searched text (BW)
dem zein zem dan den gein win man min dine dirre chrîe dirz dane
Amis dîner aventivre daventivre Aventivre giht gibt
27. MRR (Mean Reciprocal Rank)
The inverse of the rank of the first relevant item retrieved
Reflects the concern of a user who wishes to find one or a few good responses to a given request
In other words… Every searcher’s dream: the top search result is what s/he’s looking for!
RR=1 (first relevant item at rank 1)
RR=1/2 (rank 2)
RR=1/3 (rank 3)
...
RR=0 (no relevant item retrieved)
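The definition above can be sketched in a few lines. This is a minimal illustration of the standard measure, not the evaluation code behind the slides; the function names and the toy queries are hypothetical.

```python
from typing import Iterable, Sequence, Set, Tuple


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1/rank of the first relevant item, or 0.0 if none was retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average of the reciprocal ranks over all queries."""
    values = [reciprocal_rank(ranked, relevant) for ranked, relevant in runs]
    return sum(values) / len(values) if values else 0.0


# Three toy queries: hit at rank 1, hit at rank 2, no hit at all.
toy_runs = [
    (["d1", "d2"], {"d1"}),
    (["d3", "d1"], {"d1"}),
    (["d4"], {"d1"}),
]
mrr = mean_reciprocal_rank(toy_runs)  # (1 + 1/2 + 0) / 3
```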
30.
man # 36006.7, min # 35656.8, mat # 35452.5, nam # 35424.7, arm # 35296.2, nimt # 35278.2, gan # 35265.7
nam # 39678.5, mann # 39166.9, mit # 39134.9, mat # 39133.0, manz # 39001.1, man # 38997.0, mit # 38974.4
mat # 50135.5, nam # 50115.2, man # 50111.4, min # 50056.5, ram # 50056.4, nimt # 49839.0, mine # 49837.9
...
“dem man dirre aventivre giht”
“iwer oder decheines man”
“als man von siner helfe saget”
31. man
Scored list: man 39, min 18.02, mat 9.51, nam 5.4, miren 4, manz 3.16, maze 2.35, mann 2.08, dran 2.03, maz 1.75, dan 1.73, maht 1.65, mal 1.23, minen 0.96, erlan 0.84, meine 0.82, gan 0.81, han 0.75
Subsets:
man, min, mat, nam, arm, nimt, gan
nam, mann, mit, mat, manz, man, mit
mat, nam, man, min, ram, nimt, mine
(+) (+) (+) ...
1) Calculate scores, based on frequency & ranking within each subset
2) Sort accordingly
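The two steps above can be sketched as follows. The slides do not give the scoring formula, so this sketch swaps in an assumed one: each candidate receives the sum of its reciprocal ranks over the subsets in which it appears, which rewards both frequency and a high rank. It will not reproduce the scores shown on the slide; the function name `fuse_candidates` and the shortened toy subsets are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def fuse_candidates(subsets: Sequence[Sequence[str]]) -> List[Tuple[str, float]]:
    """Combine ranked candidate lists into a single scored, sorted list.

    ASSUMPTION: score = sum over subsets of 1/rank (not the slides' formula).
    """
    scores: Dict[str, float] = defaultdict(float)
    for subset in subsets:
        seen = set()
        for rank, word in enumerate(subset, start=1):
            if word in seen:
                continue  # count only the best rank of a word per subset
            seen.add(word)
            scores[word] += 1.0 / rank  # step 1: calculate scores
    # Step 2: sort by descending score, ties broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy subsets, shortened for illustration.
toy_subsets = [
    ["man", "min", "mat"],
    ["nam", "man", "mit"],
    ["man", "mat", "nam"],
]
fused = fuse_candidates(toy_subsets)
ranking = [word for word, _ in fused]
```

A reciprocal-rank sum is only one plausible choice; a weighted vote count or a sum of normalized recognition scores would fit the same two-step outline.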
37. Layout analysis: text line extraction with 8% error in Latin manuscripts.
Towards computer assisted paleography for complex documents
Handwriting recognition: transcription with 6% word error in SG30 and PAR23, 18% word error in GW10.
Towards text alignment and word spotting
Information retrieval: degradation of 5% for PAR23.
Towards more challenging problems
Integrate the HisDoc outcomes into tools useful for practice
We are open for new collaborations in integrated and application-oriented projects
Our methods can be integrated in your tools!
39. Printed modern English
5% error rate (character): IR degradation ‐17%
20% error rate (character): IR degradation ‐46%
Handwritten 13th century German
6% error rate (word)
▪ Clean queries (Q): IR degradation ‐5%
▪ Noisy queries (Q*): IR degradation ‐100%
▪ Noisy queries, expanded (Q*E): IR degradation ‐14%
[Figure: bar chart of IR degradation [%], scale 0 to 100. Printed modern English at 5% and 20% error; handwritten Middle High German at 6% error for Q, Q* and Q*E.]
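A degradation figure of this kind can be read as the relative change of a retrieval measure when moving from error-free text to recognized text. The slides do not state the formula, so the sketch below is an assumption; the function name `ir_degradation` and the example scores are hypothetical and chosen only to show the arithmetic.

```python
def ir_degradation(clean_score: float, noisy_score: float) -> float:
    """Relative change in percent of a retrieval measure (for example MRR)
    between error-free text and recognized text.

    ASSUMPTION: degradation = (noisy - clean) / clean * 100, so a loss in
    retrieval quality is reported as a negative number.
    """
    if clean_score <= 0:
        raise ValueError("clean_score must be positive")
    return (noisy_score - clean_score) / clean_score * 100.0


# Hypothetical scores: a baseline of 0.50 dropping to 0.415 is a loss of 17%.
example = ir_degradation(0.50, 0.415)
# If nothing relevant is retrieved any more, the loss is total.
total_loss = ir_degradation(0.50, 0.0)
```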