FreeDict is an open-source project that hosts free bilingual dictionaries, among them dictionaries for African languages. Dictionaries uploaded to FreeDict are encoded using XML standards such as TEI P5 and can then be accessed through desktop clients, a Firefox add-on, or the Web. This poster discusses the development process for FreeDict dictionaries, from simple glossaries to more complex machine-readable formats, using the example of the Swahili-English dictionary. Future plans include adding more XML features and tools to facilitate dictionary development and access.
A Repository of Free Lexical Resources for African Languages: The Project and the Method
Piotr Bański, Institute of English Studies, University of Warsaw. E-mail: pkbanski@uw.edu.pl
Beata Wójtowicz, Dept. of African Languages and Cultures, University of Warsaw. E-mail: b.wojtowicz@uw.edu.pl
Summary

Our focus here is on FreeDict [http://www.freedict.org/], a project that has the potential to become home to, among others, free bilingual dictionaries for African languages. The project is part of SourceForge.net. The dictionaries can be usable even in their early versions, which can be subject to further supervised improvement as user feedback accumulates: "publish early, publish often", in the open-source way.

We demonstrate a possible process of dictionary development on the example of one of the FreeDict dictionaries – the Swahili-English xFried/Freedict Dictionary – the first FreeDict dictionary encoded according to the TEI P5 XML standard. The final product can be accessed via desktop clients, via a Firefox add-on, or on the Web [http://dict.org].

DICT

DICT (Dictionary Server Protocol; Faith and Martin 1997) is by now a well-established TCP-based query/response protocol that allows a client to access definitions from a set of various dictionary databases. It provides data in textual form, but it also has the potential of providing MIME-encoded content. The dictionary server software, dictd, is maintained and developed at SourceForge. The DICT format is a plain text format with an accompanying index file (an option of serving MIME content also exists). There is more than one way to query a DICT database: you can search the definitions and the headwords, using regex-based criteria. The clients can be free-standing desktop applications or they can be integrated into editors or web browsers. DICT web gateways also exist. The DICT project provides a list of clients and alternative servers.
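For illustration, a minimal session against a DICT server might look as follows. The shape of the exchange follows RFC 2229 (Faith and Martin 1997, cited below), but the database name swa-eng and the exact server wording are assumptions made for the example, not a transcript from a real FreeDict server:

  C: DEFINE swa-eng alasiri
  S: 150 1 definitions retrieved
  S: 151 "alasiri" swa-eng "Swahili-English xFried/Freedict Dictionary"
  S: alasiri
  S:    afternoon
  S: .
  S: 250 ok
  C: MATCH swa-eng re "^ala"
  S: 152 1 matches found
  S: swa-eng "alasiri"
  S: .
  S: 250 ok
  C: QUIT
  S: 221 Closing Connection

The MATCH command, with a strategy such as re on dictd servers, is what provides the regex-based headword searches mentioned above.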
FreeDict

FreeDict was founded in 2000 as an expression of the natural open-source synergy with DICT: DICT provided the platform for disseminating content of all kinds of dictionaries, while FreeDict grouped bilingual dictionaries that could be disseminated on this platform. Later on, FreeDict adopted the TEI P4 XML format. At the moment, it also has basic support for TEI P5 (in the CVS only; this is work in progress).

FreeDict is the nexus of the following:
- XML, with its potential for creating well-structured documents,
- TEI P5, an encoding standard taking advantage of this potential,
- the SourceForge repository, as well as its distribution and content-management network,
- the DICT distribution network: apart from being able to query DICT servers straight from the desktop, Firefox users can also take advantage of an add-on client that returns definitions for words highlighted on a web page (an example is shown in the poster's screenshots),
- FreeDict tools, as means to manipulate dictionaries and to create, among others, the DICT format (usable directly from DICT servers and by other dictionary-providing projects, e.g., StarDict or Open Dict); the build process provides targets for platforms other than DICT, e.g. the Evolutionary Dictionary or zbedic.

Additionally:
- Lexical resources submitted to FreeDict will be able to undergo further transformations, such as reversal or concatenation, which means that work put into developing a single resource may well be re-used in developing others (a sketch of a naive reversal is given below).
- The project has its own distribution system, in the form of GNU/Linux packages.
- Content published by FreeDict is guaranteed to be free.
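As a purely hypothetical sketch of what the reversal transformation could involve (none of this is taken from the actual FreeDict tools), the naive core of a Swahili-English to English-Swahili reversal might be a stylesheet like the following, assuming un-namespaced entries with cit/quote markup as in the examples below:

  <?xml version="1.0" encoding="UTF-8"?>
  <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <!-- Naive reversal: every translation equivalent becomes a headword
         whose translation is the original headword. -->
    <xsl:template match="/">
      <body>
        <xsl:for-each select="//entry/sense/cit[@type='trans']/quote">
          <entry>
            <form>
              <orth><xsl:value-of select="normalize-space(.)"/></orth>
            </form>
            <sense>
              <cit type="trans">
                <quote><xsl:value-of select="ancestor::entry/form/orth"/></quote>
              </cit>
            </sense>
          </entry>
        </xsl:for-each>
      </body>
    </xsl:template>
  </xsl:stylesheet>

A real reversal would additionally have to merge entries that share a translation equivalent and decide what to do with notes and grammatical information, which is why it is listed among the planned experiments rather than the finished tools.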
The poster's screenshots demonstrate the CSS "work view" of the dictionary (v. 0.4.1 of March 28th) and the way in which the Firefox add-on client presents query results (the mismatches are due to the incomplete support for TEI P5 in the FreeDict build system, originally designed for TEI P4; support for P5 was introduced only in mid-March).

Why Swahili-English?

Just because we happen to be working on a Swahili-Polish-Swahili dictionary, and this is an offshoot of the testing phase of the project; we wanted to donate our test Swahili-English dictionary to FreeDict, and this is how the entire adventure began. This dictionary (in versions 0.3 and 0.4) replaced the earlier dictionary by the same name that Beata created from freely available GPL-ed sources. But our point is that any dictionary of any size can be submitted!

Below are the possible stages of development of an example entry, from the simplest glossary to something close to a machine-processable lexical database (we skip xml:lang attributes).

Stage 1 – a simple glossary entry:

  <entry>
    <form><orth>alasiri</orth></form>
    <def>afternoon</def>
  </entry>

Stage 2 – an identifier, part of speech, and a sense grouping:

  <entry xml:id="alasiri">
    <form><orth>alasiri</orth></form>
    <gramGrp><pos>n</pos></gramGrp>
    <sense>
      <def>afternoon (period between 3 p.m. and 5 p.m.)</def>
    </sense>
  </entry>

Stage 3 – the usage information separated out into a note:

  <entry xml:id="alasiri">
    <form type="N">
      <orth>alasiri</orth>
    </form>
    <gramGrp><pos>n</pos></gramGrp>
    <sense>
      <def>afternoon</def>
      <note type="def">period between 3 p.m. and 5 p.m.</note>
    </sense>
  </entry>

Stage 4 – the translation equivalent marked as a citation:

  <entry xml:id="alasiri">
    <form type="N">
      <orth>alasiri</orth>
    </form>
    <gramGrp><pos>n</pos></gramGrp>
    <sense>
      <cit type="trans">
        <quote>afternoon</quote>
        <def>period between 3 p.m. and 5 p.m.</def>
      </cit>
    </sense>
  </entry>

Each of the above is transformable into a DICT-based dictionary, accessible locally or via the Internet. And at every stage, the dictionary content can be verified in the "work view", provided by a CSS stylesheet; a sketch of such a stylesheet is given below.
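As an illustration only (the project's actual stylesheet may differ), a minimal "work view" can be surprisingly small: a browser can render the TEI XML directly once element selectors like the following are attached to it:

  entry { display: block; margin-bottom: 0.75em; }
  orth  { font-weight: bold; }
  pos   { font-style: italic; }
  pos:before  { content: " ("; }
  pos:after   { content: ")"; }
  quote { color: #005a9c; }
  def   { display: block; margin-left: 1em; }
  note:before { content: " ["; color: #777; }
  note:after  { content: "]";  color: #777; }

Such a stylesheet is attached to the dictionary file with a single xml-stylesheet processing instruction.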
Below, we present an example of an entry in a dictionary with a somewhat more detailed amount of information, and of the granularity thereof. The dictionary developer needs to fill in simple templates for the relevant parts of speech:

  <entry>
    <form>
      <orth>adui</orth>
      <ref target="#maadui"/>
    </form>
    <gramGrp>
      <pos>n</pos>
    </gramGrp>
    <def>enemy</def>
    <def>opponent <note type="hint">in games or sports</note></def>
  </entry>

Then, the predictable work is performed by XSLT scripts, which (a) add XML structure to the entry created by a developer:

  <entry xml:id="adui">
    <form>
      <orth>adui</orth>
    </form>
    <gramGrp>
      <pos>n</pos>
    </gramGrp>
    <xr type="plural-form">
      <ref target="#maadui">maadui</ref>
    </xr>
    <sense xml:id="adui.1" n="1">
      <cit type="trans">
        <quote>enemy</quote>
      </cit>
    </sense>
    <sense xml:id="adui.2" n="2">
      <cit type="trans">
        <quote>opponent</quote>
        <note type="hint">in games or sports</note>
      </cit>
    </sense>
  </entry>

and (b) create new entries (in this case, a template plural entry, containing a reference to the singular form):

  <entry xml:id="maadui">
    <form>
      <orth>maadui</orth>
    </form>
    <gramGrp><pos>n</pos></gramGrp>
    <sense>
      <def>enemy</def>
    </sense>
    <sense>
      <def>opponent</def>
      <note type="hint">in games or sports</note>
    </sense>
    <xr type="plural-sense">Plural of <ref target="#adui">adui</ref></xr>
  </entry>
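To make the division of labour concrete, here is a minimal, hypothetical sketch of the kind of template that step (a) could use. It is not the project's actual script, and it assumes un-namespaced entries shaped exactly like the developer template above:

  <?xml version="1.0" encoding="UTF-8"?>
  <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <!-- Identity transform: copy everything that is not rewritten below. -->
    <xsl:template match="@*|node()">
      <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
    </xsl:template>

    <!-- Give the entry an xml:id derived from its headword. -->
    <xsl:template match="entry">
      <entry xml:id="{form/orth}">
        <xsl:apply-templates select="node()"/>
      </entry>
    </xsl:template>

    <!-- Wrap each bare def in a numbered sense with cit/quote markup. -->
    <xsl:template match="def">
      <xsl:variable name="n" select="count(preceding-sibling::def) + 1"/>
      <sense xml:id="{ancestor::entry/form/orth}.{$n}" n="{$n}">
        <cit type="trans">
          <quote><xsl:value-of select="normalize-space(text()[1])"/></quote>
          <xsl:copy-of select="note"/>
        </cit>
      </sense>
    </xsl:template>
  </xsl:stylesheet>

Run over the adui template, this supplies the xml:id and the numbered cit/quote senses; the plural-form cross-reference and the new plural entry of step (b) would be produced by further templates keyed on the <ref> inside <form>.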
Developments planned for the near future

- After we reach version 0.5 with the cit/quote markup, we plan to start experimenting with dictionary reversal and concatenation (crossing).
- The support for LIFT (Lexicon Interchange FormaT) is next on the agenda.
- More XML technology: tools for feeding dictionaries into, and querying their contents from, native XML databases.

Selected references

Faith, Rik and Martin, Brett. 1997. A Dictionary Server Protocol. Request for Comments: 2229 (RFC #2229). Network Working Group. Available from ftp://ftp.isi.edu/in-notes/rfc2229.txt
TEI Consortium, eds. 2007. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 1.2.0. Last updated on October 31st 2008. TEI Consortium. Available from http://www.tei-c.org/Guidelines/P5/
Language Technologies for African Languages (AfLaT 2009) Workshop, EACL 2009, Athens 31 March 2009