This document summarizes the results of a technology audit of the official South African languages, conducted to assess the state of human language technology (HLT) across the 11 official languages. The audit found basic HLT resources available for some languages, but significant gaps remain. Recommendations include further developing the resources identified as gaps, improving the availability and distribution of existing resources, and obtaining government funding to support HLT, especially for under-resourced languages.
Spector, Alfred Z (Google) The Continuing Metamorphosis of ... – butest
The document discusses the continuing evolution of the web and three major advances that may occur: totally transparent processing, distributed computing, and hybrid intelligence. It provides examples of how Google has achieved extraordinary results through a bottom-up evolutionary approach, including universal search, voice search, image recognition, and translation services. Technological improvements like increased data, computing power, and machine learning techniques are enabling new capabilities that were not possible before.
IFE-MT: An English-to-Yorùbá Machine Translation System – Guy De Pauw
The document describes the development of an English to Yoruba machine translation system called IFE-MT. It discusses the theoretical and practical issues in building the system, including the differences between the languages. It outlines the data collection and annotation process. It also describes the software tools and modules used to implement the system and demonstrates its capabilities. The system is being further developed by expanding the database and evaluating the translations.
The document discusses automatic speech recognition, including identifying the main challenges such as variability in speech, separating speech from background noise, and disambiguating homophones. It describes early approaches using template matching and rule-based systems, and more modern statistical approaches using machine learning techniques like Hidden Markov Models and n-gram language models trained on large speech corpora. The statistical approaches have proven more successful by learning from data rather than relying on hand-crafted rules.
The session provides a presentation and status update on the IDN Variant TLD Program. The update covers progress on the implementation of the IDN Root LGR Procedure, the status of the Maximal Starting Repertoire, and updates from the community and the Generation Panels.
Presentation by Rob Gill (Business Development and Support Manager, Te Pou, New Zealand) at the 'Plans to Reality Forum: Exploring the new planning process for people with disabilities', held on Tuesday 15 November 2011.
www.field.org.au
iCap Investment is a professional consulting firm in Central Asia providing advisory services to local and international clients. They help investors with investment projects in Central Asia, assist local companies to improve performance, and help international organizations with development projects. Their services include investment consulting, management consulting, and research. They have experience conducting feasibility studies, due diligence, business planning, and research for clients in various industries.
GTTS System for the Spoken Web Search Task at MediaEval 2012 – MediaEval2012
The document describes the GTTS system for the Spoken Web Search Task at MediaEval 2012. The GTTS system allows for searching spoken queries against broadcast news audio using automatic speech recognition and indexing. It also enables searching parliamentary sessions by aligning audio and text. For MediaEval 2012, the GTTS system performs phonetic matching between the n-best phone decoding of queries and phone lattices of spoken resources to locate query detections. Initial experiments showed poor performance, so the approach was changed to search for each query's single best detection in each audio document.
Audiovisual collections, the spoken word and user needs of scholars in the Hu... – roelandordelman.nl
Audiovisual collections, the spoken word and user needs of scholars in the Humanities: Observations based on related work in The Netherlands 2005-2012. Presentation at the BL, Opening up Speech Archives, 08-02-2013
Terminology turbocharges your translation: From my archive before TaaS ;-) – Tatjana Gornostaja
This document discusses terminology management tools and user needs. It summarizes several surveys of translators and language professionals that found spreadsheets and word processors were commonly used for terminology work due to the lack of suitable specialized tools. Respondents wanted user-friendly collaborative tools with consolidated access to terminology resources and consistency checking. The document presents Tilde's work developing online terminology databases and services, like EuroTermBank, that could be integrated into translation environments and shared across organizations to improve consistency and productivity for terminology work.
Automatic Term Recognition with Apache Solr – JIE GAO
Automatic Term Extraction (ATE/ATR) is an important Natural Language Processing (NLP) task that deals with the extraction of terminology from domain-specific textual corpora. JATE 2.0 integrates with the Apache Solr framework to benefit from its extensive, extensible, and flexible text processing libraries; it can be used either as a separate module or as a Solr plugin during document processing to enrich the indexed documents with candidate terms. DOI: 10.13140/RG.2.1.2897.3684
Lynx Webinar #4: Lynx Services Platform (LySP) - Part 2 - The Services – Lynx Project
Free Webinar on the Lynx Services Platform LySP: Architecture and basic Services
The main objective of the Lynx research and innovation project is to create an ecosystem of smart cloud services to better manage compliance, based on a Legal Knowledge Graph (LKG) which integrates and links multilingual and heterogeneous compliance data sources, including legislation, case law, standards, regulations and private contracts, among others.
This webinar provides insights into all smart services of the Lynx Services Platform (LySP), including demos of these LySP services, such as Named Entity Extraction (NER) by DFKI, Relation Extraction and Question-Answering by SWC, Machine Translation by Tilde, and the Lexicala cross-lingual lexical data service by KDictionaries.
The document summarizes TAUS Executive Forum 2015 which will take place on April 9-10 in Tokyo. It discusses TM-Town's mission to create a better translation world through technology and specialization. Key topics covered include job matching algorithms based on natural language processing of prior work, storing linguistic assets in a "bank", and TM-Town's translation enablement platform including unlimited private storage, term extraction, and job matching based on prior work. Areas for improving existing segmentation and alignment libraries are also mentioned.
This document discusses indexing and searching noisy data.
Part I provides examples of noisy data from spelling corrections, speech recognition, machine translation and optical character recognition where information is lost. Part II discusses emerging scenarios for scholarly use of noisy metadata including discovery and exploration of large digital collections. Part III argues that metadata mining can enhance accessibility and understanding by integrating different annotation types from manual, automatic and community sources.
This document discusses standardization in accessibility and the role of standards bodies and standards. It notes that standards can promote interoperability, sharing of information, and accessibility. Open standards that anyone can read and implement are important for ongoing accessibility improvements. Developing accessibility standards requires understanding needs, experience, and active implementations. Some key ICT accessibility standards discussed include WCAG, UAAG, ATAG, and ISO 13066. The document outlines how ÆGIS will define, implement, and identify gaps in accessibility standards.
The document outlines a three-tiered model for providing assistive technology tools to support writing and organization skills. Tier 1 involves tools and strategies readily available to teachers, such as graphic organizers, dictionaries, and planners. Tier 2 involves trials of specialized equipment guided by an assistive technology specialist. The document provides examples of tools, roles and responsibilities, and data collection methods for each tier.
This document provides an overview of a course on digital speech processing. The course will cover fundamentals of speech production and perception, as well as techniques for digital speech processing including short-time Fourier analysis and linear predictive coding methods. Applications that will be discussed include speech coding, synthesis, recognition, and other speech applications involving pattern matching problems. Students will learn about representations and algorithms for processing speech signals.
This document summarizes an online information conference held by the Food and Agriculture Organization of the United Nations (FAO) on using open source systems. The FAO is working to implement an open archive system using Fedora Commons to provide access to its publications and documents. While open source systems provide benefits like low costs, standards compliance and community support, the FAO's experience showed that such projects require adequate funding, staff capacity, cross-departmental collaboration and clear goals to be successful. Both advantages and disadvantages must be considered for open source versus commercial systems.
Teaching Machines to Listen: An Introduction to Automatic Speech Recognition – Zachary S. Brown
This document provides an overview of automatic speech recognition systems and their components. It discusses:
- The main components of ASR systems including preprocessing/feature extraction, acoustic models, and language models.
- Unique challenges of working with speech data like data volume, quality, and annotation.
- Common modeling approaches: ASR systems have historically used hidden Markov models and n-gram language models, while more recent approaches use end-to-end neural networks such as Deep Speech, wav2vec, Conformer, and Whisper.
Building NLP solutions for Davidson ML Group – botsplash.com
This document provides an overview of natural language processing (NLP) and discusses various NLP applications and techniques. It covers the scope of NLP including natural language understanding, generation, and speech recognition/synthesis. Example applications mentioned include chatbots, sentiment analysis, text classification, summarization, and more. Popular Python packages for NLP like NLTK, SpaCy, and Gensim are also highlighted. Techniques like word embeddings, neural networks, and deep learning approaches to NLP are briefly outlined.
- Monitors the students and helps the teacher in handling questions and feedback.
- Can connect students to speak and manage the queue.
- Can also take over the lesson and present content.
- Enables the teacher to focus on teaching without distractions.
- Supports up to 5 Teacher Assistants per lesson.
- Teacher Assistants can be assigned to monitor specific groups of students.
- Teacher Assistants have full access to student lists, attendance and reports.
- Teacher Assistants can be promoted to become teachers.
- Teacher Assistants can also record lessons.
• Features of an Open Standard LMS - MOODLE
The document summarizes a presentation about ArchivesSpace, an open source archives management tool. It discusses ArchivesSpace's development process, technical framework, community involvement opportunities, and plans for ongoing support through the organizational home of LYRASIS. ArchivesSpace aims to provide a flexible, scalable, and standards-compliant tool for archival description and management.
The document provides an overview of using Python for bioinformatics, discussing what Python is, why it is useful for bioinformatics, how to set up Python in integrated development environments like Eclipse with PyDev, how to share code using Git and GitHub, and includes examples of Hello World and bioinformatics programs in Python. It introduces Python and argues it is well-suited for bioinformatics due to its extensive standard libraries, ease of use, and wide adoption in science. The document demonstrates how to install Python, set up an IDE, create and run simple Python programs, and use version control with Git and GitHub to collaborate on projects.
This document presents a Request for Proposal (RFP) for ESSENSE, which seeks submissions for a domain-specific language and a "Kernel of Essentials" to support flexible software engineering methods. The goal is to allow practitioners to define, refine, and customize their processes over time. The RFP outlines problems with traditional rigid processes, desires a framework like those used in agile development, and provides requirements for the Kernel (to define fundamental concepts) and the language (to compose practices into methods and represent work progress).
Technological Tools for Dictionary and Corpora Building for Minority Language... – Guy De Pauw
This project aims to build and maintain a lexicographical resource for French-based Creole languages through three main steps:
1) Compiling existing lexicographical resources like dictionaries into an electronic format
2) Creating corpora of Creole language texts from literary, educational and journalistic sources online
3) Maintaining the dictionary by analyzing the corpora to identify unknown words and improve the database through an iterative process.
The results will be a lexicographical database detailing variations in French-based Creoles and annotated corpora for linguistic research.
Semi-automated extraction of morphological grammars for Nguni with special re... – Guy De Pauw
This document summarizes research that semi-automatically extracted a morphological grammar for Southern Ndebele, an under-resourced language, from a general Nguni morphological analyzer bootstrapped from a Zulu analyzer. The Southern Ndebele analyzer produced surprisingly good results, showing significant similarities across Nguni languages that can accelerate documentation and resource development for these languages. The project followed best practices for encoding resources to ensure sustainability, access, and adaptability to future formats and platforms.
More Related Content
Similar to An HLT profile of the official South African languages
Resource-Light Bantu Part-of-Speech Tagging – Guy De Pauw
This document proposes a bag-of-substrings approach to part-of-speech tagging for under-resourced Bantu languages using available digital dictionaries and word lists instead of large annotated corpora. Experimental results showed the technique established a low-resource, high-accuracy method for bootstrapping POS tagging that compares favorably to state-of-the-art data-driven approaches. The method extracts substring features from words to train a maximum entropy classifier and bootstrap POS tagging for Bantu languages that lack extensive annotated resources.
Natural Language Processing for Amazigh Language – Guy De Pauw
The document discusses natural language processing challenges for the Amazigh (Berber) language. It outlines Amazigh language characteristics like its writing system and complex phonology/morphology. It then describes the current state of Amazigh NLP technology, including Tifinaghe encoding, optical character recognition tools, basic processing tools like transliterators and stemmers, and language resources like corpora and dictionaries. Finally, it proposes future directions such as developing larger corpora, machine translation systems, and growing human resources for Amazigh language technology.
POS Annotated 50m Corpus of Tajik Language – Guy De Pauw
This document presents a new 50+ million word corpus of the Tajik language, the largest available. It was created by crawling over a dozen Tajik news websites and other sources. The texts were joined and cleaned to remove duplicates. The corpus was then annotated with morphological analysis of Tajik using a new analyzer created by modifying an existing one to be faster and allow lemmatization. The analyzer recognizes over 87% of words and tags them with part of speech. This annotated corpus containing lemmas, tags and frequencies is available online through the Sketch Engine for researchers.
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ... – Guy De Pauw
This document describes using a metagrammar called XMG to formally capture morphological generalizations of verbs in the Ikota language. It provides an XMG specification for Ikota verbal morphology that describes subject, tense, verb root, aspect, active voice, and proximity. This specification can automatically derive a lexicon of fully inflected verb forms. The methodology allows for quickly testing ideas and validating results against language data.
Tagging and Verifying an Amharic News Corpus – Guy De Pauw
This document summarizes an Amharic news corpus tagging and verification project. It discusses the Amharic language background, the corpus creation from Ethiopian news sources, the manual tagging process, previous tagging experiments, and the current efforts to clean and re-tag the corpus which involves removing errors and inconsistencies from the original tagging. Baseline tagging performance on the corpus using different part-of-speech tagsets ranges from 58.3% to 90.8% correct depending on the tagset and machine learning approach used.
This document describes the process of constructing a corpus of spoken and written Santome, a Portuguese-related creole language spoken in Sao Tome and Principe. The corpus contains over 184,000 words from written sources like newspapers and books, as well as transcribed spoken recordings. Efforts were made to standardize the orthography and develop part-of-speech tags for annotation. Metadata is encoded for each text, and the corpus will be made available through a concordancing tool to allow searches while copyright permissions are obtained. The goal is for this and related Gulf of Guinea creole corpora to enable comparative linguistic research.
Automatic Structuring and Correction Suggestion System for Hungarian Clinical... – Guy De Pauw
The document describes a system for automatically structuring and correcting Hungarian clinical records. The system separates clinical records into structured XML elements, tags metadata, and separates text from tables. It also corrects spelling errors using language models and weighted edit distances to generate and score candidate corrections. Evaluation showed the system could provide the right correction in the top 5 suggestions for 99% of errors. Areas for improvement include handling insertion/deletion errors and using larger language resources to better handle non-standard usage.
Compiling Apertium Dictionaries with HFST – Guy De Pauw
This document discusses compiling Apertium dictionaries with HFST to leverage generalised compilation formulas and get more applications from fewer language descriptions. Compiling Apertium dictionaries natively in HFST provides benefits like uniform compilation across tools, improved resulting automata using HFST algorithms, and integrated complex finite-state morphology features. Additional applications like spellcheckers can also be automatically generated from the dictionaries.
The Database of Modern Icelandic Inflection – Guy De Pauw
The Database of Modern Icelandic Inflection (DMII) is a database that stores the full inflectional forms of Icelandic words. It contains over 5.8 million inflected forms. The DMII aims to represent Icelandic inflection accurately without overgeneration by including all inflected forms and variants. A rule-based system was not feasible due to insufficient data and the tendency for rules to overgenerate. The DMII supports language technology projects and is accessible online for the general public.
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming – Guy De Pauw
This document discusses learning Amharic verb morphology using inductive logic programming (ILP). Amharic verbs are complex, conveying information about subject, object, tense, aspect, mood and more through affixation, reduplication and compounding. The authors apply ILP to learn morphological rules from a training set of 216 Amharic verbs. They achieve 86.9% accuracy on a test set of 1,784 verb forms. Key challenges include a lack of similar examples in the training data and learning inappropriate alternation rules. This work contributes to advancing the automatic learning of morphology for under-resourced languages like Amharic.
Issues in Designing a Corpus of Spoken Irish – Guy De Pauw
The document discusses the design of a corpus of spoken Irish. It outlines the linguistic background of Irish and issues with existing spoken language resources. It then describes the pilot corpus, including data collection from podcasts and conversations. Transcription guidelines were adapted from CHAT and LDC conventions to balance accuracy with transcription speed. The goal is to create a large, balanced corpus to support research and language preservation.
How to build language technology resources for the next 100 years – Guy De Pauw
The document discusses how to build sustainable language technology resources for lesser-resourced languages over the next 100 years. It outlines a vision of linguistic diversity and language survival. Key challenges include limited resources, small language communities, and technological limitations. Approaches proposed to work around these include minimizing redundant work, maximizing reuse of resources, building user and developer communities, and preparing resources to work with future technologies. Specific topics covered are types of language technology resources, issues around character encoding, text input methods, and future-proofing keyboard layouts and recognition technologies for many languages.
Towards Standardizing Evaluation Test Sets for Compound Analysers – Guy De Pauw
The document proposes standardizing test sets for evaluating compound analyzers by establishing parameters for a standard test set. It discusses evaluating compound analyzers on different sized test sets containing compound words, non-compound words, and error words. Experiments compare analyzer performance on test sets of varying sizes, finding sizes below 250 words are too small and sizes above 1250 words show no significant differences in results. The proposed standard test set consists of 500 examples each of compounds, non-compounds, and errors for a total of 1500 words.
The PALDO Concept - New Paradigms for African Language Resource Development – Guy De Pauw
The document discusses new paradigms for developing African language resources through the Pan African Living Dictionary Online (PALDO) project. The paradigms include open community participation under scholarly supervision, paying for data development and making the data freely available, and linking monolingual dictionaries for multiple languages by concept to create rich resources for each language.
A System for the Recognition of Handwritten Yorùbá Characters – Guy De Pauw
1. The document presents a system for recognizing handwritten Yoruba characters using a Bayesian classifier and decision tree approach.
2. Key stages of the system include preprocessing, segmentation, feature extraction, Bayesian classification, decision tree processing, and result fusion.
3. The system was tested on independent and non-independent character samples, achieving recognition rates of 91.18% and 94.44% respectively.
A Number to Yorùbá Text Transcription System – Guy De Pauw
The document describes a system for converting Arabic numbers to their Yoruba lexical equivalents. It discusses Yoruba numerals and their derivation using addition, subtraction and multiplication. A computational model is presented using pushdown automata to capture the number conversion. The system was implemented in Python and evaluated using Mean Opinion Score testing. Examples of number conversions like 19679 are provided to demonstrate the system.
Bilingual Data Mining for the English-Amharic Statistical Machine Translation... – Guy De Pauw
This document discusses experiments conducted on developing an English to Amharic statistical machine translation system. It summarizes the collection and processing of two parallel corpora: (1) An ENA news corpus containing over 35,000 documents and 500,000 words aligned at the sentence level. (2) A parliamentary corpus containing over 1,200 documents and 500,000 words aligned at the sentence level. The document outlines challenges in aligning the corpora and reports on automatic alignment results. It concludes that further increasing corpus size and integrating linguistic knowledge can help improve translation quality.
This talk will cover ScyllaDB Architecture from the cluster-level view and zoom in on data distribution and internal node architecture. In the process, we will learn the secret sauce used to get ScyllaDB's high availability and superior performance. We will also touch on the upcoming changes to ScyllaDB architecture, moving to strongly consistent metadata and tablets.
Discover top-tier mobile app development services, offering innovative solutions for iOS and Android. Enhance your business with custom, user-friendly mobile applications.
AppSec PNW: Android and iOS Application Security with MobSF – Ajin Abraham
Mobile Security Framework - MobSF is a free and open source automated mobile application security testing environment designed to help security engineers, researchers, developers, and penetration testers to identify security vulnerabilities, malicious behaviours and privacy concerns in mobile applications using static and dynamic analysis. It supports all the popular mobile application binaries and source code formats built for Android and iOS devices. In addition to automated security assessment, it also offers an interactive testing environment to build and execute scenario based test/fuzz cases against the application.
This talk covers:
Using MobSF for static analysis of mobile applications.
Interactive dynamic security assessment of Android and iOS applications.
Solving Mobile app CTF challenges.
Reverse engineering and runtime analysis of Mobile malware.
How to shift left and integrate MobSF/mobsfscan SAST and DAST in your build pipeline.
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an... – Jason Yip
The typical problem in product engineering is not bad strategy, so much as “no strategy”. This leads to confusion, lack of motivation, and incoherent action. The next time you look for a strategy and find an empty space, instead of waiting for it to be filled, I will show you how to fill it in yourself. If you’re wrong, it forces a correction. If you’re right, it helps create focus. I’ll share how I’ve approached this in the past, both what works and lessons for what didn’t work so well.
How information systems are built or acquired puts information, which is what they should be about, in a secondary place. Our language adapted accordingly, and we no longer talk about information systems but applications. Applications evolved in a way that breaks data into diverse fragments, tightly coupled with applications and expensive to integrate. The result is technical debt, which is repaid by taking even bigger "loans", resulting in ever-increasing technical debt. Software engineering and procurement practices work in sync with market forces to maintain this trend. This talk demonstrates how natural this situation is. The question is: can something be done to reverse the trend?
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...Fwdays
Direct losses from one minute of downtime: $5–$10 thousand. Reputation is priceless.
As part of the talk, we will consider the architectural strategies necessary for the development of highly loaded fintech solutions. We will focus on using queues and streaming to efficiently process and manage large amounts of data in real time and to minimize latency.
We will focus special attention on the architectural patterns used in the design of the fintech system, microservices and event-driven architecture, which ensure scalability, fault tolerance, and consistency of the entire system.
"NATO Hackathon Winner: AI-Powered Drug Search", Taras KlobaFwdays
This is a session that details how PostgreSQL's features and Azure AI Services can be effectively used to significantly enhance the search functionality in any application.
In this session, we'll share insights on how we used PostgreSQL to facilitate precise searches across multiple fields in our mobile application. The techniques include using LIKE and ILIKE operators and integrating a trigram-based search to handle potential misspellings, thereby increasing the search accuracy.
We'll also discuss how the azure_ai extension on PostgreSQL databases in Azure and Azure AI Services were utilized to create vectors from user input, a feature beneficial when users wish to find specific items based on text prompts. While our application's case study involves a drug search, the techniques and principles shared in this session can be adapted to improve search functionality in a wide range of applications. Join us to learn how PostgreSQL and Azure AI can be harnessed to enhance your application's search capability.
From Natural Language to Structured Solr Queries using LLMs – Sease
This talk draws on experimentation to enable AI applications with Solr. One important use case is to use AI for better accessibility and discoverability of the data: while User eXperience techniques, lexical search improvements, and data harmonization can take organizations to a good level of accessibility, a structural (or "cognitive") gap remains between data users' needs and data producers' constraints.
That is where AI – and most importantly, Natural Language Processing and Large Language Model techniques – could make a difference. This natural language, conversational engine could facilitate access and usage of the data leveraging the semantics of any data source.
The objective of the presentation is to propose a technical approach and a way forward to achieve this goal.
The key concept is to enable users to express their search queries in natural language, which the LLM then enriches, interprets, and translates into structured queries based on the Solr index’s metadata.
This approach leverages the LLM’s ability to understand the nuances of natural language and the structure of documents within Apache Solr.
The LLM acts as an intermediary agent, offering a transparent experience to users automatically and potentially uncovering relevant documents that conventional search methods might overlook. The presentation will include the results of this experimental work, lessons learned, best practices, and the scope of future work that should improve the approach and make it production-ready.
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor IvaniukFwdays
At this talk we will discuss DDoS protection tools and best practices, network architectures, and what AWS has to offer. We will also look into one of the largest DDoS attacks on Ukrainian infrastructure, which happened in February 2022: we'll see what techniques helped to keep the web resources available for Ukrainians, and how AWS improved DDoS protection for all customers based on the Ukraine experience.
QA or the Highway - Component Testing: Bridging the gap between frontend appl... – zjhamm304
These are the slides for the presentation, "Component Testing: Bridging the gap between frontend applications" that was presented at QA or the Highway 2024 in Columbus, OH by Zachary Hamm.
"Choosing proper type of scaling", Olena SyrotaFwdays
Imagine an IoT processing system that is already quite mature and production-ready, whose client coverage is growing, and for which scaling and performance are life-and-death questions. The system has Redis, MongoDB, and stream processing based on ksqldb. In this talk, we will first analyze scaling approaches and then select the proper ones for our system.
What is an RPA CoE? Session 1 – CoE Vision – DianaGray10
In the first session, we will review the organization's vision and how this has an impact on the COE Structure.
Topics covered:
• The role of a steering committee
• How do the organization’s priorities determine CoE Structure?
Speaker:
Chris Bolin, Senior Intelligent Automation Architect Anika Systems
Introduction of Cybersecurity with OSS at Code Europe 2024 – Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
An HLT profile of the official South African languages
1. An HLT profile of the official South African languages
Aditi Sharma Grover¹,², Gerhard B van Huyssteen¹,³ & Marthinus W. Pretorius²
¹HLT Research Group, CSIR, South Africa
²Graduate School of Technology Management, University of Pretoria, South Africa
³Centre for Text Technology (CTexT), North-West University, South Africa
2. Overview
• Background
• Process
• Results
• Conclusion
3. Background
South African HLT landscape
• 11 official languages
• HLT community
– R&D community (universities & science councils)
– Very few private sector companies
• Various government initiatives
– DST: HLT road-mapping process, NHN
– DAC: HLT strategy, National Centre for HLT
– NRF: research funding
4. Background
Challenge
• SA has not yet capitalised on opportunities to create a thriving HLT industry
– Lack of awareness within the local HLT community
• Perpetuated by perceived fragmentation of South African R&D activities
– Lack of a unified technological profile of HLT activities across the 11 languages
• 2009: a technology audit for the South African HLT landscape (SAHLTA)
– Align R&D activities and stimulate cooperation
– Similar to Dutch (BLaRK), EuroMap
5. Process
SAHLTA Process
Terminology criteria → Cursory inventory → Questionnaire → Audit workshop
6. Process
SAHLTA Process
Phase 1
Terminology criteria → Cursory inventory
• Establish lingua franca
• Consolidate prior knowledge regarding data, modules, applications, and platforms/tools
7. Process
SAHLTA Process
Phase 2
Questionnaire → Audit workshop → Priorities
8. Results
Prioritisation: Preliminary HLT Priorities
• Based on international trends, local needs, and feasibility
• Priority 1: basic & robust core HLT technology applications, modules and data
• Priority 2, 3: LRs that further enhance and complement the core LRs (priority 1), and base their development on a strong foundation of core HLT LRs
– Many advanced HLT applications are priority 2, 3
• Verification by the larger SA HLT community
– Need to be updated regularly
9. Results
Preliminary HLT Priorities
Priority 1: Applications
Text:
• Proofing tools
• Information Extraction
• Information Retrieval
• Human-aided machine translation
• Machine-aided human translation
Speech:
• Accessibility
• Telephony applications
• Computer-assisted language learning
• Voice search
• Audio management
16. Results
Maturity Index
• Maturity stages:
– Under development (UD), Alpha version (AV), Beta version (BV), Released (RV)
• Maturity Index
– A measure of the maturity of the HLT components in a language
– Considers the maturity stage of an item against the relative importance of each maturity stage
– Maturity Index = (1·UD + 2·AV + 4·BV + 8·RV) / Σ(weights of the maturity stages)
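To make the arithmetic concrete, here is a minimal sketch (one reading of the slide, not the audit's actual code) that treats the Maturity Index as a weighted count of items per maturity stage, normalised by the sum of the stage weights; the stage counts in the example are hypothetical.

```python
# Minimal sketch of the Maturity Index as a weighted count of items per
# maturity stage, normalised by the sum of the stage weights.
# Stage weights as on the slide: UD=1, AV=2, BV=4, RV=8.

MATURITY_WEIGHTS = {"UD": 1, "AV": 2, "BV": 4, "RV": 8}

def maturity_index(stage_counts):
    """stage_counts maps a maturity stage to the number of items in that stage."""
    weighted = sum(MATURITY_WEIGHTS[stage] * n for stage, n in stage_counts.items())
    return weighted / sum(MATURITY_WEIGHTS.values())  # denominator: 1+2+4+8 = 15

# Hypothetical language: 3 items under development, 1 alpha, 2 beta, 4 released.
print(maturity_index({"UD": 3, "AV": 1, "BV": 2, "RV": 4}))  # (3+2+8+32)/15 = 3.0
```

On this reading, released components dominate the score, which matches the slide's statement that the weights reflect the relative importance of each maturity stage.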
18. Results
Accessibility Index
• Accessibility stages:
– Unspecified (UN), Not available (NA; proprietary or contract R&D), Research and education (RE), Available for commercial purposes (CO), Available for commercial purposes and R&E (CRE)
• Accessibility Index
– A measure of the accessibility of the HLT components in a language
– Considers the accessibility stage of an item against the relative importance of each accessibility stage
– Accessibility Index = (1·UN + 2·NA + 4·RE + 8·CO + 12·CRE) / Σ(weights of the accessibility stages)
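The Accessibility Index follows the same weighted-average pattern; a matching sketch under the same assumptions, using the accessibility stage weights given above (again hypothetical code, not the audit's implementation):

```python
# Same weighted-average reading as the maturity sketch above, with the
# accessibility stage weights from the slide: UN=1, NA=2, RE=4, CO=8, CRE=12.

ACCESS_WEIGHTS = {"UN": 1, "NA": 2, "RE": 4, "CO": 8, "CRE": 12}

def accessibility_index(stage_counts):
    """stage_counts maps an accessibility stage to the number of items in that stage."""
    weighted = sum(ACCESS_WEIGHTS[stage] * n for stage, n in stage_counts.items())
    return weighted / sum(ACCESS_WEIGHTS.values())  # denominator: 1+2+4+8+12 = 27
```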
20. Results
HLT Language Index
• Impressionistic index that relatively ranks languages based on the total quantity of HLT activity per language
• Considers the stage of maturity and accessibility of all the HLT components
• HLT Language Index = Maturity Index (per language, all components) + Accessibility Index
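Read literally from the slide, the language index simply adds the two indexes computed over all of a language's components; a sketch combining the two helpers above (hypothetical, as before):

```python
# Per language, across all components: Maturity Index + Accessibility Index.

def hlt_language_index(maturity_counts, access_counts):
    return maturity_index(maturity_counts) + accessibility_index(access_counts)
```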
22. Results
HLT Component Indexes
• Alternative perspective: the quantity of activity taking place within each of the data, modules, and applications at an HLT component grouping level (e.g. pronunciation resources)
24. Results
HLT Detailed Inventory
Legend:
• Item exists, is accessible, released, and of fairly adequate quality
• Item may exist, but is available for restricted use, not released, or of limited quality
• Items do not exist
• '–': category is not applicable to the language
25. Results
Gap Analysis (speech)
Legend:
• Item exists, is accessible, released, and of fairly adequate quality
• Item may exist, but is available for restricted use, or not released / of limited quality
• Items do not exist
• '–': category not applicable to the language
26. Results
SAHLTA Outcomes
• A SAHLTA online database of LRs and applications (alpha): www.meraka.org.za/nhnaudit
28. Conclusion
Summary
• Few resources available, mostly of a basic nature
• Several factors influence this:
– HLT expert knowledge and interests
– Availability of data resources
– Market needs of a language
– Relatedness to other world languages
29. Conclusion
Recommendations
• Further resource development based on gap analysis
– Also of more advanced LRs
• Availability and distribution of existing LRs
– To enable usage, licensing agreements need to be in place
• Funding: support by government in formative years
– Also industry stimulation programmes (e.g. support for R&D consortia)
• Collaborations: across SA and internationally, also based on gap analysis
• Human capital development (HCD): scientific & technical, across silos of academic disciplines, especially for lesser-resourced languages
30. Conclusion
Acknowledgments
• DST – project sponsorship
• Prof Sonja Bosch & Prof Laurette Pretorius – results of the 2008 BLaRK survey
• Audit mini-workshop contributors
– Prof. Danie Prinsloo (UP), Prof. Sonja Bosch (UNISA), Mr. Martin Puttkammer (NWU), Prof. Gerhard van Huyssteen (CSIR), Prof. Etienne Barnard (CSIR), Dr. Febe de Wet (US), Dr. Marelie Davel (CSIR)
• Numerous audit participants
• Various HLT RG members – guidance and support
www.meraka.org.za/nhnaudit