This project aims to build and maintain a lexicographical resource for French-based Creole languages through three main steps:
1) Compiling existing lexicographical resources like dictionaries into an electronic format
2) Creating corpora of Creole language texts from literary, educational and journalistic sources online
3) Maintaining the dictionary by analyzing the corpora to identify unknown words and improve the database through an iterative process.
The results will be a lexicographical database detailing variations in French-based Creoles and an annotated corpus for linguistic research.
POS Annotated 50m Corpus of Tajik Language (Guy De Pauw)
This document presents a new 50+ million word corpus of the Tajik language, the largest available. It was created by crawling over a dozen Tajik news websites and other sources. The texts were joined and cleaned to remove duplicates. The corpus was then annotated with morphological analysis of Tajik using a new analyzer created by modifying an existing one to be faster and allow lemmatization. The analyzer recognizes over 87% of words and tags them with part of speech. This annotated corpus containing lemmas, tags and frequencies is available online through the Sketch Engine for researchers.
CLIN 2012: DutchSemCor, Building a semantically annotated corpus for Dutch (Rubén Izquierdo Beviá)
The document summarizes the DutchSemCor project, which aims to build a large semantically annotated corpus for Dutch. It discusses the goals of annotating senses from the Cornetto database and WordNet Domains, as well as named entities. It describes the multi-phase process including an initial phase of annotating examples for 3,000 frequent words, followed by training word sense disambiguation systems using this data and an active learning phase. Initial results are presented for the UKB and Tilburg WSD systems, showing improvements from adding additional features and context. Future work includes optimizing these systems and applying the best system to fully annotate the 500,000+ token corpora.
This document summarizes research on part-of-speech tagging for Arabic language text. It describes constructing an automatic lexicon for Arabic and an HMM-based POS tagger called APT. APT assigns part-of-speech tags to words by searching roots in a lexicon and applying tags based on affixes through stemming. The statistical tagger achieved 90% accuracy and the HMM tagger achieved 97% F-measure for POS tagging Arabic text. Part-of-speech tagging is an important application in natural language processing.
The document discusses part-of-speech tagging of Arabic text. It applies the Brill tagger, a machine learning tagger originally designed for English, to tag Modern Standard Arabic text. The author modifies the Brill tagger and a version of the Khoja tagset to tag an Arabic corpus obtained from Jordanian magazines. Rules are produced to tag words based on their lexical structure and context. The tagging accuracy is approximately 84% for known and unknown words, which is considered promising given the complexity of Arabic.
1) The use of corpus linguistics in lexicography began in the 1950s as a response to the need for empirical language data to analyze words, meanings, and usage. Early corpora like the Brown Corpus and LOB Corpus compiled millions of words but were limited.
2) John Sinclair pioneered the use of larger electronic corpora and the analysis of collocations. Major corpora like the British National Corpus provided balanced samples of written and spoken English totaling over 100 million words.
3) Corpora revolutionized lexicography by providing large samples of authentic language use. Lexicographers can now obtain frequency and context data to define words accurately and identify collocations rather than relying on manual methods. This made lexicography far more empirical and systematic.
Introductory lecture on Corpus Linguistics. Contents: Corpus linguistics: past and present, What is a corpus?, Why use computers to study language? Corpus-based vs. Intuition-based approach, Theory vs. Methodology.
This lecture was based on McEnery et al. 2006. Corpus-Based Language Studies: An Advanced Resource Book. Routledge.
This document provides an introduction to linguistics and summarizes several key concepts and theories. It covers the main branches of linguistics like phonetics, phonology, morphology, syntax, semantics, and pragmatics. Some key topics discussed include morphemes, language acquisition theories, Chomsky's work on universal grammar, and the history and development of the English language. The document also defines several important linguistic terms and outlines different approaches to teaching English as a second language.
Historical linguistics studies language change over time by examining how languages have evolved from earlier stages to their modern forms. Languages are constantly changing in their phonology, morphology, syntax, and lexicon. Historical linguists use the comparative method and reconstruct ancestral proto-languages to show genetic relationships between languages and trace their development from a common ancestor. Sound changes from one language stage to another usually follow regular phonetic processes.
Laura Welcher - The Rosetta Project and The Language Commons (longnow)
This document provides information about the Rosetta Project and its goal of creating a 10,000 year library of all human languages. It discusses the motivation to preserve languages and cultural knowledge for future generations. Specific initiatives described include creating Rosetta Disks with parallel text in multiple languages, building an open digital collection of language resources, and developing the proposed Language Commons Encyclopedia of Human Language to aggregate information on all 6,900 human languages. The role of the Long Now Foundation in supporting these initiatives is also outlined.
The document discusses how languages change over time through natural processes. It notes that after 1,000 years, languages diverge to the point of no longer being mutually intelligible, and after 10,000 years the relationship becomes indistinguishable from unrelated languages. The rate of change varies, but systematic sound changes and borrowing are the main drivers of divergence. The comparative method is used to reconstruct ancestral languages and classify languages into families based on regular sound correspondences.
This document provides an introduction to phonetics and spectrogram reading. It discusses why spectrogram reading is useful for understanding speech despite not being commonly used in jobs. It then defines key terms in phonetics like phonemes, allophones, distinctive features, and the speech production apparatus. Examples are given to illustrate terms like coarticulation and how features describe similarities between sounds.
This document discusses the story of Holopedia, an attempt to create a Wikipedia in Taiwanese (Min Nan). It describes the major failures experienced in getting a Taiwanese-language Wikipedia approved and operational. The initiators purchased a domain and installed MediaWiki software but faced challenges with updates, connectivity to other Wikipedias, and digitizing before standardizing the language. The experience led to the emergence of other Chinese dialect Wikipedias and the accumulation of new knowledge in Holopedia about Hakka culture and language.
INTERDISCIPLINARY CHALLENGE USING TEI FOR BUILDING MULTILINGUAL DIGITAL CORPORA: A MAGHREB DH CASE STUDY. Digital Humanities Institute – Beirut, American University of Beirut. 2-6 March 2015
Exploring rhetoric in the Electronic Enlightenment (Martin Wynne)
An exploration of the steps necessary to prepare a corpus of (mainly) eighteenth century correspondence and make it available for interactive exploration of linguistic and stylistic features.
- Languages change over time through natural processes like sound changes in casual speech that become conventionalized, as well as language acquisition by children and language contact between groups.
- The rate of change varies, but after 1,000 years languages become mutually unintelligible, and after 10,000 years the relationship is indistinguishable from unrelated languages.
- Comparative methods are used to reconstruct proto-languages and classify language families by examining regular sound correspondences between related languages. For example, Grimm's Law describes consonant changes between Proto-Indo-European and Proto-Germanic languages.
07 - Sociolinguistics.pdf, based on slides (JoseCotes7)
Sociolinguistics is the study of how language interacts with social factors like ethnicity, class, region, age and gender. It examines language use in real-world contexts as opposed to the theoretical aspects of a language. Sociolinguistics analyzes linguistic variation between dialects, idiolects and sociolects and how social meanings become attached to certain linguistic forms. Regional dialects vary according to geography and develop when groups are separated socially or geographically over time. A standard dialect becomes associated with power and prestige in a society but what is considered standard is determined by historical and social factors, not the linguistic system itself.
Semi-automated extraction of morphological grammars for Nguni with special re... (Guy De Pauw)
This document summarizes research that semi-automatically extracted a morphological grammar for Southern Ndebele, an under-resourced language, from a general Nguni morphological analyzer bootstrapped from a Zulu analyzer. The Southern Ndebele analyzer produced surprisingly good results, showing significant similarities across Nguni languages that can accelerate documentation and resource development for these languages. The project followed best practices for encoding resources to ensure sustainability, access, and adaptability to future formats and platforms.
Resource-Light Bantu Part-of-Speech Tagging (Guy De Pauw)
This document proposes a bag-of-substrings approach to part-of-speech tagging for under-resourced Bantu languages using available digital dictionaries and word lists instead of large annotated corpora. Experimental results showed the technique established a low-resource, high accuracy method for bootstrapping POS tagging that compares favorably to state-of-the-art data-driven approaches. The method extracts substring features from words to train a maximum entropy classifier and bootstrap POS tagging for Bantu languages that lack extensive annotated resources.
Natural Language Processing for Amazigh Language (Guy De Pauw)
The document discusses natural language processing challenges for the Amazigh (Berber) language. It outlines Amazigh language characteristics like its writing system and complex phonology/morphology. It then describes the current state of Amazigh NLP technology, including Tifinaghe encoding, optical character recognition tools, basic processing tools like transliterators and stemmers, and language resources like corpora and dictionaries. Finally, it proposes future directions such as developing larger corpora, machine translation systems, and growing human resources for Amazigh language technology.
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ... (Guy De Pauw)
This document describes using a metagrammar called XMG to formally capture morphological generalizations of verbs in the Ikota language. It provides an XMG specification for Ikota verbal morphology that describes subject, tense, verb root, aspect, active voice, and proximity. This specification can automatically derive a lexicon of fully inflected verb forms. The methodology allows for quickly testing ideas and validating results against language data.
Tagging and Verifying an Amharic News Corpus (Guy De Pauw)
This document summarizes an Amharic news corpus tagging and verification project. It discusses the Amharic language background, the corpus creation from Ethiopian news sources, the manual tagging process, previous tagging experiments, and the current efforts to clean and re-tag the corpus which involves removing errors and inconsistencies from the original tagging. Baseline tagging performance on the corpus using different part-of-speech tagsets ranges from 58.3% to 90.8% correct depending on the tagset and machine learning approach used.
This document describes the process of constructing a corpus of spoken and written Santome, a Portuguese-related creole language spoken in Sao Tome and Principe. The corpus contains over 184,000 words from written sources like newspapers and books, as well as transcribed spoken recordings. Efforts were made to standardize the orthography and develop part-of-speech tags for annotation. Metadata is encoded for each text, and the corpus will be made available through a concordancing tool to allow searches while copyright permissions are obtained. The goal is for this and related Gulf of Guinea creole corpora to enable comparative linguistic research.
Automatic Structuring and Correction Suggestion System for Hungarian Clinical... (Guy De Pauw)
The document describes a system for automatically structuring and correcting Hungarian clinical records. The system separates clinical records into structured XML elements, tags metadata, and separates text from tables. It also corrects spelling errors using language models and weighted edit distances to generate and score candidate corrections. Evaluation showed the system could provide the right correction in the top 5 suggestions for 99% of errors. Areas for improvement include handling insertion/deletion errors and using larger language resources to better handle non-standard usage.
Compiling Apertium Dictionaries with HFST (Guy De Pauw)
This document discusses compiling Apertium dictionaries with HFST to leverage generalised compilation formulas and get more applications from fewer language descriptions. Compiling Apertium dictionaries natively in HFST provides benefits like uniform compilation across tools, improved resulting automata using HFST algorithms, and integrated complex finite-state morphology features. Additional applications like spellcheckers can also be automatically generated from the dictionaries.
The Database of Modern Icelandic Inflection (Guy De Pauw)
The Database of Modern Icelandic Inflection (DMII) is a database that stores the full inflectional forms of Icelandic words. It contains over 5.8 million inflected forms. The DMII aims to represent Icelandic inflection accurately without overgeneration by including all inflected forms and variants. A rule-based system was not feasible due to insufficient data and the tendency for rules to overgenerate. The DMII supports language technology projects and is accessible online for the general public.
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming (Guy De Pauw)
This document discusses learning Amharic verb morphology using inductive logic programming (ILP). Amharic verbs are complex, conveying information about subject, object, tense, aspect, mood and more through affixation, reduplication and compounding. The authors apply ILP to learn morphological rules from a training set of 216 Amharic verbs. They achieve 86.9% accuracy on a test set of 1,784 verb forms. Key challenges include a lack of similar examples in the training data and learning inappropriate alternation rules. This work contributes to advancing the automatic learning of morphology for under-resourced languages like Amharic.
Issues in Designing a Corpus of Spoken Irish (Guy De Pauw)
The document discusses the design of a corpus of spoken Irish. It outlines the linguistic background of Irish and issues with existing spoken language resources. It then describes the pilot corpus, including data collection from podcasts and conversations. Transcription guidelines were adapted from CHAT and LDC conventions to balance accuracy with transcription speed. The goal is to create a large, balanced corpus to support research and language preservation.
How to build language technology resources for the next 100 years (Guy De Pauw)
The document discusses how to build sustainable language technology resources for lesser-resourced languages over the next 100 years. It outlines a vision of linguistic diversity and language survival. Key challenges include limited resources, small language communities, and technological limitations. Approaches proposed to work around these include minimizing redundant work, maximizing reuse of resources, building user and developer communities, and preparing resources to work with future technologies. Specific topics covered are types of language technology resources, issues around character encoding, text input methods, and future-proofing keyboard layouts and recognition technologies for many languages.
Towards Standardizing Evaluation Test Sets for Compound Analysers (Guy De Pauw)
The document proposes standardizing test sets for evaluating compound analyzers by establishing parameters for a standard test set. It discusses evaluating compound analyzers on different sized test sets containing compound words, non-compound words, and error words. Experiments compare analyzer performance on test sets of varying sizes, finding sizes below 250 words are too small and sizes above 1250 words show no significant differences in results. The proposed standard test set consists of 500 examples each of compounds, non-compounds, and errors for a total of 1500 words.
The PALDO Concept - New Paradigms for African Language Resource Development (Guy De Pauw)
The document discusses new paradigms for developing African language resources through the Pan African Living Dictionary Online (PALDO) project. The paradigms include open community participation under scholarly supervision, paying for data development and making the data freely available, and linking monolingual dictionaries for multiple languages by concept to create rich resources for each language.
A System for the Recognition of Handwritten Yorùbá Characters (Guy De Pauw)
1. The document presents a system for recognizing handwritten Yoruba characters using a Bayesian classifier and decision tree approach.
2. Key stages of the system include preprocessing, segmentation, feature extraction, Bayesian classification, decision tree processing, and result fusion.
3. The system was tested on independent and non-independent character samples, achieving recognition rates of 91.18% and 94.44% respectively.
IFE-MT: An English-to-Yorùbá Machine Translation System (Guy De Pauw)
The document describes the development of an English to Yoruba machine translation system called IFE-MT. It discusses the theoretical and practical issues in building the system, including the differences between the languages. It outlines the data collection and annotation process. It also describes the software tools and modules used to implement the system and demonstrates its capabilities. The system is being further developed by expanding the database and evaluating the translations.
A Number to Yorùbá Text Transcription System (Guy De Pauw)
The document describes a system for converting Arabic numbers to their Yoruba lexical equivalents. It discusses Yoruba numerals and their derivation using addition, subtraction and multiplication. A computational model is presented using pushdown automata to capture the number conversion. The system was implemented in Python and evaluated using Mean Opinion Score testing. Examples of number conversions like 19679 are provided to demonstrate the system.
Bilingual Data Mining for the English-Amharic Statistical Machine Translation... (Guy De Pauw)
This document discusses experiments conducted on developing an English to Amharic statistical machine translation system. It summarizes the collection and processing of two parallel corpora: (1) An ENA news corpus containing over 35,000 documents and 500,000 words aligned at the sentence level. (2) A parliamentary corpus containing over 1,200 documents and 500,000 words aligned at the sentence level. The document outlines challenges in aligning the corpora and reports on automatic alignment results. It concludes that further increasing corpus size and integrating linguistic knowledge can help improve translation quality.
Technological Tools for Dictionary and Corpora Building for Minority Languages: Example of the French-based Creoles
Technological tools for dictionary and corpora building for minority languages: example of the French-based Creoles
Paola Carrión Gonzalez(1,2), Emmanuel Cartier(1)
(1)LDI, CNRS UMR 7187, Université Paris 13 PRES Paris Sorbonne Paris Cité
(2)Departamento de Traducción e Interpretación, Facultad de Filosofía Y Letras, Universidad de Alicante
E-mail: pccg1@alu.ua.es, ecartier@ldi.univ-paris13.fr
We present a project which aims at building and maintaining a lexicographical resource of contemporary French-based creoles, still considered minority languages, especially those of the American-Caribbean zone. These objectives are achieved through three main steps: 1) compilation of existing lexicographical resources (lexicons and digitized dictionaries available on the Internet); 2) constitution of a corpus of Creole languages with literary, educational and journalistic documents, some of them retrieved automatically with web spiders; 3) dictionary maintenance, through automatic morphosyntactic analysis of the corpus and determination of the frequency of unknown words. These unknown words will help us improve the database by searching relevant lexical resources that we had not included before. This final task can be done iteratively in order to complete the database and show language variations within the same Creole-speaking community. Practical results of this work will consist in 1/ a lexicographical database, making explicit the variations in French-based creoles, as well as helping to normalize the written form of this language; 2/ an annotated corpus that can be used for further linguistic research and NLP applications.
Existing lexicographical resources

History and precursors:
• [19th century - 1885] Lafcadio Hearn (Creole proverbs) | bilingual lexicons [Hazaël-Massieux: 2002]
• 1. Haitian / Indian Ocean Creole languages
• 2. Caribbean Creoles
• Precursors: Albert Valdman, Annegret Bollée, Robert Chaudenson, Hector Poullet, Sylviane Telchid, Raphaël Confiant

Main problems of lexicographical works:
• disparity of macro- and microstructures;
• orthographic conventions → continuum;
• content → spelling variants, available examples, parts of speech, origin of entry words, (…);
• false friends ("fig" = figue), uncontrolled derivation (chokatif - chokativité), erroneous integration (abònman / abonnement);
• diatopic variation.

DECOI:
• A. Bollée (University of Bamberg) [2007] / R. Chaudenson's project [1979]
• 4 volumes (3 of French origin / 1 of unknown origin) | French word forms → creole variations
• Example entry: paille, n.f.
lou. pay, lapay, lepay n. 'paille; cosse, enveloppe, spathe (de maïs, de la canne à sucre, etc.); style et stigmates du maïs', kase lapay 'rompre une amitié' (AVaDLC); haï. pay n. 'feuille sèche, spathe, chaume, son' (ALH 1557), 'chaff, straw [dry grass], any dry combustible plant; hay; husk; bran; thatch; waste, scraps; misery, suffering, trouble; low, bad or losing card', fè pay 'to be frequent; to be numerous', fè yon moun pase pay 'to annoy, bother; to mistreat'; pay adj. 'weak, very light' (AVaHCE), kafe (a) pay, pay kafe 'café très léger' (ALH 790); gua. pay n. 'paille' (LMPT)

Importance of monolingual development:
• Arnaud Carpooran: Diksioner Morisien, Koleksion Text Kreol, Ile Maurice, 2009

KRENGLE (18,000 entries): Haitian Creole – English dictionary by Eric Kafe (University of Copenhagen). Connected to WordNet (semantic relations). Partly downloadable for free [http://www.krengle.net]

WEBSTER'S ONLINE DICTIONARY (multilingual thesaurus, 1,200 languages, Noah Porter, 1913): includes Haitian Creole and varieties of Creole, enriched by users via moderation. It recognizes several resemblances and dissemblances within the same family of Creole languages. [http://www.websters-online-dictionary.org/browse/]

DECA:
• A. Bollée / Ingrid Neumann-Holzschuch (University of Regensburg)
• Sources: Philip Baker, Albert Valdman, Robert Chaudenson
• DECOI structure
• Paper format / electronic database (XML)
• In process
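DECA pairs a paper format with an electronic XML database. Purely as an illustration, here is a minimal sketch of how an entry like the DECOI 'paille' example above might be serialized; the element and attribute names are assumptions, not DECA's actual schema.

```python
import xml.etree.ElementTree as ET

def entry_to_xml(headword, pos, glosses, variants, source):
    """Serialize one lexical entry to an XML string (toy schema)."""
    entry = ET.Element("entry", headword=headword, source=source)
    ET.SubElement(entry, "pos").text = pos
    for lang, gloss in glosses.items():
        ET.SubElement(entry, "gloss", lang=lang).text = gloss
    for variant in variants:
        ET.SubElement(entry, "variant").text = variant
    return ET.tostring(entry, encoding="unicode")

# The 'paille' entry, reduced to this toy schema:
print(entry_to_xml("pay", "n.", {"fr": "paille", "en": "straw"},
                   ["lapay", "lepay"], "DECOI"))
```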
This architecture comprises five main steps:
1. Step 1 (S1): this preliminary step aims at building a first electronic compilation from existing lexicographical resources. This implies gathering the existing resources, either digitized from paper-based dictionaries or fully electronic; it also implies identifying further available resources. This initial step is detailed in section 4.
2. Step 2 (S2): this second step aims at setting up an infrastructure enabling corpora building and feeding; this means identifying resources available on the web, as well as setting up automatic procedures to retrieve these documents on a regular basis. It is detailed in section 5.1.
3. Steps 3 and 4 (S3-S4): morphosyntactic analysis of the corpora to maintain the existing dictionary. Automatic analysis of the corpora will surface unknown words, some of which will have to be included in the initial dictionary; iterating this procedure will gradually complete the existing dictionary, while also enabling morphosyntactic annotation of the corpora. This step is detailed in section 5.2 (a minimal sketch of the loop follows this list).
4. Step 5 (S5): this step, outside the scope of this paper, will be implemented as soon as the dictionary is sufficiently complete; the annotated corpus could then be validated and queried using linguistic and statistical tools, so as to improve the information in the dictionary. It is outlined in section 5.3.
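The S3-S4 loop is simple enough to sketch in a few lines. The code below is not the authors' implementation: it assumes the compiled dictionary has already been loaded as a mapping from word form to linguistic information (with at least a "pos" field) and uses a deliberately naive regex tokenizer.

```python
from collections import Counter
import re

def tag_corpus(text, dictionary):
    """Tag every token against the dictionary; collect unknown forms.

    `dictionary` maps a word form to its linguistic information,
    e.g. {"pay": {"pos": "n."}}. Tokenization is a naive regex
    split, for illustration only.
    """
    tokens = re.findall(r"\w+", text.lower(), re.UNICODE)
    tagged, unknown = [], Counter()
    for tok in tokens:
        info = dictionary.get(tok)
        if info is None:
            unknown[tok] += 1
            tagged.append((tok, "UNK"))
        else:
            tagged.append((tok, info["pos"]))
    return tagged, unknown

def maintenance_iteration(text, dictionary, top_n=100):
    """One S3-S4 iteration: tag the corpus, then return the most
    frequent unknown forms for lexicographers to review and add."""
    _, unknown = tag_corpus(text, dictionary)
    return unknown.most_common(top_n)
```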
Compilation of existing lexicographical resources (S1)

Objectives:
- retrieve available (free) data from the web;
- two main sources: paper-based and electronically-based dictionaries.

Steps followed:
- compilation of available resources;
- scripts to retrieve data and combine them into a unique dictionary, keeping track of source (a minimal sketch follows below).

Electronically-based dictionaries: about 2,000 entries
- purpose of the dictionaries: mainly educational or popularisation;
- small lexicons;
- weak linguistic information (mostly reduced to entry and definition / translation);
- spelling variants.

Paper-based dictionaries: about 20,000 entries
- less numerous;
- contain much more linguistic information (POS, examples, translation, spelling variants, cross-references, semantic relations, etymology);
- generally bilingual resources;
- require more complex data processing.
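A minimal sketch of the combination step for S1, assuming each resource has already been parsed into a headword-to-fields mapping; the field names are illustrative, not the project's actual schema. Keeping the source of every field allows later arbitration when resources diverge (for example on spelling variants).

```python
def merge_resources(resources):
    """Combine several lexicons into one dictionary, keeping track
    of which source contributed which information.

    `resources` is a list of (source_name, entries) pairs; `entries`
    maps a headword to the fields that resource provides.
    """
    merged = {}
    for source_name, entries in resources:
        for headword, fields in entries.items():
            record = merged.setdefault(headword, {"sources": {}})
            record["sources"][source_name] = fields
    return merged

# A small electronic lexicon plus a digitized paper dictionary:
electronic = {"pay": {"translation_fr": "paille"}}
paper = {"pay": {"pos": "n.", "translation_fr": "paille",
                 "variants": ["lapay", "lepay"]}}
dictionary = merge_resources([("web_lexicon", electronic),
                              ("paper_dict", paper)])
```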
Corpus building and dictionary maintenance (S2-S4)

Web corpora can help building and maintaining lexicographical resources (Kilgarriff, 2003; Baroni, 2009, for example; and, for minority languages, Scannell, 2007). Our project will build its dictionary not only from existing electronic resources, but also by morphosyntactically analyzing corpora (with the help of the previously set-up dictionary).

Steps:
- corpora downloading (on a regular basis);
- automatic morphosyntactic analysis;
- manual dictionary improvement.

Iterative process:
1/ extracting unknown words as a first step;
2/ checking the meanings associated with recognized words;
=> this will end up with a more exhaustive dictionary, following Zipf's law.

Three main corpora:
- first: 1 million words, set up to tune the morphosyntactic analysis and the dictionary improvement process, with general-purpose language in mind;
- second: the Bible, to complete and tune the dictionary with a specific vocabulary;
- third: an evolving corpus, to maintain the dictionary and set up an adequate environment for lexicographers (a minimal downloader sketch for this evolving corpus follows this list).
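For the third, evolving corpus, regular downloading can be driven by the RSS feeds themselves. A standard-library sketch, assuming RSS 2.0 feeds; the feed URL and the download_and_clean stage are hypothetical placeholders, and a production crawler would add error handling and politeness delays.

```python
import urllib.request
import xml.etree.ElementTree as ET

def fetch_rss_links(feed_url):
    """Download an RSS 2.0 feed and return the article links it lists."""
    with urllib.request.urlopen(feed_url) as response:
        tree = ET.parse(response)
    return [item.findtext("link")
            for item in tree.iter("item")
            if item.findtext("link")]

# Run on a schedule (e.g. cron) to feed the evolving corpus:
# for url in fetch_rss_links("https://example.org/creole-news.rss"):
#     download_and_clean(url)  # hypothetical next stage of the pipeline
```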
Corpora built:
- First corpus: 15 different sources (Haitian Creoles), either manually or automatically downloaded («Voa News», «Alter Presse»); 1,208,862 words.
- Second corpus: the Bible in Haitian Creole (BibleGateway), downloaded with httrack; about 4 million words.
- Third corpus: RSS feeds and newspaper sites; continuous downloading, in progress.

Retrieval and cleaning of web pages (Cartier, 2007, 2009):
- convert HTML files into text format and to UTF-8;
- identification and retrieval of the textual zone of interest;
- morphosyntactic tagging;
- postprocessing: unknown-word frequency, CQP search engine, etc.

Word frequency of unknown words: identify unknown words sorted by frequency (see Cartier, 2007 for details).

Results:

Corpus                      | Recognized words  | Unknown tokens   | Unknown words / unique words
Corpus 1 (1,208,862 tokens) | 631,984 (52.3%)   | 576,878 (47.7%)  | 345,653 (28.6%) / 243,892 (70.55%)
Corpus 2 (4,505,442 tokens) | 3,722,628 (82.6%) | 782,814 (17.4%)  | 343,657 (7.6%) / 336,896 (98.03%)

Analysis:
1/ Quick lexicographic coverage of the dictionary: from 28.6% unknown words, about 70% of them unique forms (first corpus), down to 7.6%, about 98% of them unique forms (second corpus)
=> whereas a relatively small corpus is enough to cover 90% of the lexicographic entries, only a really huge corpus tends towards 100% coverage (Zipf's law).
2/ Dictionary coverage: 123,245 unique word forms; coverage to be tuned with:
- phrases (about 50% of the vocabulary; see Sag et al., 2002, for example);
- unknown words without linguistic information;
- linguistic information of known words.
3/ Unknown words: it would be necessary to have automatic procedures that track meaning evolutions rather than only the existence of word forms. The unknown words break down into:
- words from other languages (specifically from English, in our case);
- misspelled words;
- proper names;
- specific notations;
- real unknown words.
=> The resulting list has been included in the dictionary without any linguistic information, except for a small part of it, with information taken from web-based dictionaries (which could not be retrieved globally, but can be used for individual word search) or paper-based dictionaries (see Carrión, 2011 for the list of these dictionaries). For each of these words we have decided, in a first step, to include only the part of speech, the translation into French and English, and varieties if applicable. A sketch combining page cleaning and unknown-word frequency counting follows.
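The page-cleaning and unknown-word frequency stages can be approximated with the standard library alone. This is a crude stand-in: the actual zone-of-interest extraction of Cartier (2007, 2009) is more sophisticated than stripping scripts and styles from the whole page.

```python
from collections import Counter
from html.parser import HTMLParser
import re

class TextExtractor(HTMLParser):
    """Collect visible text from an HTML page, skipping script/style."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)

def html_to_text(raw_bytes, encoding="utf-8"):
    """Convert a downloaded HTML page to plain UTF-8 text."""
    parser = TextExtractor()
    parser.feed(raw_bytes.decode(encoding, errors="replace"))
    return " ".join(parser.chunks)

def unknown_word_frequencies(text, dictionary):
    """Frequency table of tokens absent from the dictionary."""
    tokens = re.findall(r"\w+", text.lower(), re.UNICODE)
    return Counter(tok for tok in tokens if tok not in dictionary)
```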
Conclusions

On-going project whose goals are to:
1/ make explicit a methodology to improve NLP and linguistics research for minority languages, focusing on the American-Caribbean French-based Creoles;
2/ set up procedures and an environment for lexicographers to store and maintain lexicographical data from existing resources and web corpora.

Results / Perspectives:
1/ Compilation of existing lexicographical resources: about 20,000 lexicographical entries, but with complex-to-solve problems: quality and quantity diverge from one resource to another, and macro- and microstructures are far from unified. => Discussions under way to work with the DECA team.
2/ Dictionary completion and maintenance through web corpora analysis: => in the course of setting up a web-based environment to maintain the dictionary and study lexicographical phenomena through several iterative processes:
- live corpus download (BootCat, RSS Corpus Builder);
- cleaning and normalization of web pages;
- POS tagging with existing lexicographical resources;
- web-based environment for lexicographers to study corpus-based lexicological phenomena (IMS Open Corpus Workbench).

This project has generated two crucial elements for the American-Caribbean creoles: a POS-annotated corpus and a POS tagger. These data and tools will soon be released as open source.