The Database of Modern Icelandic Inflection (DMII) stores the full inflectional paradigms of Icelandic words, comprising over 5.8 million inflected forms. The DMII aims to represent Icelandic inflection accurately, without overgeneration, by explicitly listing all inflected forms and variants. A rule-based system was not feasible: the available data were insufficient, and rules tend to overgenerate. The DMII supports language technology projects and is accessible online to the general public.
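A full-form database like the DMII is, at its core, a lookup table from surface forms to lemma-plus-tag analyses. A minimal sketch in Python; the data structure and tag strings are illustrative assumptions and do not follow the DMII's actual format:

```python
from collections import defaultdict

# Hypothetical miniature of a full-form database: each inflected form
# maps to (lemma, grammatical tag). The forms below are the singular
# paradigm of the Icelandic noun "hestur" (horse); the tag strings are
# simplified for illustration.
forms = defaultdict(list)
for form, lemma, tag in [
    ("hestur", "hestur", "noun-masc-nom-sg"),
    ("hest",   "hestur", "noun-masc-acc-sg"),
    ("hesti",  "hestur", "noun-masc-dat-sg"),
    ("hests",  "hestur", "noun-masc-gen-sg"),
]:
    forms[form].append((lemma, tag))

def analyze(word):
    """Return every (lemma, tag) analysis recorded for a surface form."""
    return forms.get(word, [])

print(analyze("hesti"))  # [('hestur', 'noun-masc-dat-sg')]
```

Listing every form explicitly, as the DMII does, trades storage for the guarantee that nothing unattested is ever generated.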
Categorizing and POS tagging with NLTK Python (Janu Jahnavi)
https://www.learntek.org/blog/categorizing-pos-tagging-nltk-python/
https://www.learntek.org/
Learntek is a global online training provider offering courses on Big Data Analytics, Hadoop, Machine Learning, Deep Learning, IoT, AI, Cloud Technology, DevOps, Digital Marketing, and other IT and management subjects.
How to build language technology resources for the next 100 years (Guy De Pauw)
The document discusses how to build sustainable language technology resources for lesser-resourced languages over the next 100 years. It outlines a vision of linguistic diversity and language survival. Key challenges include limited resources, small language communities, and technological limitations. Approaches proposed to work around these include minimizing redundant work, maximizing reuse of resources, building user and developer communities, and preparing resources to work with future technologies. Specific topics covered are types of language technology resources, issues around character encoding, text input methods, and future-proofing keyboard layouts and recognition technologies for many languages.
Issues in Designing a Corpus of Spoken Irish (Guy De Pauw)
The document discusses the design of a corpus of spoken Irish. It outlines the linguistic background of Irish and issues with existing spoken language resources. It then describes the pilot corpus, including data collection from podcasts and conversations. Transcription guidelines were adapted from CHAT and LDC conventions to balance accuracy with transcription speed. The goal is to create a large, balanced corpus to support research and language preservation.
Natural Language Processing for Amazigh Language (Guy De Pauw)
The document discusses natural language processing challenges for the Amazigh (Berber) language. It outlines Amazigh language characteristics like its writing system and complex phonology/morphology. It then describes the current state of Amazigh NLP technology, including Tifinaghe encoding, optical character recognition tools, basic processing tools like transliterators and stemmers, and language resources like corpora and dictionaries. Finally, it proposes future directions such as developing larger corpora, machine translation systems, and growing human resources for Amazigh language technology.
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming (Guy De Pauw)
This document discusses learning Amharic verb morphology using inductive logic programming (ILP). Amharic verbs are complex, conveying information about subject, object, tense, aspect, mood and more through affixation, reduplication and compounding. The authors apply ILP to learn morphological rules from a training set of 216 Amharic verbs. They achieve 86.9% accuracy on a test set of 1,784 verb forms. Key challenges include a lack of similar examples in the training data and learning inappropriate alternation rules. This work contributes to advancing the automatic learning of morphology for under-resourced languages like Amharic.
Natural Language Processing (NLP) is often taught at the academic level from the perspective of computational linguists. However, as data scientists, we have a richer view of the world of natural language - unstructured data that by its very nature has important latent information for humans. NLP practitioners have benefitted from machine learning techniques to unlock meaning from large corpora, and in this class we’ll explore how to do that particularly with Python, the Natural Language Toolkit (NLTK), and to a lesser extent, the Gensim Library.
NLTK is an excellent library for machine learning-based NLP, written in Python by experts from both academia and industry. Python allows you to create rich data applications rapidly, iterating on hypotheses. Gensim provides vector-based topic modeling, which is currently absent in both NLTK and Scikit-Learn. The combination of Python + NLTK means that you can easily add language-aware data products to your larger analytical workflows and applications.
Information Extraction, Named Entity Recognition, NER, text analytics, text mining, e-discovery, unstructured data, structured data, calendaring, standard evaluation per entity, standard evaluation per token, sequence classifier, sequence labeling, word shapes, semantic analysis in language technology
- The document discusses different approaches to defining word meaning, including lexicographic traditions of enumerating senses in dictionaries, ontological approaches using taxonomies of concepts, and distributional approaches using vector representations based on word context.
- It covers challenges with the traditional word sense disambiguation task, such as the skewed distribution of word senses and implicit disambiguation in context. Dimensionality reduction techniques and models like word2vec are discussed as distributional methods to learn word vectors from large corpora that capture semantic relationships.
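The distributional idea in the second bullet can be sketched in a few lines: build context-count vectors from a toy corpus and compare them with cosine similarity. Models like word2vec learn dense vectors instead of raw counts, but the intuition is the same:

```python
import math
from collections import Counter

def context_vector(corpus, target, window=2):
    """Count words co-occurring with `target` within a +/- `window` span."""
    vec = Counter()
    for sent in corpus:
        for i, w in enumerate(sent):
            if w == target:
                lo = max(0, i - window)
                vec.update(c for j, c in enumerate(sent[lo:i + window + 1], lo)
                           if j != i)
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    norm = lambda x: math.sqrt(sum(c * c for c in x.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "the cat chased the dog".split(),
]
# "cat" and "dog" share contexts ("the", "sat", ...), so their vectors
# end up similar -- words in similar contexts get similar representations.
print(cosine(context_vector(corpus, "cat"), context_vector(corpus, "dog")))
```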
This document discusses ontologies in computer science from the perspective of description logics. It begins by defining ontologies in philosophy and computer science. Description logics are introduced as a family of knowledge representation languages used to formally specify ontologies. Description logics allow concepts and roles to be built using constructors like conjunction, disjunction, and quantification. Description logic knowledge bases consist of a TBox containing terminology and an ABox containing facts about individuals. Reasoning tasks like satisfiability can be used to check a knowledge base for internal consistency.
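The flavor of TBox reasoning can be illustrated with the simplest possible case: atomic concept inclusions only, where subsumption reduces to reachability in the inclusion graph. Real description logics add constructors (conjunction, disjunction, quantification) and correspondingly harder reasoning; the concept names below are invented for illustration:

```python
# Toy TBox: atomic concept inclusions C ⊑ D, stored as a graph.
TBOX = {
    "Dog": {"Mammal"},
    "Cat": {"Mammal"},
    "Mammal": {"Animal"},
}

def subsumes(d, c, tbox):
    """Does the TBox entail C ⊑ D? For atomic inclusions this is
    just reachability from C to D in the inclusion graph."""
    stack, seen = [c], set()
    while stack:
        cur = stack.pop()
        if cur == d:
            return True
        if cur not in seen:
            seen.add(cur)
            stack.extend(tbox.get(cur, ()))
    return False

print(subsumes("Animal", "Dog", TBOX))  # True: Dog ⊑ Mammal ⊑ Animal
```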
Natural language processing (NLP) is a field that develops techniques to allow computers to analyze, understand, and generate human language. NLP aims to address challenges in areas like information extraction, automatic summarization, and dialogue systems. OpenNLP is an open source Java toolkit that provides common NLP tasks like sentence detection, tokenization, part-of-speech tagging, named entity recognition, and parsing.
Automatic Annotation Of Bibliographical References With Target (Sean Flores)
This document discusses automatically annotating bibliographical references with the target language described. It presents the problem as a special case of information extraction where the task is to classify short text snippets into one of around 7,000 possible language classes. The document describes the data available, including a database of over 7,000 languages/language names and bibliographic references in various languages. It explores using this data and distributional characteristics of the domain to develop supervised and unsupervised approaches to solve the annotation problem.
The document provides an introduction to natural language processing (NLP), discussing key related areas and various NLP tasks involving syntactic, semantic, and pragmatic analysis of language. It notes that NLP systems aim to allow computers to communicate with humans using everyday language and that ambiguity is ubiquitous in natural language, requiring disambiguation. Both manual and automatic learning approaches to developing NLP systems are examined.
The document provides information on various natural language processing (NLP) techniques for text data preprocessing, modeling, and analysis. It includes code snippets and explanations for tokenizing text, removing stop words, stemming, lemmatization, bag-of-words modeling, Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), TF-IDF, and Doc2Vec. It also discusses storing preprocessed text data in dictionaries or DataFrames and evaluating model similarities to find related documents.
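As a concrete instance of that pipeline, here is a dependency-free sketch of tokenization, stop-word removal, and TF-IDF weighting. The stop-word list is a toy stand-in (NLTK ships fuller lists, stemmers, and lemmatizers), and the TF-IDF variant shown is the plain term-frequency-times-log-IDF form:

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}  # toy list

def preprocess(text):
    """Lowercase, tokenize on letter runs, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower())
            if t not in STOPWORDS]

def tfidf(docs):
    """Return one {term: tf-idf weight} dict per document."""
    tokenized = [preprocess(d) for d in docs]
    n = len(tokenized)
    df = Counter(t for doc in tokenized for t in set(doc))
    weights = []
    for doc in tokenized:
        tf = Counter(doc)
        weights.append({t: tf[t] / len(doc) * math.log(n / df[t])
                        for t in tf})
    return weights

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs"]
w = tfidf(docs)
```

Terms occurring in every document get weight zero, which is the point: TF-IDF rewards terms that discriminate between documents.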
Persistent Identifiers, Herbarium workshop at Kongsvold, September 1 to 4, 2014 (Dag Endresen)
Implementation of persistent and globally unique identifiers for specimens held in natural history collections worldwide will open up new opportunities for referring to these physical resources in an interlinked digital context such as the Internet. Here, we will describe the approach for persistent identification of collection specimens developed and implemented at the Natural History Museum in Oslo (NHM-UiO) by the Norwegian participant node to the Global Biodiversity Information Facility (GBIF-Norway). The Norwegian university museums are invited to use our resolver service at "http://purl.org/gbifnorway/id/<uuid>" when publishing biodiversity data to GBIF. All occurrence records published through GBIF-Norway, with appropriate PURL-UUID identifiers mapped to the Darwin Core occurrenceID, will automatically be added to our resolver service and kept updated.
A presentation of the main information retrieval (IR) models.
A presentation of our submission to TREC KBA 2014 (entity-oriented information retrieval), in partnership with the Kware company (V. Bouvier, M. Benoit).
Adnan: Introduction to Natural Language Processing (Mustafa Jarrar)
This document provides an introduction to natural language processing (NLP). It discusses key topics in NLP including languages and intelligence, the goals of NLP, applications of NLP, and general themes in NLP like ambiguity in language and statistical vs rule-based methods. The document also previews specific NLP techniques that will be covered like part-of-speech tagging, parsing, grammar induction, and finite state analysis. Empirical approaches to NLP are discussed including analyzing word frequencies in corpora and addressing data sparseness issues.
Getting Started with the Hymenoptera Anatomical Ontology (Katja C. Seltmann)
For the Biodiversity Informatics workshop in Stockholm, Friday, September 18. Describes the Hymenoptera Anatomical Ontology. Authors: Matthew Yoder, Andrew Deans, Katja Seltmann, István Mikó, Matthew Bertone
The document discusses formal language theory and its applications in natural language processing (NLP). It covers two main goals in computational linguistics - theoretical interest in formally characterizing natural language and practical interest in using well-understood frameworks like finite state models to solve NLP problems. Finite state devices are widely used in NLP tasks due to their efficiency and ability to model linguistic phenomena like words through dictionaries and rules. While finite state models provide a useful approximation of language, natural languages pose challenges like ambiguity, long distance dependencies and non-regular features that require extensions to basic finite state models.
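The "dictionary as finite-state device" idea can be sketched by compiling a word list into a trie, which is simply a deterministic finite automaton whose states are nested dicts. Real finite-state toolkits minimize the automaton and attach rules; this sketch skips both:

```python
def build_trie_dfa(words):
    """Compile a word list into a trie-shaped DFA: each dict is a state,
    each character key is a transition, "<final>" marks accepting states."""
    trie = {}
    for w in words:
        node = trie
        for ch in w:
            node = node.setdefault(ch, {})
        node["<final>"] = True
    return trie

def accepts(dfa, word):
    """Run the automaton over the word; accept only in a final state."""
    node = dfa
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return node.get("<final>", False)

dfa = build_trie_dfa(["walk", "walks", "walked", "walking"])
print(accepts(dfa, "walked"), accepts(dfa, "walkeds"))  # True False
```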
Lecture 6: Data-Intensive Computing for Text Analysis, Fall 2011 (Matthew Lease)
This document provides an overview of automatic spelling correction techniques. It discusses using character n-grams to represent words and generate spelling correction candidates based on edit distance. An information retrieval model is used where the misspelled word is the query and candidate corrections are documents. Inverted indexing is discussed as a way to implement this model using MapReduce. Reducers concatenate document IDs for each term to build the final inverted index, with optimizations needed to handle large document frequencies.
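The candidate-generation-plus-ranking scheme described above can be sketched as follows: shared character bigrams act as a crude inverted-index lookup to retrieve candidates, and Levenshtein edit distance ranks them. The MapReduce indexing stage is omitted, and the lexicon is an invented toy:

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming over rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def char_ngrams(word, n=2):
    """Character n-grams with boundary padding, e.g. '#cat#' -> bigrams."""
    padded = f"#{word}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def correct(misspelled, lexicon, max_dist=2):
    """Retrieve lexicon words sharing a bigram with the query (the
    'inverted index' step), then rank survivors by edit distance."""
    query = char_ngrams(misspelled)
    candidates = [w for w in lexicon if query & char_ngrams(w)]
    scored = [(edit_distance(misspelled, w), w) for w in candidates]
    return min((s for s in scored if s[0] <= max_dist), default=None)

lexicon = ["spelling", "spilling", "selling", "smelling"]
print(correct("speling", lexicon))  # (1, 'spelling')
```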
The document discusses several key topics in natural language processing and computational linguistics:
1. It defines the basic units of language like words, tokens, types and texts.
2. It describes techniques for extracting text from various sources like files, web pages and corpora and preprocessing the text by removing HTML tags and normalizing whitespace.
3. It discusses empirical observations about word frequencies: Zipf's law, which states that a small number of words occur very frequently while most words occur rarely, and Heaps' law, which states that vocabulary size grows sublinearly with the length of the text.
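Both laws are easy to observe empirically. A small sketch that tabulates rank-frequency products (roughly constant under Zipf's law) and tracks vocabulary growth (sublinear under Heaps' law) on a toy token stream:

```python
from collections import Counter

def zipf_table(tokens, top=5):
    """Most frequent words as (rank, word, frequency, rank*frequency)."""
    ranked = Counter(tokens).most_common(top)
    return [(r, w, f, r * f) for r, (w, f) in enumerate(ranked, 1)]

def heaps_curve(tokens, step=5):
    """Observed vocabulary size after every `step` tokens."""
    seen, curve = set(), []
    for i, tok in enumerate(tokens, 1):
        seen.add(tok)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve

tokens = ("the cat sat on the mat and the dog sat on the log "
          "while the cat and the dog watched the quick brown fox").split()
print(zipf_table(tokens, top=3))
print(heaps_curve(tokens))  # [(5, 4), (10, 7), (15, 9), (20, 10)]
```

Even in 24 tokens the skew shows: "the" alone accounts for 7 occurrences, and each successive 5-token slice adds fewer new vocabulary items than the last.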
Horst Goes Pop - Wieviel Musikempfehlung braucht der Mensch (Stephan Baumann)
Holistic Recommendation and StoryTelling Technology (HORST): A blueprint, presented at the Future Music Barcamp at Pop Academy Mannheim Germany. Received lots of hate and love. This is NOT an academic in-depth talk although the scientific background is available in detail upon request. Enjoy!
Neural Text Embeddings for Information Retrieval, WSDM 2017 (Bhaskar Mitra)
The document describes a tutorial on using neural networks for information retrieval. It discusses an agenda for the tutorial that includes fundamentals of IR, word embeddings, using word embeddings for IR, deep neural networks, and applications of neural networks to IR problems. It provides context on the increasing use of neural methods in IR applications and research.
Semantics of and for the diversity of life: Opportunities and perils of tryi... (Hilmar Lapp)
The document discusses using ontologies and reasoning to analyze biodiversity data at scale. It proposes using phyloreferences - a universal coordinate system for life based on phylogenetic trees - to unambiguously refer to taxa. It also describes representing phylogenetic clade definitions and the Tree of Life as an ontology, allowing taxon names and relationships to be rendered computable. The document outlines how ontologies and reasoning could be used to infer missing phenotypic data, synthesize knowledge about trait evolution, and generate hypotheses about genes involved in morphological diversity by linking model organism and comparative data. Scaling reasoning to the size of biodiversity knowledge is discussed, along with a proposed architecture using specialized reasoners and triplestores.
Technological Tools for Dictionary and Corpora Building for Minority Language... (Guy De Pauw)
This project aims to build and maintain a lexicographical resource for French-based Creole languages through three main steps:
1) Compiling existing lexicographical resources like dictionaries into an electronic format
2) Creating corpora of Creole language texts from literary, educational and journalistic sources online
3) Maintaining the dictionary by analyzing the corpora to identify unknown words and improve the database through an iterative process.
The results will be a lexicographical database detailing variation across French-based Creoles and an annotated corpus for linguistic research.
Semi-automated extraction of morphological grammars for Nguni with special re... (Guy De Pauw)
This document summarizes research that semi-automatically extracted a morphological grammar for Southern Ndebele, an under-resourced language, from a general Nguni morphological analyzer bootstrapped from a Zulu analyzer. The Southern Ndebele analyzer produced surprisingly good results, showing significant similarities across Nguni languages that can accelerate documentation and resource development for these languages. The project followed best practices for encoding resources to ensure sustainability, access, and adaptability to future formats and platforms.
Resource-Light Bantu Part-of-Speech Tagging (Guy De Pauw)
This document proposes a bag-of-substrings approach to part-of-speech tagging for under-resourced Bantu languages, using available digital dictionaries and word lists instead of large annotated corpora. The method extracts substring features from words to train a maximum entropy classifier and bootstrap POS tagging for Bantu languages that lack extensive annotated resources. Experimental results show that the technique provides a low-resource, high-accuracy method for bootstrapping POS tagging that compares favorably to state-of-the-art data-driven approaches.
POS Annotated 50m Corpus of Tajik Language (Guy De Pauw)
This document presents a new 50+ million word corpus of the Tajik language, the largest available. It was created by crawling over a dozen Tajik news websites and other sources. The texts were joined and cleaned to remove duplicates. The corpus was then annotated with morphological analysis of Tajik using a new analyzer created by modifying an existing one to be faster and allow lemmatization. The analyzer recognizes over 87% of words and tags them with part of speech. This annotated corpus containing lemmas, tags and frequencies is available online through the Sketch Engine for researchers.
Describing Morphologically Rich Languages Using Metagrammars: a Look at Verbs ... (Guy De Pauw)
This document describes using a metagrammar called XMG to formally capture morphological generalizations of verbs in the Ikota language. It provides an XMG specification for Ikota verbal morphology that describes subject, tense, verb root, aspect, active voice, and proximity. This specification can automatically derive a lexicon of fully inflected verb forms. The methodology allows for quickly testing ideas and validating results against language data.
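The core move, deriving a full lexicon from a compact slot-and-filler specification, can be sketched with a Cartesian product over morpheme slots. The morphemes and glosses below are invented placeholders, not actual Ikota data or XMG syntax:

```python
from itertools import product

# Hypothetical slot inventory: a verb form is subject marker + root +
# tense suffix. Three slots with 3, 2, and 2 fillers yield 12 forms.
SLOTS = [
    {"na-": "1sg", "o-": "2sg", "a-": "3sg"},   # subject agreement
    {"tem": "write", "som": "read"},             # verb root
    {"-a": "present", "-aka": "past"},           # tense
]

def derive_lexicon(slots):
    """Cartesian product over the slots yields every inflected form,
    mapped to its morpheme-by-morpheme gloss."""
    lexicon = {}
    for combo in product(*(s.items() for s in slots)):
        form = "".join(m.strip("-") for m, _ in combo)
        gloss = " ".join(g for _, g in combo)
        lexicon[form] = gloss
    return lexicon

lex = derive_lexicon(SLOTS)
print(len(lex))  # 12
print(lex["natema"])  # 1sg write present
```

A metagrammar adds constraints between slots (e.g. a tense choice restricting the aspect slot) on top of this plain product, which is what makes the formal specification worth the effort.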
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...Guy De Pauw
This document describes using a metagrammar called XMG to formally capture morphological generalizations of verbs in the Ikota language. It provides an XMG specification for Ikota verbal morphology that describes subject, tense, verb root, aspect, active voice, and proximity. This specification can automatically derive a lexicon of fully inflected verb forms. The methodology allows for quickly testing ideas and validating results against language data.
Tagging and Verifying an Amharic News CorpusGuy De Pauw
This document summarizes an Amharic news corpus tagging and verification project. It discusses the Amharic language background, the corpus creation from Ethiopian news sources, the manual tagging process, previous tagging experiments, and the current efforts to clean and re-tag the corpus which involves removing errors and inconsistencies from the original tagging. Baseline tagging performance on the corpus using different part-of-speech tagsets ranges from 58.3% to 90.8% correct depending on the tagset and machine learning approach used.
This document describes the process of constructing a corpus of spoken and written Santome, a Portuguese-related creole language spoken in Sao Tome and Principe. The corpus contains over 184,000 words from written sources like newspapers and books, as well as transcribed spoken recordings. Efforts were made to standardize the orthography and develop part-of-speech tags for annotation. Metadata is encoded for each text, and the corpus will be made available through a concordancing tool to allow searches while copyright permissions are obtained. The goal is for this and related Gulf of Guinea creole corpora to enable comparative linguistic research.
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...Guy De Pauw
The document describes a system for automatically structuring and correcting Hungarian clinical records. The system separates clinical records into structured XML elements, tags metadata, and separates text from tables. It also corrects spelling errors using language models and weighted edit distances to generate and score candidate corrections. Evaluation showed the system could provide the right correction in the top 5 suggestions for 99% of errors. Areas for improvement include handling insertion/deletion errors and using larger language resources to better handle non-standard usage.
Compiling Apertium Dictionaries with HFSTGuy De Pauw
This document discusses compiling Apertium dictionaries with HFST to leverage generalised compilation formulas and get more applications from fewer language descriptions. Compiling Apertium dictionaries natively in HFST provides benefits like uniform compilation across tools, improved resulting automata using HFST algorithms, and integrated complex finite-state morphology features. Additional applications like spellcheckers can also be automatically generated from the dictionaries.
Towards Standardizing Evaluation Test Sets for Compound AnalysersGuy De Pauw
The document proposes standardizing test sets for evaluating compound analyzers by establishing parameters for a standard test set. It discusses evaluating compound analyzers on different sized test sets containing compound words, non-compound words, and error words. Experiments compare analyzer performance on test sets of varying sizes, finding sizes below 250 words are too small and sizes above 1250 words show no significant differences in results. The proposed standard test set consists of 500 examples each of compounds, non-compounds, and errors for a total of 1500 words.
The PALDO Concept - New Paradigms for African Language Resource DevelopmentGuy De Pauw
The document discusses new paradigms for developing African language resources through the Pan African Living Dictionary Online (PALDO) project. The paradigms include open community participation under scholarly supervision, paying for data development and making the data freely available, and linking monolingual dictionaries for multiple languages by concept to create rich resources for each language.
A System for the Recognition of Handwritten Yorùbá CharactersGuy De Pauw
1. The document presents a system for recognizing handwritten Yoruba characters using a Bayesian classifier and decision tree approach.
2. Key stages of the system include preprocessing, segmentation, feature extraction, Bayesian classification, decision tree processing, and result fusion.
3. The system was tested on independent and non-independent character samples, achieving recognition rates of 91.18% and 94.44% respectively.
IFE-MT: An English-to-Yorùbá Machine Translation SystemGuy De Pauw
The document describes the development of an English to Yoruba machine translation system called IFE-MT. It discusses the theoretical and practical issues in building the system, including the differences between the languages. It outlines the data collection and annotation process. It also describes the software tools and modules used to implement the system and demonstrates its capabilities. The system is being further developed by expanding the database and evaluating the translations.
A Number to Yorùbá Text Transcription SystemGuy De Pauw
The document describes a system for converting Arabic numbers to their Yoruba lexical equivalents. It discusses Yoruba numerals and their derivation using addition, subtraction and multiplication. A computational model is presented using pushdown automata to capture the number conversion. The system was implemented in Python and evaluated using Mean Opinion Score testing. Examples of number conversions like 19679 are provided to demonstrate the system.
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...Guy De Pauw
This document discusses experiments conducted on developing an English to Amharic statistical machine translation system. It summarizes the collection and processing of two parallel corpora: (1) An ENA news corpus containing over 35,000 documents and 500,000 words aligned at the sentence level. (2) A parliamentary corpus containing over 1,200 documents and 500,000 words aligned at the sentence level. The document outlines challenges in aligning the corpora and reports on automatic alignment results. It concludes that further increasing corpus size and integrating linguistic knowledge can help improve translation quality.
Towards Standardizing Evaluation Test Sets for Compound AnalysersGuy De Pauw
The document proposes standardizing test sets for evaluating compound analyzers by establishing parameters for a standard test set. It discusses evaluating compound analyzers on different sized test sets containing compound words, non-compound words, and error words. Experiments compare analyzer performance on test sets of varying sizes, finding sizes below 250 words are too small and sizes above 1250 words show no significant differences in results. The proposed standard test set consists of 500 examples each of compounds, non-compounds, and errors for a total of 1500 words.
This document discusses document clustering in the Amharic language for information browsing and retrieval. It introduces the challenges of searching and accessing information in Amharic due to the growing amount of digital documents. The document then describes the process of document clustering, which groups documents based on similarities to organize information. Key steps in the clustering process include document preprocessing, vector representation, and hierarchical clustering. Experimental results show that tuning the global support threshold is important for creating the desired hierarchy, and stemming affects cluster overlap. Future work could involve developing standard Amharic language resources and comparing different clustering and information retrieval methods.
Modeling Improved Syllabification Algorithm for AmharicGuy De Pauw
The document describes a proposed improved syllabification algorithm for Amharic text. It begins with an introduction to syllables and syllabification. It then discusses Amharic syllable structure, including allowable templates, gemination, and consonant clusters. The authors designed a rule-based syllabification model with five modules: transliteration, gemination handling, epenthesis, syllabification, and stress assignment. The model is based on expert linguistic knowledge of Amharic phonology and utilizes syllable templates, gemination rules, epenthesis rules, and a sonority scale to identify syllable boundaries in text.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
Driving Business Innovation: Latest Generative AI Advancements & Success StorySafe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
Building Production Ready Search Pipelines with Spark and MilvusZilliz
Spark is the widely used ETL tool for processing, indexing and ingesting data to serving stack for search. Milvus is the production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data to extract vector representations, and push the vectors to Milvus vector database for search serving.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfMalak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceIndexBug
Imagine a world where machines not only perform tasks but also learn, adapt, and make decisions. This is the promise of Artificial Intelligence (AI), a technology that's not just enhancing our lives but revolutionizing entire industries.
GraphRAG for Life Science to increase LLM accuracyTomaz Bratanic
GraphRAG for life science domain, where you retriever information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers
Best 20 SEO Techniques To Improve Website Visibility In SERPPixlogix Infotech
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
van Emden shows how Nx can simplify the developer’s life and facilitate a rapid transition from concept to production-ready applications.He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
“An Outlook of the Ongoing and Future Relationship between Blockchain Technologies and Process-aware Information Systems.” Invited talk at the joint workshop on Blockchain for Information Systems (BC4IS) and Blockchain for Trusted Data Sharing (B4TDS), co-located with with the 36th International Conference on Advanced Information Systems Engineering (CAiSE), 3 June 2024, Limassol, Cyprus.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides introduction to UiPath Communication Mining, importance and platform overview. You will acquire a good understand of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Programming Foundation Models with DSPy - Meetup SlidesZilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
Infrastructure Challenges in Scaling RAG with Custom AI modelsZilliz
Building Retrieval-Augmented Generation (RAG) systems with open-source and custom AI models is a complex task. This talk explores the challenges in productionizing RAG systems, including retrieval performance, response synthesis, and evaluation. We’ll discuss how to leverage open-source models like text embeddings, language models, and custom fine-tuned models to enhance RAG performance. Additionally, we’ll cover how BentoML can help orchestrate and scale these AI components efficiently, ensuring seamless deployment and management of RAG systems in the cloud.
Infrastructure Challenges in Scaling RAG with Custom AI models
The Database of Modern Icelandic Inflection
1. The Database of
Modern Icelandic Inflection
Kristín Bjarnadóttir
The Árni Magnússon Institute for Icelandic Studies
Reykjavík, Iceland
SaLTMIL-AfLat
Istanbul, May 22nd 2012
2. The Database of Modern Icelandic Inflection (DMII)
A database storing full inflectional forms
Topics:
• The aim in creating the DMII
• A short description of the DMII
• Why is the DMII not a productive system of rules?
  – The necessary information for a rule system was not available
  – A rule system would overgenerate wildly
• The problem of a large tag set: data scarcity
• The two aspects of the DMII: LT and online
‘Modern’ in the title is used in the sense ‘current’.
3. The Database of Modern Icelandic Inflection
A database storing full inflectional forms
Used in Language Technology.
Accessible on-line for the general public.
Headwords/paradigms:           271,400
Inflectional forms (with tag): 5,800,000
Unique word forms:             2,800,000
PoS tag set:                   700+
Inflectional classes:          630+
25% of the inflectional classes for the major word classes describe
the idiosyncratic inflection of single words.
4. The aim in creating the DMII
Showing Icelandic inflection ‘as is’
• All inflectional forms, including variants
• No overgeneration
• No overlap between inflectional classes, i.e., a word can only
belong to 1 inflectional class (including variants)
• An inflectional class is a bundle of inflectional rules for a set
of words.
• Production: A simple matrix of inflectional rules, with flags for
values for inflectional categories: +/-singular, +/-plural, etc.
5. The Database of Modern Icelandic Inflection
Icelandic inflection is rich.
Inflectional categories for the major word classes
and number of inflectional forms:

Nouns:      case (4), number (2), definiteness (2)                         =  16
Adjectives: gender (3), case (4), number (2), definiteness (2), degree (3) = 120
Verbs:                                                                     = 106
  Finite:    voice (2), mood (2), tense (2), person (3), number (2)        =  48
  Nonfinite:                                                               =  58
    Imperative:         voice (2), number (2) (+ 1 truncated form)         =   5
    Infinitive:         voice (2)                                          =   2
    Past participle:    gender (3), case (4), number (2), definiteness (2) =  48
    Present participle:                                                    =   1
    Supine:             voice (2)                                          =   2
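Where no forms coincide, the counts above are simple products of the category sizes. A quick sanity check (the dictionary names are mine, not from the slides):

```python
# Form counts as products of category sizes (nouns and finite verbs
# from the table above). Dict names are illustrative only.
from math import prod

nouns = {"case": 4, "number": 2, "definiteness": 2}
finite_verbs = {"voice": 2, "mood": 2, "tense": 2, "person": 3, "number": 2}

print(prod(nouns.values()))         # 16, as on the slide
print(prod(finite_verbs.values()))  # 48, as on the slide
```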
6.
7. DMII: Output for LT projects
16 inflectional forms of the noun köttur ‘cat’
(lemma; id; word class; domain; inflected form; tag — English gloss)

köttur;416784;kk;alm;köttur;NFET        nom.sg.indef.
köttur;416784;kk;alm;kötturinn;NFETgr   nom.sg.def.
köttur;416784;kk;alm;kött;ÞFET          acc.sg.indef.
köttur;416784;kk;alm;köttinn;ÞFETgr     acc.sg.def.
köttur;416784;kk;alm;ketti;ÞGFET        dat.sg.indef.
köttur;416784;kk;alm;kettinum;ÞGFETgr   dat.sg.def.
köttur;416784;kk;alm;kattar;EFET        gen.sg.indef.
köttur;416784;kk;alm;kattarins;EFETgr   gen.sg.def.
köttur;416784;kk;alm;kettir;NFFT        nom.pl.indef.
köttur;416784;kk;alm;kettirnir;NFFTgr   nom.pl.def.
köttur;416784;kk;alm;ketti;ÞFFT         acc.pl.indef.
köttur;416784;kk;alm;kettina;ÞFFTgr     acc.pl.def.
köttur;416784;kk;alm;köttum;ÞGFFT       dat.pl.indef.
köttur;416784;kk;alm;köttunum;ÞGFFTgr   dat.pl.def.
köttur;416784;kk;alm;katta;EFFT         gen.pl.indef.
köttur;416784;kk;alm;kattanna;EFFTgr    gen.pl.def.
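Records in this semicolon-separated layout are easy to consume downstream. A minimal parsing sketch (the field labels are my own, not official DMII terminology):

```python
# Parse DMII-style records: lemma;id;word class;domain;inflected form;tag.
# Field labels are illustrative, not official DMII terminology.
from collections import namedtuple

Entry = namedtuple("Entry", "lemma ident word_class domain form tag")

def parse_line(line):
    return Entry(*line.strip().split(";"))

rows = [
    "köttur;416784;kk;alm;köttur;NFET",
    "köttur;416784;kk;alm;ketti;ÞGFET",
    "köttur;416784;kk;alm;kettir;NFFT",
]
entries = [parse_line(r) for r in rows]
# Rebuild a paradigm as tag -> form for one lemma id:
paradigm = {e.tag: e.form for e in entries if e.ident == "416784"}
print(paradigm["ÞGFET"])  # ketti
```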
8. Creating the DMII
Inflectional class for the noun köttur ‘cat’
kk_sb_X.ur.-ar-ir.-i-inum
Set of stems: X1=kött, X2=kött, X3=kett, X4=katt

Matrix:
Nom.sg.   X1+ur    Nom.sg.def.   X1+urinn
Acc.sg.   X1+      Acc.sg.def.   X1+inn
Dat.sg.   X3+i     Dat.sg.def.   X3+inum
Gen.sg.   X4+ar    Gen.sg.def.   X4+arins
Nom.pl.   X3+ir    Nom.pl.def.   X3+irnir
Acc.pl.   X3+i     Acc.pl.def.   X3+ina
Dat.pl.   X2+um    Dat.pl.def.   X2+unum
Gen.pl.   X4+a     Gen.pl.def.   X4+anna
9. Creating the DMII
Inflectional class for the noun köttur ‘cat’
kk_sb_X.ur.-ar-ir.-i-inum
Set of stems: X1=kött, X2=kött, X3=kett, X4=katt

Paradigm:
Nom.sg.   kött+ur   Nom.sg.def.   kött+urinn
Acc.sg.   kött+     Acc.sg.def.   kött+inn
Dat.sg.   kett+i    Dat.sg.def.   kett+inum
Gen.sg.   katt+ar   Gen.sg.def.   katt+arins
Nom.pl.   kett+ir   Nom.pl.def.   kett+irnir
Acc.pl.   kett+i    Acc.pl.def.   kett+ina
Dat.pl.   kött+um   Dat.pl.def.   kött+unum
Gen.pl.   katt+a    Gen.pl.def.   katt+anna
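The production step amounts to a lookup: each matrix cell names a stem index and an ending, and concatenating them yields the paradigm. A sketch of the idea (the data structures are illustrative, not the DMII's internal format):

```python
# Apply an ending matrix to a stem set to generate a full paradigm,
# as in the köttur example. Data structures are illustrative only.
stems = {1: "kött", 2: "kött", 3: "kett", 4: "katt"}
matrix = {
    "Nom.sg.": (1, "ur"),  "Nom.sg.def.": (1, "urinn"),
    "Acc.sg.": (1, ""),    "Acc.sg.def.": (1, "inn"),
    "Dat.sg.": (3, "i"),   "Dat.sg.def.": (3, "inum"),
    "Gen.sg.": (4, "ar"),  "Gen.sg.def.": (4, "arins"),
    "Nom.pl.": (3, "ir"),  "Nom.pl.def.": (3, "irnir"),
    "Acc.pl.": (3, "i"),   "Acc.pl.def.": (3, "ina"),
    "Dat.pl.": (2, "um"),  "Dat.pl.def.": (2, "unum"),
    "Gen.pl.": (4, "a"),   "Gen.pl.def.": (4, "anna"),
}

def inflect(stems, matrix):
    """Concatenate the selected stem and ending for every cell."""
    return {cell: stems[i] + ending for cell, (i, ending) in matrix.items()}

paradigm = inflect(stems, matrix)
print(paradigm["Dat.sg."])      # ketti
print(paradigm["Gen.pl.def."])  # kattanna
```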
10. The DMII: A multitude of variants
Inflectional classes: Over 630, due to inflectional variants
Examples of variants:
The genitive singular ending –ar/–s in masculine nouns:
  þröskuldur ‘threshold’: gen. þröskuldar/þröskulds
The dative singular ending –i/–0 in masculine and neuter nouns:
  bátur ‘boat’ (masc.):      báti/bát
  fennel ‘fennel’ (neut.):   fennel/fenneli
  arsenik ‘arsenic’ (neut.): arseniki/arsenik
  ópíum ‘opium’ (neut.):     ópíum/*ópíumi
Names (3 variants per case):
  Berglind (fem.): acc.+dat.: Berglind/Berglindi/Berglindu
11. Why not a productive system of rules?
Insufficient data
Lexicographic sources:
• Extensive vocabulary.
• Partial information on inflection, i.e., principal parts only:
  köttur (nom.sg.indef.), kattar (gen.sg.indef.), kettir (nom.pl.indef.)
• The rest of the paradigm is not (necessarily) predictable.
Grammatical surveys:
• Full paradigms of selected examples.
• A set of generalized inflectional classes, showing exceptions at times (often in
notes).
• The emphasis is on the core Icelandic vocabulary, to the exclusion of loanwords,
slang and informal language.
• Long history of research (1st survey 1651)
• Strong literary tradition; strong public interest, strong tendency to purism.
• Generalization, as in the claim: “The dative ending –i is universal in
  neuter nouns”.
12. A gap in the inflectional descriptions
Dative –i/-0 in multisyllabic neuter nouns
Hvönn er svolítið lík fennel.
‘Angelica is a little bit like fennel’
. . . ásamt brytjuðu fenneli.
‘. . . with diced fennel’
. . . byrlað eitur, drepinn með arsenik.
‘. . . poisoned with arsenic’
Hann var myrtur með arseniki.
‘He was murdered with arsenic’
cf.
Hann var myrtur með ópíum/*ópíumi.
‘He was murdered with opium’
Why not a productive system of rules: (2)
It would overgenerate. The data has to be collected first.
13. Research for the DMII
The search for possible variant inflectional forms
• All available sources are used:
  • Archives, digitized text collections, corpora, Google (at a pinch)
  • Native speaker intuition (linguists and others, including on-line users)
• The search is sometimes inconclusive:
Yggdrasill (masc.) ‘the great tree’ from Norse mythology:
o Dative Yggdrasil or Yggdrasli?
o The regular dative would be Yggdrasli.
o No examples of the dative can be found in the Old Icelandic
sources; modern examples are dubious.
o The dative of the neuter noun drasl ‘rubbish’ is drasli.
o When referring to a shop in today’s Reykjavík named Yggdrasill speakers
seem to avoid the dative. (It makes people laugh.)
o Examples of Yggdrasli appear on the web, mostly in disputes on the
dative ...
14. Research for the DMII
Ambiguous inflectional forms make searching time-consuming
Inflectional forms in DMII:                         5,881,374
  Unambiguous:                            1,850,090    31.5%
  Ambiguous within 1 lemma:               3,619,482    61.5%
  Ambiguous between lemmas:                  63,641     1.1%
  Ambiguous within and between lemmas:      348,161     5.9%

Inflectional form: word form with inflectional tag.
Word forms (unique character strings):     2.8 million
  Unambiguous (unique inflectional forms): 1.8 million
  Ambiguous:                               1.0 million
Searching in unannotated text can be extremely slow.
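The four-way split above can be computed by grouping tagged forms by string. A sketch, assuming (form, lemma id, tag) triples as input (that format is my assumption, not the DMII's):

```python
# Classify word forms into the four ambiguity classes listed above:
# unambiguous / within 1 lemma / between lemmas / within and between.
# The (form, lemma_id, tag) input format is assumed for illustration.
from collections import defaultdict

def classify(entries):
    lemmas_per_form = defaultdict(list)
    for form, lemma_id, _tag in entries:
        lemmas_per_form[form].append(lemma_id)
    classes = {}
    for form, lemma_ids in lemmas_per_form.items():
        if len(lemma_ids) == 1:
            classes[form] = "unambiguous"
        elif len(set(lemma_ids)) == 1:
            classes[form] = "ambiguous within 1 lemma"
        elif len(set(lemma_ids)) == len(lemma_ids):
            classes[form] = "ambiguous between lemmas"
        else:
            classes[form] = "ambiguous within and between lemmas"
    return classes

entries = [
    ("köttur", 416784, "NFET"),  # unique within its paradigm
    ("ketti", 416784, "ÞGFET"),  # dat.sg. ...
    ("ketti", 416784, "ÞFFT"),   # ... and acc.pl. of the same lemma
]
print(classify(entries)["ketti"])  # ambiguous within 1 lemma
```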
15. Using the MÍM Corpus (The Tagged Icelandic Corpus)
cf. Sigrún Helgadóttir et al., poster session here at SaLTMIL
• 25 million running words from 21st-century texts.
• Tags compatible with those of the DMII.
• Because of the ambiguity described above, this is the first opportunity
  for systematic research on the inflectional system for use in the DMII.

First stage of comparing MÍM and the DMII:
Tokens in MÍM:                16,245,429
Unique tagged forms in MÍM:      737,856
  In DMII:                       425,238
  Not in DMII:                   312,618
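The first-stage comparison is a set intersection over unique tagged forms. A toy sketch (the data is invented; the real counts are on the slide):

```python
# Compare a corpus's unique tagged forms against the database,
# as in the MÍM/DMII first-stage comparison. Toy data only.
def coverage(corpus_forms, dmii_forms):
    """Return (count in DMII, count not in DMII) over unique pairs."""
    corpus, dmii = set(corpus_forms), set(dmii_forms)
    return len(corpus & dmii), len(corpus - dmii)

mim = [("ketti", "ÞGFET"), ("fenneli", "ÞGFET"), ("ketti", "ÞGFET")]
dmii = [("ketti", "ÞGFET")]
in_dmii, not_in_dmii = coverage(mim, dmii)
print(in_dmii, not_in_dmii)  # 1 1
```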
16. Analysis of the word forms from the MÍM Corpus
not found in the DMII
312,618 tagged word forms (or strings) (out of 737,856)
True Icelandic inflectional forms:   60%
Miscellaneous strings:               40%
  Foreign words:                     25%
  Errors:                             6%
  Abbreviations & acronyms:           1.7%
  Computer strings (URLs, etc.):      0.7%
All word forms showing features of adjustment to the Icelandic
inflectional system are counted as Icelandic inflectional forms.
Approx. additions to the DMII: 125,000 paradigms
+ uncounted “new” variants.
17. Data scarcity in a rich morphology
                               Paradigms     Inflectional forms
The MÍM Corpus:                   —**              623,000
The present DMII:              270,000           5,800,000
DMII with additions from MÍM:  395,000*          8,300,000*

** Figure for lemmas in MÍM not available yet.
 * Estimated figures.
• This comes as no surprise to Icelandic lexicographers; we have
always had to cope with the problem of scarcity.
• A description of Icelandic inflection based solely on the MÍM
corpus would be very meager.
18. Conclusion
• The DMII was initially made for two purposes:
As an LT resource (including lexicography)
For reference for the general public
• In spite of a long history of research, the available data was
insufficient. The DMII has therefore become increasingly important
in research on Icelandic morphology.
• Both corpora and lexicographic data are needed for the DMII.
• As the scope of the DMII expands, producing a rule system becomes
  more feasible, if one is needed.
With a PS on availability ...
19. Availability
The DMII is available for LT projects, free of charge.
Download: http://bin.arnastofnun.is/gogn/
Online version: http://bin.arnastofnun.is
The website is as yet only in Icelandic. Send me an email for an
English version of the conditions on the use of data from the DMII
or any questions: kristinb@hi.is
The website is being restructured. The new website will contain
extensive comments, explanations of grammatical features, etc.
Work on an English version of this part of the site is in progress.
The metalanguage of the paradigms themselves will be Icelandic.
Be in touch if you need information.
20.
21. Thank you for your attention
Kristín B
kristinb@hi.is
SaLTMIL-AfLat
Istanbul, May 22nd 2012