A repository of free lexical resources for African languages:
the project and the method

Piotr Bański
Institute of English Studies, University of Warsaw
E-mail: pkbanski@uw.edu.pl

Beata Wójtowicz
Dept. of African Languages and Cultures, University of Warsaw
E-mail: b.wojtowicz@uw.edu.pl

Summary

Our focus here is on FreeDict [http://www.freedict.org/], a project that has the potential to become home to, among others, free bilingual dictionaries for African languages. The project is part of SourceForge.net.

The dictionaries can be usable even in their early versions, which can be subject to further supervised improvement as user feedback accumulates: “publish early, publish often”, in the open-source way.

We demonstrate a possible process of dictionary development on the example of one of FreeDict dictionaries, the Swahili-English xFried/Freedict Dictionary, the first FreeDict dictionary encoded according to the TEI P5 XML standard.

The final product can be accessed via desktop clients, via a Firefox add-on, or on the Web [http://dict.org].


DICT

DICT (Dictionary Server Protocol; Faith and Martin 1997) is by now a well-established TCP-based query/response protocol that allows a client to access definitions from a set of various dictionary databases. It provides data in textual form, but it also has the potential of providing MIME-encoded content. The dictionary server software, dictd, is maintained and developed at SourceForge.

The DICT format is a plain text format with an accompanying index file (an option of serving MIME content also exists).

There is more than one way to query a DICT database: you can search the definitions and the headwords, using regex-based criteria.

The clients can be free-standing desktop applications or they can be integrated into editors or web browsers. DICT web gateways also exist. The DICT project provides a list of clients and alternative servers.

The screenshots demonstrate the CSS “work view” of the dictionary (v. 0.4.1 of March 28th) and the way in which the Firefox add-on client presents query results (the mismatches are due to the incomplete support for TEI P5 in the FreeDict build system, originally designed for TEI P4; support for P5 got introduced only in mid-March).


FreeDict

FreeDict was founded in 2000 as an expression of the natural open-source synergy with DICT: DICT provided the platform for disseminating content of all kinds of dictionaries, while FreeDict grouped bilingual dictionaries that could be disseminated on this platform. Later on, FreeDict adopted the TEI P4 XML format. At the moment, it has also basic support for TEI P5 (in the CVS only; this is work in progress).

FreeDict is the nexus of the following:
- XML, with its potential for creating well-structured documents,
- TEI P5, an encoding standard taking advantage of this potential,
- the SourceForge repository as well as distribution and content-management network,
- the DICT distribution network: apart from being able to query DICT servers straight from the desktop, Firefox users can also take advantage of an add-on client that returns definitions for words highlighted on a web page (an example is shown in the screenshots),
- FreeDict tools, as means to manipulate dictionaries and to create, among others, the DICT format (usable directly from DICT servers and by other dictionary-providing projects, e.g., StarDict or Open Dict); the build process provides targets for platforms other than DICT, e.g. the Evolutionary Dictionary or zbedic.

Additionally:
- Lexical resources submitted to FreeDict will be able to undergo further transformations, such as reversal or concatenation, which means that work put into developing a single resource may well be re-used in developing others.
- The project has its own distribution system, in the form of GNU/Linux packages.
- Content published by FreeDict is guaranteed to be free.


Why Swahili-English?

Just because we happen to be working on a Swahili-Polish-Swahili dictionary, and this is an offshoot of the testing phase of the project; we wanted to donate our test Swahili-English dictionary to FreeDict, and this is how the entire adventure began. This dictionary (in versions 0.3 and 0.4) replaced the earlier dictionary by the same name that Beata created from freely available GPL-ed sources.

But our point is that any dictionary of any size can be submitted!


Below are the possible stages of development of an example entry; from the simplest glossary to something close to a machine-processable lexical database (we skip xml:lang attributes).

    <entry>
      <form><orth>alasiri</orth></form>
      <def>afternoon</def>
    </entry>

    <entry xml:id="alasiri">
      <form><orth>alasiri</orth></form>
      <gramGrp><pos>n</pos></gramGrp>
      <sense>
        <def>afternoon (period between 3 p.m. and 5 p.m.)</def>
      </sense>
    </entry>

    <entry xml:id="alasiri">
      <form type="N">
        <orth>alasiri</orth>
      </form>
      <gramGrp><pos>n</pos></gramGrp>
      <sense>
        <def>afternoon</def>
        <note type="def">period between 3 p.m. and 5 p.m.</note>
      </sense>
    </entry>

    <entry xml:id="alasiri">
      <form type="N">
        <orth>alasiri</orth>
      </form>
      <gramGrp><pos>n</pos></gramGrp>
      <sense>
        <cit type="trans">
          <quote>afternoon</quote>
          <def>period between 3 p.m. and 5 p.m.</def>
        </cit>
      </sense>
    </entry>

Each of the above is transformable into a DICT-based dictionary, accessible locally or via the Internet.

And at every stage, the dictionary content can be verified in the “work view”, provided by a CSS stylesheet, as shown in the screenshots.
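
To make the idea of "transformable at every stage" concrete, here is a minimal sketch in Python of how entries at different stages might be flattened into a headword, a part of speech and a gloss. It is an illustration only: the function name `flatten_entry` and the choice of `quote`, `def` and `note` as gloss-bearing elements are assumptions made for this example, and the FreeDict build system itself is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Elements that carry translation or explanatory text in the staged entries
# (an assumption made for this sketch, based on the four stages shown).
GLOSS_TAGS = ("quote", "def", "note")


def flatten_entry(entry_xml):
    """Return (headword, part of speech or None, gloss text) for one <entry>."""
    entry = ET.fromstring(entry_xml)
    headword = (entry.findtext("form/orth") or "").strip()
    pos = entry.findtext("gramGrp/pos")
    glosses = []
    # iter() walks in document order, so glosses keep the order of the markup.
    for elem in entry.iter():
        if elem.tag in GLOSS_TAGS and elem.text:
            text = " ".join(elem.text.split())  # collapse line breaks and indentation
            if text:
                glosses.append(text)
    return headword, pos, "; ".join(glosses)


# The simplest and the most structured stage of the example entry.
STAGE_1 = """
<entry>
  <form><orth>alasiri</orth></form>
  <def>afternoon</def>
</entry>
"""

STAGE_4 = """
<entry xml:id="alasiri">
  <form type="N">
    <orth>alasiri</orth>
  </form>
  <gramGrp><pos>n</pos></gramGrp>
  <sense>
    <cit type="trans">
      <quote>afternoon</quote>
      <def>period between 3 p.m. and
           5 p.m.</def>
    </cit>
  </sense>
</entry>
"""

for stage in (STAGE_1, STAGE_4):
    headword, pos, gloss = flatten_entry(stage)
    print(f"{headword} ({pos or 'no pos'}): {gloss}")
```

The same function handles both the bare glossary entry and the cit/quote entry, which is the point of the staged approach: richer markup adds information without breaking the simpler output.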
Below, we present an example of an entry in a dictionary with somewhat more detailed and finer-grained information.

The dictionary developer needs to fill in simple templates for the relevant parts of speech.

    <entry>
      <form>
        <orth>adui</orth>
        <ref target="#maadui"/>
      </form>
      <gramGrp><pos>n</pos></gramGrp>
      <sense>
        <def>enemy</def>
      </sense>
      <sense>
        <def>opponent</def>
        <note type="hint">in games or sports</note>
      </sense>
    </entry>

Then, the predictable work is performed by XSLT scripts, which...

(a) add XML structure to the entry created by a developer

    <entry xml:id="adui">
      <form>
        <orth>adui</orth>
      </form>
      <xr type="plural-form">
        <ref target="#maadui">maadui</ref>
      </xr>
      <gramGrp>
        <pos>n</pos>
      </gramGrp>
      <sense xml:id="adui.1" n="1">
        <cit type="trans">
          <quote>enemy</quote>
        </cit>
      </sense>
      <sense xml:id="adui.2" n="2">
        <cit type="trans">
          <quote>opponent</quote>
          <note type="hint">in games or sports</note>
        </cit>
      </sense>
    </entry>

(b) create new entries (in this case, a template plural entry, containing a reference to the singular form)

    <entry xml:id="maadui">
      <form>
        <orth>maadui</orth>
      </form>
      <gramGrp>
        <pos>n</pos>
      </gramGrp>
      <sense>
        <xr type="plural-sense">Plural of
          <ref target="#adui">adui</ref>
        </xr>
        <def>enemy</def>
        <def>opponent <note type="hint">in games or sports</note></def>
      </sense>
    </entry>
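
Step (b) can be sketched as follows. The project performs this work with XSLT scripts, which are not shown on the poster; the Python below is only an illustration of the same idea, and the function name `make_plural_entry` and the exact copying rules are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# ElementTree stores xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

ADUI = """
<entry xml:id="adui">
  <form><orth>adui</orth></form>
  <xr type="plural-form"><ref target="#maadui">maadui</ref></xr>
  <gramGrp><pos>n</pos></gramGrp>
  <sense xml:id="adui.1" n="1">
    <cit type="trans"><quote>enemy</quote></cit>
  </sense>
  <sense xml:id="adui.2" n="2">
    <cit type="trans">
      <quote>opponent</quote>
      <note type="hint">in games or sports</note>
    </cit>
  </sense>
</entry>
"""


def make_plural_entry(singular):
    """Build a template plural <entry> that points back to the singular one."""
    ref = singular.find("xr[@type='plural-form']/ref")
    if ref is None:
        return None  # nothing to generate for entries without a plural link
    plural_id = ref.get("target", "").lstrip("#")
    singular_id = singular.get(XML_ID, "")

    plural = ET.Element("entry", {XML_ID: plural_id})
    form = ET.SubElement(plural, "form")
    ET.SubElement(form, "orth").text = ref.text
    gram = ET.SubElement(plural, "gramGrp")
    ET.SubElement(gram, "pos").text = singular.findtext("gramGrp/pos")

    sense = ET.SubElement(plural, "sense")
    xr = ET.SubElement(sense, "xr", {"type": "plural-sense"})
    xr.text = "Plural of "
    back = ET.SubElement(xr, "ref", {"target": "#" + singular_id})
    back.text = singular.findtext("form/orth")

    # Copy each translation equivalent of the singular as a <def>,
    # keeping any usage hint inside it.
    for cit in singular.iter("cit"):
        definition = ET.SubElement(sense, "def")
        definition.text = cit.findtext("quote")
        note = cit.find("note")
        if note is not None:
            definition.text += " "
            hint = ET.SubElement(definition, "note", dict(note.attrib))
            hint.text = note.text
    return plural


plural_entry = make_plural_entry(ET.fromstring(ADUI))
print(ET.tostring(plural_entry, encoding="unicode"))
```

The generated entry is only a template: a developer would still review it, since plural forms do not always share every sense of the singular.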



Selected references

Faith, Rik and Martin, Brett. 1997. A Dictionary Server Protocol. Request for Comments: 2229 (RFC #2229). Network Working Group. Available from ftp://ftp.isi.edu/in-notes/rfc2229.txt

TEI Consortium, eds. 2007. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 1.2.0. Last updated on October 31st 2008. TEI Consortium. Available from http://www.tei-c.org/Guidelines/P5/


Developments planned for the near future

After we reach version 0.5 with the cit/quote markup, we plan to start experimenting with dictionary reversal and concatenation (crossing).

The support for LIFT (Lexicon Interchange FormaT) is next on the agenda.

More XML technology: tools for feeding dictionaries into, and querying their contents from, native XML databases.
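
As a rough illustration of what dictionary reversal involves, the Python sketch below inverts headword/translation pairs taken from the example entries on this poster. It is not the planned FreeDict implementation: the function name `reverse_dictionary` and the input shape (a list of headword and translation-list pairs) are assumptions made for this example, and a real reversal would also have to deal with hints, parts of speech and multi-word glosses.

```python
from collections import defaultdict


def reverse_dictionary(entries):
    """Turn (headword, [translations]) pairs into a translation -> headwords map."""
    reversed_index = defaultdict(list)
    for headword, translations in entries:
        for translation in translations:
            # Several source headwords may share one translation; keep each once.
            if headword not in reversed_index[translation]:
                reversed_index[translation].append(headword)
    # Sort the new headwords so the reversed dictionary has a stable order.
    return {key: reversed_index[key] for key in sorted(reversed_index)}


# Swahili-English pairs as they appear in the example entries,
# including the glosses copied into the template plural entry.
SWAHILI_ENGLISH = [
    ("alasiri", ["afternoon"]),
    ("adui", ["enemy", "opponent"]),
    ("maadui", ["enemy", "opponent"]),
]

for english, swahili in reverse_dictionary(SWAHILI_ENGLISH).items():
    print(english, "->", ", ".join(swahili))
```

Even this small case shows why reversal needs supervision: the singular and the template plural both end up under the same English headword, and someone has to decide how to present them.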

Language Technologies for African Languages (AfLaT 2009) Workshop, EACL 2009, Athens, 31 March 2009

 
A System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá CharactersA System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá Characters
 
IFE-MT: An English-to-Yorùbá Machine Translation System
IFE-MT: An English-to-Yorùbá Machine Translation SystemIFE-MT: An English-to-Yorùbá Machine Translation System
IFE-MT: An English-to-Yorùbá Machine Translation System
 
A Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription SystemA Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription System
 

Recently uploaded

Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
tolgahangng
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
Zilliz
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
IndexBug
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
kumardaparthi1024
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 

Recently uploaded (20)

Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 

A Repository of Free Lexical Resources for African Languages: The Project and the Method

Piotr Bański, Institute of English Studies, University of Warsaw. E-mail: pkbanski@uw.edu.pl
Beata Wójtowicz, Dept. of African Languages and Cultures, University of Warsaw. E-mail: b.wojtowicz@uw.edu.pl

Summary

Our focus here is on FreeDict [http://www.freedict.org/], a project that has the potential to become home to, among others, free bilingual dictionaries for African languages. The project is part of SourceForge.net.

The dictionaries can be usable even in their early versions, which can be subject to further supervised improvement as user feedback accumulates: “publish early, publish often”, in the open-source way.

We demonstrate a possible process of dictionary development on the example of one of the FreeDict dictionaries, the Swahili-English xFried/Freedict Dictionary, which is the first FreeDict dictionary encoded according to the TEI P5 XML standard. The final product can be accessed via desktop clients, via a Firefox add-on, or on the Web [http://dict.org].

DICT

DICT (Dictionary Server Protocol; Faith and Martin 1997) is by now a well-established TCP-based query/response protocol that allows a client to access definitions from a set of various dictionary databases. It provides data in textual form, but it also has the potential of providing MIME-encoded content. The dictionary server software, dictd, is maintained and developed at SourceForge.

The DICT format is a plain text format with an accompanying index file (an option of serving MIME content also exists).

There is more than one way to query a DICT database: you can search the definitions and the headwords, using regex-based criteria.

The clients can be free-standing desktop applications or they can be integrated into editors or web browsers. DICT web gateways also exist. The DICT project provides a list of clients and alternative servers.

FreeDict

FreeDict was founded in 2000 as an expression of the natural open-source synergy with DICT: DICT provided the platform for disseminating content of all kinds of dictionaries, while FreeDict grouped bilingual dictionaries that could be disseminated on this platform. Later on, FreeDict adopted the TEI P4 XML format. At the moment, it also has basic support for TEI P5 (in the CVS only; this is work in progress).

FreeDict is the nexus of the following:

  • XML, with its potential for creating well-structured documents,
  • TEI P5, an encoding standard taking advantage of this potential,
  • the SourceForge repository as well as distribution and content-management network,
  • the DICT distribution network: apart from being able to query DICT servers straight from the desktop, Firefox users can also take advantage of an add-on client that returns definitions for words highlighted on a web page (an example is shown in the screenshots),
  • FreeDict tools, as means to manipulate dictionaries and to create, among others, the DICT format (usable directly from DICT servers and by other dictionary-providing projects, e.g., StarDict or Open Dict); the build process provides targets for platforms other than DICT, e.g. the Evolutionary Dictionary or zbedic.

Additionally:

  • Lexical resources submitted to FreeDict will be able to undergo further transformations, such as reversal or concatenation, which means that work put into developing a single resource may well be re-used in developing others.
  • The project has its own distribution system, in the form of GNU/Linux packages.
  • Content published by FreeDict is guaranteed to be free.

[Screenshots] The screenshots demonstrate the CSS “work view” of the dictionary (v. 0.4.1 of March 28th) and the way in which the Firefox add-on client presents query results (the mismatches are due to the incomplete support for TEI P5 in the FreeDict build system, originally designed for TEI P4; support for P5 was introduced only in mid-March).

Why Swahili-English?

Just because we happen to be working on a Swahili-Polish-Swahili dictionary, and this is an offshoot of the testing phase of the project; we wanted to donate our test Swahili-English dictionary to FreeDict, and this is how the entire adventure began. This dictionary (in versions 0.3 and 0.4) replaced the earlier dictionary by the same name that Beata created from freely available GPL-ed sources.

But our point is that any dictionary of any size can be submitted!

Stages of development of an example entry

Below are the possible stages of development of an example entry, from the simplest glossary to something close to a machine-processable lexical database (we skip xml:lang attributes).

Stage 1:

    <entry>
      <form><orth>alasiri</orth></form>
      <def>afternoon</def>
    </entry>

Stage 2:

    <entry xml:id="alasiri">
      <form><orth>alasiri</orth></form>
      <gramGrp><pos>n</pos></gramGrp>
      <sense>
        <def>afternoon (period between 3 p.m. and 5 p.m.)</def>
      </sense>
    </entry>

Stage 3:

    <entry xml:id="alasiri">
      <form type="N">
        <orth>alasiri</orth>
      </form>
      <gramGrp><pos>n</pos></gramGrp>
      <sense>
        <def>afternoon</def>
        <note type="def">period between 3 p.m. and 5 p.m.</note>
      </sense>
    </entry>

Stage 4:

    <entry xml:id="alasiri">
      <form type="N">
        <orth>alasiri</orth>
      </form>
      <gramGrp><pos>n</pos></gramGrp>
      <sense>
        <cit type="trans">
          <quote>afternoon</quote>
          <def>period between 3 p.m. and 5 p.m.</def>
        </cit>
      </sense>
    </entry>

Each of the above is transformable into a DICT-based dictionary, accessible locally or via the Internet. And at every stage, the dictionary content can be verified in the “work view”, provided by a CSS stylesheet, as shown in the screenshots.

Templates and XSLT scripts

Below, we present an example of an entry in a dictionary with a somewhat detailed amount of information and granularity thereof. The dictionary developer needs to fill in simple templates for the relevant parts of speech:

    <entry>
      <form>
        <orth>adui</orth>
        <ref target="#maadui"/>
      </form>
      <gramGrp><pos>n</pos></gramGrp>
      <sense>
        <def>enemy</def>
      </sense>
      <sense>
        <def>opponent</def>
        <note type="hint">in games or sports</note>
      </sense>
    </entry>

Then, the predictable work is performed by XSLT scripts, which...

(a) add XML structure to the entry created by a developer:

    <entry xml:id="adui">
      <form>
        <orth>adui</orth>
      </form>
      <xr type="plural-form">
        <ref target="#maadui">maadui</ref>
      </xr>
      <gramGrp>
        <pos>n</pos>
      </gramGrp>
      <sense xml:id="adui.1" n="1">
        <cit type="trans">
          <quote>enemy</quote>
        </cit>
      </sense>
      <sense xml:id="adui.2" n="2">
        <cit type="trans">
          <quote>opponent</quote>
          <note type="hint">in games or sports</note>
        </cit>
      </sense>
    </entry>

(b) create new entries (in this case, a template plural entry, containing a reference to the singular form):

    <entry xml:id="maadui">
      <form>
        <orth>maadui</orth>
      </form>
      <gramGrp>
        <pos>n</pos>
      </gramGrp>
      <sense>
        <xr type="plural-sense">Plural of
          <ref target="#adui">adui</ref>
        </xr>
        <def>enemy</def>
        <def>opponent <note type="hint">in games or sports</note></def>
      </sense>
    </entry>

Developments planned for the near future

  • After we reach version 0.5 with the cit/quote markup, we plan to start experimenting with dictionary reversal and concatenation (crossing).
  • The support for LIFT (Lexicon Interchange FormaT) is next on the agenda.
  • More XML technology: tools for feeding dictionaries into, and querying their contents from, native XML databases.

Selected references

Faith, Rik and Martin, Brett. 1997. A Dictionary Server Protocol. Request for Comments: 2229 (RFC #2229). Network Working Group. Available from ftp://ftp.isi.edu/in-notes/rfc2229.txt

TEI Consortium, eds. 2007. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 1.2.0. Last updated on October 31st 2008. TEI Consortium. Available from http://www.tei-c.org/Guidelines/P5/

Language Technologies for African Languages (AfLaT 2009) Workshop, EACL 2009, Athens, 31 March 2009
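To make the two XSLT steps on the poster more concrete ((a) adding XML structure to a developer's entry, (b) creating a template plural entry that points back to the singular), here is a minimal Python sketch of the same idea. It is not the FreeDict build code, which uses XSLT. The function name `expand`, the `TEMPLATE` string, and the omission of the TEI namespace are assumptions made purely for illustration.

```python
import copy
import xml.etree.ElementTree as ET

# Clark-notation name of xml:id; ElementTree serializes it as "xml:id".
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical developer template, modelled on the poster's "adui" example
# (TEI namespace left out for brevity).
TEMPLATE = """
<entry>
  <form><orth>adui</orth><ref target="#maadui"/></form>
  <gramGrp><pos>n</pos></gramGrp>
  <sense><def>enemy</def></sense>
  <sense><def>opponent</def><note type="hint">in games or sports</note></sense>
</entry>
"""


def copy_notes(source_sense, parent):
    """Append copies of a sense's <note> children to `parent`."""
    for note in source_sense.findall("note"):
        note_copy = copy.deepcopy(note)
        note_copy.tail = None  # drop layout whitespace from the template
        parent.append(note_copy)


def expand(template):
    """Return (singular_entry, plural_entry) derived from a template entry."""
    lemma = template.findtext("form/orth")
    plural = template.find("form/ref").get("target").lstrip("#")
    pos = template.findtext("gramGrp/pos")

    # (a) add XML structure: ids, a cross-reference to the plural form,
    #     numbered senses, and cit/quote markup for the translations.
    singular = ET.Element("entry", {XML_ID: lemma})
    ET.SubElement(ET.SubElement(singular, "form"), "orth").text = lemma
    xr = ET.SubElement(singular, "xr", {"type": "plural-form"})
    ET.SubElement(xr, "ref", {"target": "#" + plural}).text = plural
    ET.SubElement(ET.SubElement(singular, "gramGrp"), "pos").text = pos
    for n, sense in enumerate(template.findall("sense"), start=1):
        new_sense = ET.SubElement(
            singular, "sense", {XML_ID: f"{lemma}.{n}", "n": str(n)}
        )
        cit = ET.SubElement(new_sense, "cit", {"type": "trans"})
        ET.SubElement(cit, "quote").text = sense.findtext("def")
        copy_notes(sense, cit)

    # (b) create a new plural entry that refers back to the singular.
    plural_entry = ET.Element("entry", {XML_ID: plural})
    ET.SubElement(ET.SubElement(plural_entry, "form"), "orth").text = plural
    ET.SubElement(ET.SubElement(plural_entry, "gramGrp"), "pos").text = pos
    plural_sense = ET.SubElement(plural_entry, "sense")
    back = ET.SubElement(plural_sense, "xr", {"type": "plural-sense"})
    back.text = "Plural of "
    ET.SubElement(back, "ref", {"target": "#" + lemma}).text = lemma
    for sense in template.findall("sense"):
        definition = ET.SubElement(plural_sense, "def")
        definition.text = sense.findtext("def")
        copy_notes(sense, definition)

    return singular, plural_entry


if __name__ == "__main__":
    for entry in expand(ET.fromstring(TEMPLATE)):
        print(ET.tostring(entry, encoding="unicode"))
```

Keeping the developer's input minimal and generating identifiers and cross-references mechanically means that the singular and plural entries cannot drift apart, which is presumably the motivation for leaving this "predictable work" to scripts.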