Open-Content Text Corpus
                        for African languages

               Piotr Bański                                       Beata Wójtowicz

       Institute of English Studies,           Dept. of African Languages and Cultures,
          University of Warsaw                            University of Warsaw
          pkbanski@uw.edu.pl                            b.wojtowicz@uw.edu.pl




We wish to acknowledge support from grant # N104 050437
from the Polish Ministry of Science and Higher Education.




                                AfLaT, Valletta, Malta, May 2010
Data islands
Genesis of data islands
 • it's difficult to give (my preciousss...)
 • it's difficult to take (the NIH syndrome: a stranger in my house??)

Data islands are harmful, particularly for languages
 • with less financial support,
 • with lesser chances of growing armies of their own researchers,
 • that are endangered.

Let us fight them in a way that addresses the broadest range of problems
in a single move:
  • preserve data,
  • give it a chance to improve,
  • make it possible to train researchers,
  • at a low cost.




P. Bański and B. Wójtowicz, AfLaT, Valletta, Malta, 18 May 2010
Data islands: it's mine, mine...

Various reasons:
 • sources closed (but have you asked for re-licensing?)
 • unsustainable format (Word, etc.)
 • it's my sweat and blood (read: time and expertise), so why should I give it out
   for free? or
 • I don't want them to have a head start... let them sweat (and pay) too; or
 • it's too small, no one would be interested.




Data islands: the NIH syndrome
                                     NIH = Not Invented Here


Sometimes it's quite understandable:
  • unclear methodology,
  • irreproducible results,
  • “wrong” theoretical approach.


But if the above is controlled for, NIH remains an unreflective bad habit with no
justification.




Let's fight that: FreeDict '09
We've already tried once here, with meagre success within this community:

Bański + Wójtowicz (AfLaT 2009). A Repository of Free Lexical Resources for
African Languages: The Project and the Method.

                                        http://freedict.org

BTW, FreeDict is doing well, and our invitation stands.

As far as African languages are concerned:
 • we have added a small Swahili-Polish dictionary, and
 • the Swahili-English dictionary is being upgraded with Arabic-script variants of
    headwords.




Let's try again: OCTC '10
OCTC = Open-Content Text Corpus

This is our remedy for difficulties in giving and difficulties in taking:
 • a platform to store your data securely, and
 • to make it possible for you to have others improve on it, for mutual benefit;
 • a platform to distribute your research,
 • to present your methodology transparently, and
 • without injecting your “wrong” theory into the data;
 • a platform to give you new perspectives for research, and
 • to let you get engaged in cooperation with other researchers in your area.

(Lots of promises, let's see.)

                               http://OCTC.sourceforge.net/


Gross structure of the OCTC
Modules:
 • lg/ – monolingual subcorpora (e.g. Polish, Swahili, etc.)
 • align/ – aligned bi/multilingual parallel subcorpora (e.g. Polish-Swahili, etc.)
 • core/ – the trunk of the corpus (schemas, tools, corpus-level metadata)

For the time being, the OCTC contains “seeds” (minimal subcorpora in the form of
the Universal Declaration of Human Rights) for 55 languages.

→ this is very important, because these seeds are where individual researchers
can continue from:
  • they have a working model to base the format of their data on,
  • they are likely to see tools that will operate on what they contribute from day 1.

In some cases, the seeds have already grown (the Polish and Swahili subcorpora,
Czech data coming soon).

The UDHR subcorpus is at the same time a parallel subcorpus (!)


Format of the OCTC
The OCTC is encoded in XML,
following the recommendations of the Text Encoding Initiative (TEI)

TEI:
 • a de facto standard for text encoding in the Humanities, ranging from
    ◦manuscript encoding, through
    ◦dictionary encoding, to
    ◦corpus encoding and
    ◦archive encoding (e.g. in Project Gutenberg).

  • The first corpus encoded in the TEI: BNC (British National Corpus);
  • Many others followed, including one we are involved in, namely the NCP/NKJP
    (National Corpus of Polish – 109 billion segments, already available for
    searching);
  • Piotr was the XML architect for the NCP; the format of the OCTC is an
    extension of his ideas and the input of the NCP team.



OCTC for Africanists (I)
Part of SourceForge,
 • a stable environment for creation and distribution of data and tools,
 • with nearly 20 mirrors world-wide (a disk crash is not a problem...)
→ your data is safe here, and you don't need to worry about the distribution.


Open-source tools, open-content data
→ no methodological doubts, you can see what has been done (data manipulation
is transparent); the accompanying tools are open-source as well.


Version-control mechanism (Subversion)
→ it is possible to take snapshots of the corpus before any measurements are
performed (reproducibility of results!)
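One way to make such a snapshot citable is to record the revision number together with the repository address. The following is only a minimal sketch: the repository URL and the revision number are placeholders, not the real OCTC values, and the `Snapshot` class is a hypothetical helper. It only builds the `svn export -r` command line and does not run it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """A pinned state of a Subversion repository (hypothetical helper)."""

    repo_url: str  # placeholder URL, not the actual OCTC repository
    revision: int  # Subversion revision number recorded before measuring

    def export_command(self, target_dir: str) -> list[str]:
        # `svn export -r REV URL PATH` fetches an unversioned copy of that revision.
        return ["svn", "export", "-r", str(self.revision), self.repo_url, target_dir]

    def citation(self) -> str:
        # URL@REV is Subversion's peg-revision notation; handy in a methods section.
        return f"{self.repo_url}@{self.revision}"


snap = Snapshot("https://example.org/svn/octc/trunk", 123)
command = snap.export_command("octc-r123")
print(" ".join(command))
print(snap.citation())
```

Reporting the `URL@REV` string next to each measurement lets others fetch exactly the same state of the corpus.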




OCTC for Africanists (II)

Stand-off annotation format:
  • the source text is kept separate, in a nearly un-annotated form,
  • annotations (in separate files) form layers over the source data; they provide
    views of the data;
  • there is thus no danger of “polluting” the source data with the “wrong” theory –
    you can always add your own segmentation, your own layer of POS markup,
    your favourite syntactic modelling, etc...
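The idea of layers over untouched source text can be illustrated with a small sketch. It is an assumption-laden simplification: the OCTC itself uses TEI XML files, whereas here a layer is just a list of character-offset spans in Python, and the `Span` class, the helper functions and the tag names are hypothetical.

```python
from dataclasses import dataclass

# The source text stays untouched; every layer only points into it.
SOURCE = "Watu wote wamezaliwa huru"


@dataclass(frozen=True)
class Span:
    start: int  # offset of the first character
    end: int    # offset just past the last character
    label: str  # illustrative tag, not an OCTC tag set


def whitespace_segments(source: str) -> list[Span]:
    """One possible segmentation layer: split on whitespace."""
    spans, pos = [], 0
    for word in source.split():
        start = source.index(word, pos)
        pos = start + len(word)
        spans.append(Span(start, pos, "w"))
    return spans


def resolve(source: str, layer: list[Span]) -> list[tuple[str, str]]:
    """Turn a layer into a view: the text each span covers, plus its label."""
    return [(source[s.start:s.end], s.label) for s in layer]


segmentation = whitespace_segments(SOURCE)

# A second, independent layer: a morphological split of "wamezaliwa".
morph_layer = [
    Span(10, 12, "subject-prefix"),
    Span(12, 14, "tense"),
    Span(14, 20, "stem"),
]

print(resolve(SOURCE, segmentation))
print(resolve(SOURCE, morph_layer))
```

Both layers coexist because neither modifies `SOURCE`; a researcher who disagrees with one of them can add a third without touching the other two.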


Licensing (GNU General Public License)
  • a guarantee that your data will remain free and
  • that you will be able to use it after others have improved/enlarged it
   (the same is true about your tools).



A wealth of research directions
  • just data storage
  • data interactions (alignment)
  • tool testing/training
  • methodology testing (e.g. memory-based approaches)
    ◦taggers,
    ◦sentencers
    ◦segmenters
  • research/programming environment testing: UIMA, GATE, ...?
  • applications in
    ◦lexicography,
    ◦machine translation,
    ◦translating aids (see below),
    ◦information extraction,
    ◦etc., etc. (whatever corpora are good for)

Licensing issues
GNU General Public License:
 • it's free as in “free speech”
 • you can do anything with a GPL-ed resource, on one condition:
 • the GPL gene stays in the family (once it's made free, the resource, and its
   descendants, have to remain free)


This is the only sensible way to handle fragile resources from “non-central”,
“low-density”, “under-represented” languages. Otherwise
 • data might be lost irretrievably,
 • methods of creating it are not transparent, and therefore experiments are not
    reproducible.

When GPL-ed, data has a chance to get refined and come back to you for
further processing.



GNU GPL questions
Will they sell my work??
 • will they? after all, people can get it for free
 • but if they do, they are bound to keep the authorship information

But I am entitled to make a living on this!
 • by all means – if you can sell it, sell it, but perhaps consider releasing an earlier
   version of your resources under the GPL
   • (why?) for those who speak the language that you are able to sell, so that the
      language also gets support and maybe grows its own specialists, thanks
      partially to the data you release.

How else will I earn money?
 • there's more than one way (consulting, making your name known in the
   community: academic/industry career).

I want to co-operate with a large company
  • but then you do not control your data's destiny – your data and effort may get
    lost due to the company's decision: Scannell, 2008, on Irish support dropped by
    Apple and Dzongkha support dropped by Microsoft.
Community-building issues

SourceForge has all the community-building facilities one could dream of:
 • mailing lists
 • bulletin board
 • bug/issue tracker
 • project newsfeed
 • wiki

Administration:
 • we want to stay in the background, giving control of individual subprojects to the
   particular subcommunities;
 • we want to watch over the format and over the licensing, and that is basically all




The Omega adventure – recipe
  • take one small FreeDict dictionary
    (Swahili-Polish, edited years ago by Beata Wójtowicz and remaining a data
    island for several years, until I converted it into the TEI for FreeDict)

→ prepare a flat TSV glossary


  • take one OCTC aligned text for the same pair of languages
    (Swahili-Polish aligned text of the Universal Declaration of Human Rights)

→ convert it to TMX bitext (by an OCTC tool; all of that can be verified in the SVN)


  • plug both into OmegaT+ – an open-source translation aid

→ see the next slide for a little demo:



Proof of concept: OmegaT+
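The two inputs from the recipe, a flat TSV glossary and a TMX bitext, could be prepared roughly as sketched here. This is not the OCTC conversion tool: the function names, the header values and the two-entry glossary are illustrative assumptions, and it assumes the translation aid reads one tab-separated source/target pair per line. The sentence pair is Article 1 of the UDHR in Swahili and Polish.

```python
import csv
import io
import xml.etree.ElementTree as ET

# ElementTree serialises this qualified name as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def glossary_tsv(entries):
    """Flatten (source term, target term) pairs into tab-separated lines."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(entries)
    return buffer.getvalue()


def bitext_tmx(pairs, src_lang, tgt_lang):
    """Wrap aligned sentence pairs in a minimal TMX 1.4 document."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "octc-sketch",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "TEI",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per aligned pair
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


glossary = glossary_tsv([("mtu", "człowiek"), ("haki", "prawo")])
tmx_doc = bitext_tmx(
    [("Watu wote wamezaliwa huru, hadhi na haki zao ni sawa.",
      "Wszyscy ludzie rodzą się wolni i równi pod względem swej godności i swych praw.")],
    "sw", "pl",
)
print(glossary)
print(tmx_doc)
```

Writing `glossary` and `tmx_doc` to files in the glossary and translation-memory folders of a translation project is then a matter of two `open(...).write(...)` calls.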




Conclusion



                                   Perspectives: open-ended!




We'd like to co-operate with people whom we more-or-less know, so...

                              ...talk to us during the break, ok? :-)




Thank you

                  http://OCTC.sourceforge.net/




Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
 
Issues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken IrishIssues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken Irish
 
How to build language technology resources for the next 100 years
How to build language technology resources for the next 100 yearsHow to build language technology resources for the next 100 years
How to build language technology resources for the next 100 years
 
Towards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound AnalysersTowards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound Analysers
 
The PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource DevelopmentThe PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource Development
 
A System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá CharactersA System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá Characters
 
IFE-MT: An English-to-Yorùbá Machine Translation System
IFE-MT: An English-to-Yorùbá Machine Translation SystemIFE-MT: An English-to-Yorùbá Machine Translation System
IFE-MT: An English-to-Yorùbá Machine Translation System
 
A Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription SystemA Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription System
 

Recently uploaded

Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 

Recently uploaded (20)

Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 

Open-Content Text Corpus Provides Platform for African Language Data

  • 1. Open-Content Text Corpus for African languages
    Piotr Bański, Institute of English Studies, University of Warsaw, pkbanski@uw.edu.pl
    Beata Wójtowicz, Dept. of African Languages and Cultures, University of Warsaw, b.wojtowicz@uw.edu.pl
    We wish to acknowledge support from grant # N104 050437 from the Polish Ministry of Science and Higher Education.
    AfLaT, Valletta, Malta, May 2010
  • 2. Data islands
    Genesis of data islands:
    • it's difficult to give (my preciousss...)
    • it's difficult to take (the NIH syndrome: a stranger in my house??)
    Data islands are harmful, particularly for languages
    • with less financial support,
    • with smaller chances of growing armies of their own researchers,
    • that are endangered.
    Let us fight them in a way that addresses the broadest range of problems in a single move:
    • preserve data,
    • give it a chance to improve,
    • make it possible to train researchers,
    • at a low cost.
    P. Bański and B. Wójtowicz, AfLaT, Valletta, Malta, 18 May 2010
  • 3. Data islands: it's mine, mine...
    Various reasons:
    • sources closed (but have you asked for re-licensing?)
    • unsustainable format (Word, etc.)
    • it's my sweat and blood (read: time and expertise), so why should I give it out for free? or
    • I don't want them to have a head start... let them sweat (and pay) too; or
    • it's too small, no one would be interested.
  • 4. Data islands: the NIH syndrome
    NIH = Not Invented Here
    Sometimes it's quite understandable:
    • unclear methodology,
    • irreproducible results,
    • “wrong” theoretical approach.
    But if the above is controlled for, NIH remains an unreflective bad habit with no justification.
  • 5. Let's fight that: FreeDict '09
    We've already tried once here, with meagre success within this community:
    Bański + Wójtowicz (AfLaT 2009). A Repository of Free Lexical Resources for African Languages: The Project and the Method.
    http://freedict.org
    BTW, FreeDict is doing well, and our invitation stands. As far as African languages are concerned:
    • we have added a small Swahili-Polish dictionary, and
    • the Swahili-English dictionary is being upgraded with Arabic-script variants of headwords.
  • 6. Let's try again: OCTC '10
    OCTC = Open-Content Text Corpus
    This is our remedy for the difficulties in giving and the difficulties in taking:
    • a platform to store your data securely, and
      ◦ to make it possible for you to have others improve on it, for mutual benefit;
    • a platform to distribute your research,
      ◦ to present your methodology transparently, and
      ◦ without injecting your “wrong” theory into the data;
    • a platform to give you new perspectives for research, and
      ◦ to let you get engaged in cooperation with other researchers in your area.
    (Lots of promises, let's see.)
    http://OCTC.sourceforge.net/
  • 7. Gross structure of the OCTC
    Modules:
    • lg/: monolingual subcorpora (e.g. Polish, Swahili, etc.)
    • align/: aligned bi/multilingual parallel subcorpora (e.g. Polish-Swahili, etc.)
    • core/: the trunk of the corpus (schemas, tools, corpus-level metadata)
    For the time being, the OCTC contains “seeds” (minimal subcorpora in the form of the Universal Declaration of Human Rights) for 55 languages.
    → this is very important, because these seeds are where individual researchers can continue from:
    • they have a working model to base the format of their data on,
    • they are likely to see tools that will operate on what they contribute from day 1.
    In some cases, the seeds have already grown (the Polish and Swahili subcorpora, Czech data coming soon).
    The UDHR subcorpus is at the same time a parallel subcorpus (!)
  • 8. Format of the OCTC
    The OCTC is encoded in XML, following the recommendations of the Text Encoding Initiative (TEI).
    TEI:
    • a de facto standard for text encoding in the Humanities, ranging from
      ◦ manuscript encoding through
      ◦ dictionary encoding to
      ◦ corpus encoding and
      ◦ archive encoding (e.g. in Project Gutenberg).
    • The first corpus encoded in the TEI: BNC (British National Corpus);
    • Many others followed, including one we are involved in, namely the NCP/NKJP (National Corpus of Polish: 10^9, i.e. a billion, segments, already available for searching);
    • Piotr was the XML architect for the NCP; the format of the OCTC is an extension of his ideas and the input of the NCP team.
  • 9. OCTC for Africanists (I)
    Part of SourceForge:
    • a stable environment for creation and distribution of data and tools,
    • with nearly 20 mirrors world-wide (a disk crash is not a problem...)
    → your data is safe here, and you don't need to worry about the distribution.
    Open-source tools, open-content data
    → no methodological doubts, you can see what has been done (data manipulation is transparent); the accompanying tools are open-source as well.
    Version-control mechanism (Subversion)
    → it is possible to take snapshots of the corpus before any measurements are performed (reproducibility of results!)
  • 10. OCTC for Africanists (II)
    Stand-off annotation format:
    • the source text is kept separate, in a nearly un-annotated form,
    • annotations (in separate files) form layers over the source data; they provide views of the data;
    • there is thus no danger of “polluting” the source data with the “wrong” theory: you can always add your own segmentation, your own layer of POS markup, your favourite syntactic modelling, etc...
    Licensing (GNU General Public License):
    • a guarantee that your data will remain free and
    • that you will be able to use it after others have improved/enlarged it (the same is true about your tools).
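    The stand-off idea from slide 10 can be illustrated with a minimal Python sketch. This is not the OCTC format (the corpus uses TEI XML files); the `Span` class, the character-offset convention, the layer names and the sample string are all assumptions made for illustration. The point it shows is that annotation layers only point into the source text, so any number of competing analyses can coexist without touching it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    """One stand-off annotation: offsets into the source text plus a label."""
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    label: str

# The source text is stored once and never modified by any annotation layer.
SOURCE = "Watu wote wamezaliwa huru"  # sample string, used only for illustration

def whitespace_segmentation(text):
    """Layer 1 (hypothetical): one span per whitespace-delimited token."""
    spans, pos = [], 0
    for token in text.split():
        start = text.index(token, pos)
        end = start + len(token)
        spans.append(Span(start, end, "seg"))
        pos = end
    return spans

def sentence_layer(text):
    """Layer 2 (hypothetical): a competing, coarser view of the same text."""
    return [Span(0, len(text), "s")]

def view(text, layer):
    """Resolve a layer against the source to get (substring, label) pairs."""
    return [(text[s.start:s.end], s.label) for s in layer]

# Layers live side by side; none of them alters SOURCE.
layers = {
    "segmentation": whitespace_segmentation(SOURCE),
    "sentences": sentence_layer(SOURCE),
}

for name, layer in layers.items():
    print(name, view(SOURCE, layer))
```

    Keeping offsets rather than inline markup is one possible design; a TEI-based corpus would typically point at identified elements instead, but the separation of source and annotation is the same.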
  • 11. A wealth of research directions
    • just data storage
    • data interactions (alignment)
    • tool testing/training
    • methodology testing (e.g. memory-based approaches)
      ◦ taggers,
      ◦ sentencers,
      ◦ segmenters
    • research/programming environment testing: UIMA, GATE, ...?
    • applications in
      ◦ lexicography,
      ◦ machine translation,
      ◦ translation aids (see below),
      ◦ information extraction,
      ◦ etc., etc. (whatever corpora are good for)
  • 12. Licensing issues
    GNU General Public License:
    • it's free as in “free speech”
    • you can do anything with a GPL-ed resource, on one condition:
    • the GPL gene stays in the family (once it's made free, the resource, and its descendants, have to remain free)
    This is the only sensible way to handle fragile resources from “non-central”, “low-density”, “under-represented” languages. Otherwise
    • data might be lost irretrievably,
    • methods of creating it are not transparent, and therefore experiments are not reproducible.
    When GPL-ed, data has a chance to get refined and come back to you for further processing.
  • 13. GNU GPL questions
    Will they sell my work??
    • will they? after all, people can get it for free
    • but if they do, they are bound to keep the information about the authorship
    But I am entitled to make a living on this!
    • by all means: if you can sell it, sell it, but perhaps consider releasing an earlier version of your resources under the GPL
    • (why?) for the speakers of the language whose resources you are able to sell, so that the language also gets support and maybe grows its own specialists, thanks partly to the data you release.
    How else will I earn money?
    • there's more than one way (consulting, making your name known in the community: academic/industry career).
    I want to co-operate with a large company
    • but then you do not control your data's destiny: your data and effort may get lost due to the company's decision (Scannell, 2008, on Irish support dropped by Apple and Dzongkha support dropped by Microsoft).
  • 14. Community-building issues
    SourceForge has all the community-building facilities one could dream of:
    • mailing lists
    • bulletin board
    • bug/issue tracker
    • project newsfeed
    • wiki
    Administration:
    • we want to stay in the background, giving control of individual subprojects to the particular subcommunities;
    • we want to watch over the format and over the licensing, and that is basically all.
  • 15. The Omega adventure: recipe
    • take one small FreeDict dictionary (Swahili-Polish, edited years ago by Beata Wójtowicz and remaining a data island for several years, until I converted it into the TEI for FreeDict)
      → prepare a flat TSV glossary
    • take one OCTC aligned text for the same pair of languages (Swahili-Polish aligned text of the Universal Declaration of Human Rights)
      → convert it to TMX bitext (by an OCTC tool; all of that can be verified in the SVN)
    • plug both into OmegaT+, an open-source translation aid
      → see the next slide for a little demo:
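    The two conversion steps in the recipe on slide 15 can be sketched in Python as follows. This is not the OCTC tool mentioned on the slide; the function names, the header values (such as `octc-sketch`) and the placeholder segments are assumptions, and the three-column glossary layout (source term, target term, optional comment) is assumed to be what the translation aid expects. The sketch only shows the general shape of a TMX bitext and of a tab-separated glossary.

```python
import csv
import io
import xml.etree.ElementTree as ET

# ElementTree serialises this qualified name as the attribute xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def pairs_to_tmx(pairs, src_lang, tgt_lang):
    """Build a TMX 1.4 document from (source, target) aligned segment pairs."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "octc-sketch",      # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")      # one translation unit per pair
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")

def entries_to_tsv(entries):
    """Flatten (headword, translation) dictionary entries into TSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerows(entries)
    return buf.getvalue()

# Placeholder data: real input would come from the aligned UDHR texts
# and from the FreeDict dictionary.
aligned = [("sw segment 1", "pl segment 1"), ("sw segment 2", "pl segment 2")]
glossary = [("sw headword", "pl equivalent")]

tmx_text = pairs_to_tmx(aligned, "sw", "pl")
tsv_text = entries_to_tsv(glossary)
print(tmx_text)
print(tsv_text)
```

    Both outputs are plain text files, so they can be kept under version control next to the corpus and regenerated whenever the dictionary or the alignment improves.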
  • 16. Proof of concept: OmegaT+
  • 17. Conclusion
    Perspectives: open-ended! We'd like to co-operate with people whom we more-or-less know, so...
    ...talk to us during the break, ok? :-)
  • 18. Thank you
    http://OCTC.sourceforge.net/