Open-Content Text Corpus Provides Platform for African Language Data

Open-Content Text Corpus
for African languages

Piotr Bański
Institute of English Studies, University of Warsaw
pkbanski@uw.edu.pl

Beata Wójtowicz
Dept. of African Languages and Cultures, University of Warsaw
b.wojtowicz@uw.edu.pl

We wish to acknowledge support from grant # N104 050437
from the Polish Ministry of Science and Higher Education.

AfLaT, Valletta, Malta, May 2010

Data islands
Genesis of data islands
• it's difficult to give (my preciousss...)
• it's difficult to take (the NIH syndrome: a stranger in my house??)

Data islands are harmful, particularly for languages that
• receive less financial support,
• have slimmer chances of growing armies of their own researchers,
• are endangered.

Let us fight them in a way that addresses the broadest range of problems
in a single move:
• preserve data,
• give it a chance to improve,
• make it possible to train researchers,
• at a low cost.

P. Bański and B. Wójtowicz, AfLaT, Valletta, Malta, 18 May 2010

Data islands: it's mine, mine...

Various reasons:
• sources closed (but have you asked for re-licensing?)
• unsustainable format (Word, etc.)
• it's my sweat and blood (read: time and expertise), so why should I give it out
for free? or
• I don't want them to have a head start... let them sweat (and pay) too; or
• it's too small, no one would be interested.


Data islands: the NIH syndrome
NIH = Not Invented Here

Sometimes it's quite understandable:
• unclear methodology,
• irreproducible results,
• “wrong” theoretical approach.

But if the above is controlled for, NIH remains an unreflective bad habit with no
justification.


Let's fight that: FreeDict '09
We've already tried once here, with meagre success within this community:

Bański + Wójtowicz (AfLaT 2009). A Repository of Free Lexical Resources for
African Languages: The Project and the Method.

http://freedict.org

BTW, FreeDict is doing well, and our invitation stands.

As far as African languages are concerned:
• we have added a small Swahili-Polish dictionary, and
• the Swahili-English dictionary is being upgraded with Arabic-script variants of
headwords.


Let's try again: OCTC '10
OCTC = Open-Content Text Corpus

This is our remedy for the difficulties in giving and the difficulties in taking:
• a platform to store your data securely, and
• to make it possible for you to have others improve on it, for mutual benefit;
• a platform to distribute your research,
• to present your methodology transparently, and
• without injecting your “wrong” theory into the data;
• a platform to give you new perspectives for research, and
• to let you get engaged in cooperation with other researchers in your area.

(Lots of promises, let's see.)

http://OCTC.sourceforge.net/


Gross structure of the OCTC
Modules:
• lg/ – monolingual subcorpora (e.g. Polish, Swahili, etc.)
• align/ – aligned bi/multilingual parallel subcorpora (e.g. Polish-Swahili, etc.)
• core/ – the trunk of the corpus (schemas, tools, corpus-level metadata)

For the time being, the OCTC contains “seeds” (minimal subcorpora in the form of
the Universal Declaration of Human Rights) for 55 languages.

→ this is very important, because these seeds are where individual researchers
can continue from:
• they have a working model to base the format of their data on,
• they are likely to see tools that will operate on what they contribute from day 1.

In some cases, the seeds have already grown (the Polish and Swahili subcorpora,
Czech data coming soon).

The UDHR subcorpus is at the same time a parallel subcorpus (!)


Format of the OCTC
The OCTC is encoded in XML,
following the recommendations of the Text Encoding Initiative (TEI)

TEI:
• a de facto standard for text encoding in the Humanities, ranging from
manuscript encoding through dictionary encoding to corpus encoding and
archive encoding (e.g. in Project Gutenberg).

• The first corpus encoded in the TEI: BNC (British National Corpus);
• Many others followed, including one we are involved in, namely the NCP/NKJP
(National Corpus of Polish – roughly one billion (10^9) segments, already
available for searching);
• Piotr was the XML architect for the NCP; the format of the OCTC is an
extension of his ideas and the input of the NCP team.
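
As a rough illustration of what "XML following the TEI recommendations" means in practice, here is a minimal sketch that builds a bare TEI skeleton. The element names and the namespace are those of generic TEI P5; the actual OCTC schema extends this and is not reproduced here, and the function name and placeholder strings are our assumptions.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace


def tei(tag: str) -> str:
    """Qualify a tag name with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"


def build_minimal_tei(title: str, paragraphs: list[str], lang: str) -> ET.Element:
    """Build a generic TEI skeleton (hypothetical helper, not an OCTC tool)."""
    root = ET.Element(tei("TEI"))

    # The header carries the metadata: title, publication and source notes.
    header = ET.SubElement(root, tei("teiHeader"))
    file_desc = ET.SubElement(header, tei("fileDesc"))
    title_stmt = ET.SubElement(file_desc, tei("titleStmt"))
    ET.SubElement(title_stmt, tei("title")).text = title
    publication = ET.SubElement(file_desc, tei("publicationStmt"))
    ET.SubElement(publication, tei("p")).text = "placeholder publication note"
    source = ET.SubElement(file_desc, tei("sourceDesc"))
    ET.SubElement(source, tei("p")).text = "placeholder source note"

    # The text body holds the (nearly un-annotated) source paragraphs.
    text = ET.SubElement(root, tei("text"))
    text.set(XML_LANG, lang)
    body = ET.SubElement(text, tei("body"))
    for paragraph in paragraphs:
        ET.SubElement(body, tei("p")).text = paragraph
    return root


document = build_minimal_tei("Sample seed text", ["Watu wote wamezaliwa huru"], "sw")
xml_string = ET.tostring(document, encoding="unicode")
```

A seed in the corpus is richer than this, but a contributor starting from an existing seed would recognise the same header/text split.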


OCTC for Africanists (I)
Hosted on SourceForge:
• a stable environment for creation and distribution of data and tools,
• with nearly 20 mirrors world-wide (a disk crash is not a problem...)
→ your data is safe here, and you don't need to worry about the distribution.

Open-source tools, open-content data
→ no methodological doubts, you can see what has been done (data manipulation
is transparent); the accompanying tools are open-source as well.

Version-control mechanism (Subversion)
→ it is possible to take snapshots of the corpus before any measurements are
performed (reproducibility of results!)
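
One way to make use of such snapshots is to store the Subversion revision number next to every measurement. The sketch below assumes that `svn info` prints a line of the form `Revision: N` (as it does in an English locale); the function names and the record layout are hypothetical.

```python
import re
import subprocess


def parse_revision(svn_info_output: str) -> int:
    """Extract the revision number from the text printed by `svn info`."""
    match = re.search(r"^Revision:\s*(\d+)\s*$", svn_info_output, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no 'Revision:' line found in svn info output")
    return int(match.group(1))


def current_revision(working_copy: str = ".") -> int:
    """Ask the svn client for the revision of a working copy (needs svn installed)."""
    result = subprocess.run(
        ["svn", "info", working_copy],
        capture_output=True, text=True, check=True,
    )
    return parse_revision(result.stdout)


def measurement_record(name: str, value: float, revision: int) -> dict:
    """Bundle a measurement with the corpus revision it was taken on."""
    return {"measurement": name, "value": value, "corpus_revision": revision}


# Example with canned output, so the sketch runs without a working copy.
sample_output = "Path: .\nRevision: 123\nNode Kind: directory\n"
record = measurement_record("token_count", 1500.0, parse_revision(sample_output))
```

Anyone wishing to repeat the measurement can then check out exactly that revision (`svn checkout -r 123 ...`) and work on the same data.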


OCTC for Africanists (II)

Stand-off annotation format:
• the source text is kept separate, in a nearly un-annotated form,
• annotations (in separate files) form layers over the source data; they provide
views of the data;
• there is thus no danger of “polluting” the source data with the “wrong” theory –
you can always add your own segmentation, your own layer of POS markup,
your favourite syntactic modelling, etc...
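
The stand-off idea above can be sketched in a few lines: the source string is never modified, and each annotation layer is just a list of character ranges pointing into it. This is a simplified illustration, not the OCTC's TEI-based format; the class and function names and the tag labels are placeholders of ours.

```python
from dataclasses import dataclass

# The source text stays untouched; in a corpus it would live in its own file.
SOURCE_TEXT = "Watu wote wamezaliwa huru"


@dataclass(frozen=True)
class Span:
    """One annotation: a character range of the source plus a label."""
    start: int  # inclusive offset into the source text
    end: int    # exclusive offset into the source text
    label: str


def tokenize_layer(text: str) -> list[Span]:
    """Build a whitespace segmentation layer without modifying the text."""
    spans = []
    position = 0
    for word in text.split():
        start = text.index(word, position)
        end = start + len(word)
        spans.append(Span(start, end, "w"))
        position = end
    return spans


def view(text: str, layer: list[Span]) -> list[tuple[str, str]]:
    """Resolve a layer against the source: (covered text, label) pairs."""
    return [(text[span.start:span.end], span.label) for span in layer]


segmentation = tokenize_layer(SOURCE_TEXT)

# A second, independent layer over the same source. The labels are
# placeholders, not a vetted linguistic analysis.
placeholder_tags = ["TAG-A", "TAG-B", "TAG-C", "TAG-D"]
pos_layer = [Span(s.start, s.end, tag) for s, tag in zip(segmentation, placeholder_tags)]
```

Two researchers who disagree about the tagset can each keep their own `pos_layer`; neither has to touch `SOURCE_TEXT` or the other's layer.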

Licensing (GNU General Public License)
• a guarantee that your data will remain free and
• that you will be able to use it after others have improved/enlarged it
(the same is true about your tools).


A wealth of research directions
• just data storage
• data interactions (alignment)
• tool testing/training
• methodology testing (e.g. memory-based approaches)
◦ taggers,
◦ sentencers,
◦ segmenters
• research/programming environment testing: UIMA, GATE, ...?
• applications in
◦ lexicography,
◦ machine translation,
◦ translation aids (see below),
◦ information extraction,
◦ etc., etc. (whatever corpora are good for)


Licensing issues
GNU General Public License:
• it's free as in “free speech”
• you can do anything with a GPL-ed resource, on one condition:
• the GPL gene stays in the family (once it's made free, the resource, and its
descendants, have to remain free)

This is the only sensible way to handle fragile resources from “non-central”,
“low-density”, “under-represented” languages. Otherwise
• data might be lost irretrievably,
• methods of creating it are not transparent, and therefore experiments are not
reproducible.

When GPL-ed, data has a chance to get refined and come back to you for
further processing.


GNU GPL questions
Will they sell my work??
• will they? after all, people can get it for free
• but if they do, they are bound to preserve the authorship information

But I am entitled to make a living on this!
• by all means – if you can sell it, sell it, but perhaps consider releasing an earlier
version of your resources under the GPL
• (why?) for the sake of those who speak the language whose data you are able
to sell, so that the language also gets support and maybe grows its own
specialists, thanks partly to the data you release.

How else will I earn money?
• there's more than one way (consulting, making your name known in the
community: academic/industry career).

I want to co-operate with a large company
• but then you do not control your data's destiny – your data and effort may get
lost due to a company decision: see Scannell (2008) on Irish support dropped by
Apple and Dzongkha support dropped by Microsoft.

Community-building issues

SourceForge has all the community-building facilities one could dream of:
• mailing lists
• bulletin board
• bug/issue tracker
• project newsfeed
• wiki

Administration:
• we want to stay in the background, giving control of individual subprojects to the
particular subcommunities;
• we want to watch over the format and over the licensing, and that is basically all


The Omega adventure – recipe
• take one small FreeDict dictionary
(Swahili-Polish, edited years ago by Beata Wójtowicz, which remained a data
island for several years until I converted it into TEI for FreeDict)

→ prepare a flat TSV glossary

• take one OCTC aligned text for the same pair of languages
(Swahili-Polish aligned text of the Universal Declaration of Human Rights)

→ convert it to a TMX bitext (using an OCTC tool; all of that can be verified in the SVN repository)

• plug both into OmegaT+ – an open-source translation aid

→ see the next slide for a little demo:
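
Before the demo, the two data-preparation steps of the recipe can be sketched as follows. This is not the OCTC conversion tool itself (that one lives in the SVN); it is a minimal illustration assuming a tab-separated glossary and a TMX 1.4 file with one translation unit per aligned pair. The function names, the header values and the sample entries are ours, and the segment strings are placeholders.

```python
import csv
import io
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def glossary_to_tsv(entries: list[tuple[str, str]]) -> str:
    """Serialize (source term, target term) pairs as a flat TSV glossary."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(entries)
    return buffer.getvalue()


def pairs_to_tmx(pairs: list[tuple[str, str]], src_lang: str, tgt_lang: str) -> str:
    """Wrap aligned segment pairs in a minimal TMX 1.4 document."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "octc-sketch",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "unknown",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source_segment, target_segment in pairs:
        unit = ET.SubElement(body, "tu")
        for lang, segment in ((src_lang, source_segment), (tgt_lang, target_segment)):
            variant = ET.SubElement(unit, "tuv", {XML_LANG: lang})
            ET.SubElement(variant, "seg").text = segment
    return ET.tostring(tmx, encoding="unicode")


# Illustrative entries only, not taken from the FreeDict dictionary.
glossary_tsv = glossary_to_tsv([("kitabu", "książka"), ("maji", "woda")])
# Placeholder segments standing in for aligned UDHR sentences.
bitext_tmx = pairs_to_tmx([("SW segment 1", "PL segment 1")], "sw", "pl")
```

Whether a given translation tool accepts exactly this header is something to check against its documentation; the point is that both inputs are small, transparent text files.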


Proof of concept: OmegaT+


Conclusion

Perspectives: open-ended!

We'd like to co-operate with people whom we more-or-less know, so...

...talk to us during the break, ok? :-)


Thank you

http://OCTC.sourceforge.net/

