Towards The Manding Corpus: Texts Selection Principles and Metatext Markup

© Artem Davydov

Transcript

  • 1. Towards The Manding Corpus: Texts Selection Principles and Metatext Markup. Artem Davydov, St. Petersburg State University, Faculty of Oriental and African Studies. Paper presented at the AfLaT workshop, Malta, May 18, 2010.
  • 2.  
  • 3. Text selection principles
    • Two criteria proposed by Sinclair (2005):
    • external criterion (takes into consideration the communicative function of a text)
    • internal criterion (reflects details of the language of the text)
    • Sinclair’s conclusion: the contents of a corpus should be selected without regard to the language they contain
  • 4. Oral texts
    • Multilingualism:
    • Bamana-French code-switching
    • unadapted loanwords
    • Two groups of oral texts to be distinguished
    • spontaneously generated texts;
    • texts that were first written and then read aloud.
  • 5. Written texts
    • published folklore texts
    • fiction books
    • the press
    • educational and religious literature
    • N’ko publications
  • 6. Fiction books
    • the language they contain is often highly “unnatural”
    • a working method typical of Malian writers: writing first in French, then translating into Bamana
  • 7. N’ko publications
    • are mostly in Maninka, but Bamana texts are also available
    • tones are already indicated
    • can be automatically converted to Roman script
    • the written language has diverged significantly from the spoken language
  • 8. The Web
    • a predominantly oral language
    • only a few resources are available online
    • the quality of online resources is low
  • 9. Some preliminary conclusions
    • Certain types of texts raise doubts, but under existing conditions it would be too wasteful to reject any text data.
    • Every text must be provided with detailed metadata so that users can single out sub-corpora.
  • 10. Metatext Markup
    • Groups of parameters:
    • text metadata
    • data about the author
    • writing system
    • genre
    • technical metadata
  • 11. Text metadata
    • title of the text
    • date of creation
    • original text or translation
    • for translations: the language of the original
    • channel: written, oral, Internet
  • 12. Data about the author
    • name
    • gender
    • age
    • native language(s)
    • places of predominant linguistic socialization
  • 13. Writing system
    • Old Malian Bamana orthography
    • New Malian Bamana orthography
    • French-based spontaneous orthography (“colonial”): ou = /u/, gn = /ɲ/, acute accents on final vowels, etc.
    • The Adjami writing system based on the Arabic alphabet.
    • The N’ko writing system
    • Other Roman-based orthographies.
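
The French-based conventions listed above suggest how texts in the "colonial" spelling might be normalized before they are compared with texts in other orthographies. The following is only a minimal sketch: the function name `normalize_colonial` and the rule table are hypothetical, the rules cover just the two digraphs named on the slide, and dropping the final acute accent (rather than mapping it to some other vowel) is an assumption. Real texts would need a much larger, linguistically checked rule set.

```python
import unicodedata

# Hypothetical rule table: only the two digraphs named on the slide.
DIGRAPHS = {"ou": "u", "gn": "ɲ"}

ACUTE = "\u0301"  # combining acute accent


def normalize_colonial(word: str) -> str:
    """Rewrite one French-based spelling toward a phonemic form (sketch)."""
    # Lowercase and compose first so that the digraph rules match reliably.
    out = unicodedata.normalize("NFC", word.lower())
    for digraph, phoneme in DIGRAPHS.items():
        out = out.replace(digraph, phoneme)
    # Assumption: an acute accent on the final vowel is simply dropped.
    if out:
        last = unicodedata.normalize("NFD", out[-1])
        if ACUTE in last:
            out = out[:-1] + last.replace(ACUTE, "")
    return unicodedata.normalize("NFC", out)


if __name__ == "__main__":
    # Illustrative inputs only; "moussé" is a made-up spelling.
    for spelling in ["Bougouni", "Kignan", "moussé"]:
        print(spelling, "->", normalize_colonial(spelling))
```

Blind string replacement will misfire wherever "ou" or "gn" is not a digraph, so such a step is better treated as a search aid than as a replacement for the original spelling, which should stay in the corpus alongside the writing-system tag.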
  • 14. Genres/Subject matters
    • Folklore: Popular Tales, Anecdotes, Epics, Proverbs, Traditional Songs, Koteba (Traditional Theatre), Riddles.
    • Fiction: Prose, Movie Scripts, Theatre, Modern Poetry, Popular Song Lyrics.
    • Religious: Christian, Islamic, Other.
    • Educational: Formal Education, Vulgarization, Popular Science.
    • Academic Writings: History, Linguistics.
    • Personal communication: Personal Records, Correspondence, Dialogs.
    • Information: Advertising, News, Column, Narrative, Interview, Public Speech.
  • 15. Technical metadata
    • Finally, to track corpus updates, each text is to be provided with the following information:
    • the name of the project member who added the text to the corpus;
    • the date the text was added to the corpus.
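
The parameter groups from slides 10 to 15 can be pictured as one record per text. The sketch below is an illustration only, not the project's actual markup format: the class names `Author` and `TextRecord`, the field names, and the two validation rules are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

# Channel values as listed on slide 11.
CHANNELS = {"written", "oral", "internet"}


@dataclass
class Author:
    """Data about the author (slide 12); most fields may be unknown."""
    name: str
    gender: Optional[str] = None
    age: Optional[int] = None
    native_languages: List[str] = field(default_factory=list)
    socialization_places: List[str] = field(default_factory=list)


@dataclass
class TextRecord:
    """One corpus text with its metatext markup (hypothetical layout)."""
    title: str
    author: Author
    channel: str                 # one of CHANNELS
    writing_system: str          # e.g. "New Malian Bamana orthography"
    genre: str                   # e.g. "Folklore"
    subgenre: str                # e.g. "Proverbs"
    added_by: str                # technical metadata: project member
    added_on: date               # technical metadata: date of addition
    date_of_creation: Optional[str] = None   # free-form, may be approximate
    is_translation: bool = False
    original_language: Optional[str] = None  # only for translations

    def __post_init__(self) -> None:
        if self.channel not in CHANNELS:
            raise ValueError(f"unknown channel: {self.channel!r}")
        if self.is_translation and not self.original_language:
            raise ValueError("a translation needs the language of the original")


# Placeholder values, not real corpus data.
example = TextRecord(
    title="Sample tale",
    author=Author(name="Unknown"),
    channel="oral",
    writing_system="New Malian Bamana orthography",
    genre="Folklore",
    subgenre="Popular Tales",
    added_by="project member",
    added_on=date(2010, 5, 18),
)
```

Keeping the author as a separate object means that one author's data can be shared by several texts, which is one possible design choice among others.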
  • 16. Conclusion
    • The suggested metatext markup system will let users create sub-corpora with specified parameters. It will also help to control how the corpus is filled with new text data and to estimate the balance of the corpus.
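
As a rough illustration of the two uses named in the conclusion, the sketch below selects a sub-corpus by metadata values and estimates balance as each genre's share of the texts. The helper names `select` and `balance`, the dictionary layout, and the four sample records are all invented for the example. A real balance estimate would more plausibly count tokens than texts.

```python
from collections import Counter
from typing import Dict, List

Record = Dict[str, str]

# Invented sample records, not real corpus data.
SAMPLE: List[Record] = [
    {"title": "Text A", "genre": "Folklore", "channel": "oral"},
    {"title": "Text B", "genre": "Folklore", "channel": "written"},
    {"title": "Text C", "genre": "Information", "channel": "written"},
    {"title": "Text D", "genre": "Fiction", "channel": "written"},
]


def select(records: List[Record], **criteria: str) -> List[Record]:
    """Return the records whose metadata match every given key/value pair."""
    return [
        r for r in records
        if all(r.get(key) == value for key, value in criteria.items())
    ]


def balance(records: List[Record], key: str = "genre") -> Dict[str, float]:
    """Share of texts per value of `key` (counted by text, not by token)."""
    counts = Counter(r.get(key, "unknown") for r in records)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()} if total else {}


if __name__ == "__main__":
    print([r["title"] for r in select(SAMPLE, genre="Folklore", channel="oral")])
    print(balance(SAMPLE))
```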
  • 17. Thank you!