




This document discusses tokenization and computational verb morphology for Setswana, a Bantu language with a disjunctive orthography. It presents an approach that combines two tokenization transducers with a morphological analyzer to tokenize Setswana text. Tested on a short Setswana text, the approach reached 93.6% accuracy when the automatically tokenized text was compared with a hand-tokenized version. The remaining errors involved longest matches that were either not valid tokens or did not allow morphological analysis. Overall, the results suggest that a precise tokenizer combined with a morphological analyzer can largely resolve the tokenization challenges posed by Setswana's disjunctive writing system.
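To make the longest-match idea concrete, here is a minimal Python sketch. It is not the transducer-based implementation described in the presentation. It assumes a hypothetical validator, `toy_is_valid`, standing in for the morphological analyzer, and a hand-listed toy set, `TOY_VERB_TOKENS`. The function name, the `max_span` limit, and the example phrase are illustrative choices. The sketch tries the longest run of orthographic words first and backs off to shorter runs when the validator rejects the candidate.

```python
from typing import Callable, List, Sequence


def longest_match_tokenize(
    words: Sequence[str],
    is_valid: Callable[[Sequence[str]], bool],
    max_span: int = 6,  # assumed upper bound on words per token (illustrative)
) -> List[str]:
    """Greedily group orthographic words into the longest validated spans."""
    tokens: List[str] = []
    i = 0
    while i < len(words):
        # Try the longest candidate first, then back off one word at a time.
        for j in range(min(len(words), i + max_span), i, -1):
            span = words[i:j]
            # A single word is always accepted as a fallback token.
            if j - i == 1 or is_valid(span):
                tokens.append(" ".join(span))
                i = j
                break
    return tokens


# Toy stand-in for a morphological analyzer: a hand-listed set of word
# sequences treated as one verbal token. Purely illustrative data.
TOY_VERB_TOKENS = {("ke", "a", "go", "rata")}


def toy_is_valid(span: Sequence[str]) -> bool:
    return tuple(span) in TOY_VERB_TOKENS


example = "ke a go rata thata".split()
tokens = longest_match_tokenize(example, toy_is_valid)
print(tokens)  # ['ke a go rata', 'thata']
```

In this sketch the five-word span is rejected, the four-word span is accepted as one verbal token, and the remaining word becomes its own token. A real system would replace the toy set with an actual analyzer lookup.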



