1. Thamizhi-Language Processing Tools
Kengatharaiyer Sarveswaran (Sarves)
sarves@cse.mrt.ac.lk
Department of Computer Science and Engineering
University of Moratuwa, Sri Lanka.
December 12, 2020
Sarves, NLPC, University of Moratuwa Thamizhi-LPTs December 12, 2020 1 / 10
2. Overview
Thamizhi-Preprocessor
ThamizhiPOSt: Tamil POS Tagger
ThamizhiMorph: Tamil Morphological Analyser/Generator
ThamizhiUDp: Tamil Universal Dependency Parser
ThamizhiLFG: Computational Grammar for Tamil using LFG
What we need
Acknowledgement
3. Thamizhi-Preprocessor
Validate words using Nannool grammar
Normalise Unicode points
e.g. க, ெ, ா, க, ், க, ு -> க, ொ, க, ், க, ு (ெ + ா merged into ொ)
Home page:
http://nlp-tools.uom.lk/thamizhi-preprocessor/
How to use:
- Download the script from the site, then run:
python3 thamizhi-preprocessor.py -validate word-to-be-validated
python3 thamizhi-preprocessor.py -normalise file-to-be-normalised
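The normalisation example on this slide can be reproduced with Python's standard library. This is only a sketch of canonical composition (Unicode NFC), under the assumption that it covers the same case; it is not the implementation inside thamizhi-preprocessor.py, and the function name `normalise_tamil` is hypothetical.

```python
import unicodedata

def normalise_tamil(text: str) -> str:
    """Compose split vowel signs into single code points (Unicode NFC)."""
    return unicodedata.normalize("NFC", text)

# KA + VOWEL SIGN E + VOWEL SIGN AA + KA + VIRAMA + KA + VOWEL SIGN U
raw = "\u0b95\u0bc6\u0bbe\u0b95\u0bcd\u0b95\u0bc1"
clean = normalise_tamil(raw)

# List the code points before and after normalisation
print([f"U+{ord(c):04X}" for c in raw])
print([f"U+{ord(c):04X}" for c in clean])
```

The two lists differ only where U+0BC6 followed by U+0BBE is replaced by the single sign U+0BCA, so the normalised string is one code point shorter.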
4. ThamizhiPOSt: Tamil POS Tagger
Harmonised the BIS [1], Amrita [2], and UPOS [3] tagsets
Used Universal POS Tagset
Trained the POS tagger using Stanza
Trained using Amrita data (mapped to UPOS)
F1 score: 93.27 (November 2020)
Trained models and POS tagged data are available for download
Home page:
http://nlp-tools.uom.lk/thamizhi-pos/
How to use:
python3 thamizhi-post.py "input-file"
[1] tdil-dc.in/tdildcMain/articles/134692Draft%20POS%20Tag%20standard.pdf
[2] www.amrita.edu/publication/tamil-pos-tagging-using-linear-programming
[3] universaldependencies.org/u/pos/
5. ThamizhiMorph: Morphological Analyser/Generator
Rule-based (Finite-State Transducer) implementation
Implemented using foma [4]
Handles verbs, nouns, and particles
Generates all analyses
Can be used for morph segmentation
வந்தான் -> வா|+verb|+fin|+sim|+strong|+past=(ந்)த்|+3sgm=ஆன்)
All the models, data and scripts are available
Home page:
http://nlp-tools.uom.lk/thamizhi-morph/
How to use:
python3 thamizhi-morph.py "input-file"
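To show how an analysis string like the one on this slide could be used for morph segmentation, here is a minimal sketch. It assumes the format seen in the single example: fields separated by `|`, the lemma first, and `tag=surface` where a tag maps to a surface morph. The helper name `split_analysis` is hypothetical and is not part of ThamizhiMorph. The trailing parenthesis shown on the slide is left out of the input here, on the assumption that it is a typesetting artefact.

```python
from typing import List, Optional, Tuple

def split_analysis(analysis: str) -> Tuple[str, List[Tuple[str, Optional[str]]]]:
    """Split 'lemma|+tag|+tag=surface' into the lemma and (tag, surface) pairs."""
    lemma, *fields = analysis.split("|")
    pairs = []
    for field in fields:
        # A field is either '+tag' or '+tag=surface'
        tag, sep, surface = field.partition("=")
        pairs.append((tag, surface if sep else None))
    return lemma, pairs

lemma, pairs = split_analysis("வா|+verb|+fin|+sim|+strong|+past=(ந்)த்|+3sgm=ஆன்")
print(lemma)
for tag, surface in pairs:
    print(tag, surface)
```

Tags that carry a surface form (here `+past` and `+3sgm`) mark the morph boundaries; tags without one are returned with `None`.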
[4] fomafst.github.io/
6. ThamizhiUDp: Universal Dependency Parser 1/2
Hybrid approach
Multilingual Learning (with Hindi/Turkish/Telugu) for Parsing
Labelled Attachment Score (LAS): 62.39
All the data, models and scripts are available
Step                    Tool            Dataset
Tokenisation            Stanza          Tamil UDT
Multi-word tokeniser    Stanza          Tamil UDT
Lemmatisation           Stanza          Tamil UDT
POS tagging             ThamizhiPOSt    Amrita Data
Morphological tagging   ThamizhiMorph   Rule-based
Dependency parsing      uuparser        UDT Hindi/Tamil
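LAS, the labelled attachment score reported on this slide, is conventionally the percentage of tokens whose predicted head and dependency relation both match the gold annotation. The sketch below illustrates that definition only; the function name and the toy sentence are invented for illustration, and this is not the evaluation script used for ThamizhiUDp.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency relation) for one token

def labelled_attachment_score(gold: List[Arc], predicted: List[Arc]) -> float:
    """Percentage of tokens whose predicted head and relation both match gold."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        return 0.0
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return 100.0 * correct / len(gold)

# Toy four-token sentence (invented for illustration)
gold = [(2, "nsubj"), (0, "root"), (4, "amod"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "amod"), (2, "obl")]
score = labelled_attachment_score(gold, pred)
print(score)
```

In the toy sentence the third token has the wrong head and the fourth has the wrong relation, so only two of the four tokens count as correct.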
Home page:
http://nlp-tools.uom.lk/thamizhi-udp/
How to use:
./parse.sh "input-file"
Note: Input file should be in CoNLL-U format.
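For readers unfamiliar with CoNLL-U, each token occupies one line with ten tab-separated columns, comment lines start with `#`, and a blank line ends a sentence. The reader below is a minimal sketch for checking an input file before parsing. The function name `read_conllu` and the two-word sample sentence are assumptions made for illustration, and which columns parse.sh actually requires to be filled is not stated in the slides.

```python
from typing import Dict, List

FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
          "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

def read_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Parse CoNLL-U text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):
            continue  # sentence-level comment
        else:
            columns = line.split("\t")
            if len(columns) != len(FIELDS):
                raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
            current.append(dict(zip(FIELDS, columns)))
    if current:
        sentences.append(current)
    return sentences

# Invented sample: unannotated columns are written as "_"
sample = (
    "# text = அவன் வந்தான்\n"
    "1\tஅவன்\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "2\tவந்தான்\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "\n"
)
sentences = read_conllu(sample)
print(len(sentences), [tok["FORM"] for tok in sentences[0]])
```

A line with the wrong number of columns raises a `ValueError`, which catches the common mistake of using spaces instead of tabs.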
7. ThamizhiUDp: Universal Dependency Parser 2/2
Modern Written Tamil Treebank (Tamil-MWTT):
https://github.com/UniversalDependencies/UD_Tamil-MWTT/tree/master
Joint work with Dr. K. Parameswari
8. ThamizhiLFG: Computational Grammar for Tamil
An initial version covering 160 sentences (ParGram [5] + Grade-1 Tamil textbook) is available
Covers simple intransitive, transitive, and ditransitive sentences, and conjunctions
Vocabulary is limited; ThamizhiMorph will be integrated
Hosted on the INESS site
How to use: https://clarino.uib.no/iness/xle-web
[5] https://pargram.w.uib.no/
9. What we need:
People with linguistic knowledge to review tools/annotated data
Benchmark datasets for evaluation
10. Acknowledgement
Supervisors:
Prof. Gihan Dias, University of Moratuwa
Prof. Miriam Butt, University of Konstanz
Collaborators:
Dr. K. Parameswari, University of Hyderabad
Ms. S. Rajamathangi, Jawaharlal Nehru University
Scholars who have provided valuable inputs:
Prof. S. Rajendren, Prof. S. Ramesh, colleagues at NLPC
Most of this work was supported by the Accelerating Higher
Education Expansion and Development (AHEAD) Operation of the
Ministry of Higher Education, Sri Lanka, funded by the World Bank, and
by the DAAD (German Academic Exchange Service).