Thamizhi-Language Processing Tools
Kengatharaiyer Sarveswaran (Sarves)
sarves@cse.mrt.ac.lk
Department of Computer Science and Engineering
University of Moratuwa, Sri Lanka.
December 12, 2020
Sarves, NLPC, University of Moratuwa Thamizhi-LPTs December 12, 2020 1 / 10
Overview
Thamizhi-Preprocessor
ThamizhiPOSt: Tamil POS Tagger
ThamizhiMorph: Tamil Morphological Analyser/Generator
ThamizhiUDp: Tamil Universal Dependency Parser
ThamizhiLFG: Computational Grammar for Tamil using LFG
What we need
Acknowledgement
Thamizhi-Preprocessor
Validate words using Nanool grammar
Normalise Unicode points
க ,ெ ,ா, க, ் ,க ,ு -> க , ொ, க ,், க ,ு
Home page:
http://nlp-tools.uom.lk/thamizhi-preprocessor/
How to use:
Download the script from the site, then run:
python3 thamizhi-preprocessor.py -validate word-to-be-validated
python3 thamizhi-preprocessor.py -normalise file-to-be-normalised
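The "Normalise Unicode points" step above can be illustrated with Python's standard library. This is a minimal sketch, assuming the normalisation in question is equivalent to Unicode NFC composition for the split vowel sign shown in the example; the actual preprocessor script may apply further rules. The function name `normalise_tamil` is hypothetical.

```python
import unicodedata


def normalise_tamil(text: str) -> str:
    """Compose two-part vowel signs into single code points using NFC.

    Assumption: NFC covers the case shown on the slide; the real
    thamizhi-preprocessor.py may do more than this.
    """
    return unicodedata.normalize("NFC", text)


# The slide's example word, typed with a split vowel sign:
# க + ெ + ா + க + ் + க + ு  (7 code points)
decomposed = "\u0b95\u0bc6\u0bbe\u0b95\u0bcd\u0b95\u0bc1"
composed = normalise_tamil(decomposed)

# Print the code points before and after normalisation.
print([f"U+{ord(ch):04X}" for ch in decomposed])
print([f"U+{ord(ch):04X}" for ch in composed])
```

After normalisation the pair ெ (U+0BC6) + ா (U+0BBE) is replaced by the single sign ொ (U+0BCA), so the two spellings of the word compare equal.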
ThamizhiPOSt: Tamil POS Tagger
Harmonised the BIS [1], Amrita [2], and UPOS [3] tagsets
Used Universal POS Tagset
Trained the POS tagger using Stanza
Trained using Amrita data (mapped to UPOS)
F1 score - 93.27 (Nov, 2020)
Trained models and POS tagged data are available for download
Home page:
http://nlp-tools.uom.lk/thamizhi-pos/
How to use:
python3 thamizhi-post.py "input-file"
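Mapping a source tagset to UPOS, as described above, amounts to a lookup per token. The sketch below is illustrative only: the tag pairs are a small assumed subset using BIS-style labels, not the authors' harmonisation table, and `BIS_TO_UPOS` and `to_upos` are hypothetical names.

```python
# Assumed, illustrative subset of a BIS-to-UPOS mapping.
# This is NOT the harmonisation table used for ThamizhiPOSt.
BIS_TO_UPOS = {
    "N_NN": "NOUN",
    "N_NNP": "PROPN",
    "PR_PRP": "PRON",
    "V_VM": "VERB",
    "JJ": "ADJ",
    "RB": "ADV",
    "PSP": "ADP",
}


def to_upos(tagged, mapping=BIS_TO_UPOS, fallback="X"):
    """Map (word, source_tag) pairs to (word, UPOS) pairs.

    Tags missing from the mapping fall back to UPOS "X" so that
    unmapped cases remain visible for manual review.
    """
    return [(word, mapping.get(tag, fallback)) for word, tag in tagged]


# "He came", tagged with the assumed source labels.
print(to_upos([("அவன்", "PR_PRP"), ("வந்தான்", "V_VM")]))
```

Using an explicit fallback rather than raising an error keeps a partially mapped corpus usable while the mapping is still being reviewed.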
[1] tdil-dc.in/tdildcMain/articles/134692Draft%20POS%20Tag%20standard.pdf
[2] www.amrita.edu/publication/tamil-pos-tagging-using-linear-programming
[3] universaldependencies.org/u/pos/
ThamizhiMorph: Morphological Analyser/Generator
Rule-based (Finite-State Transducer) implementation
Implemented using foma [4]
Handles Verbs, Nouns, and other particles
Generates all analyses
Can be used for morph segmentation
வந்தான் வா|+verb|+fin|+sim|+strong|+past=(ந்)த்|+3sgm=ஆன்)
All the models, data and scripts are available
Home page:
http://nlp-tools.uom.lk/thamizhi-morph/
How to use:
python3 thamizhi-morph.py "input-file"
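An analysis string like the one shown for வந்தான் can be split into a lemma, a list of tags, and the surface segments attached to some tags. This is a sketch under the assumption that the output format is `lemma|+tag|+tag=surface|...`, inferred from that single example; `Morph` and `parse_analysis` are hypothetical names, and surface strings are kept verbatim.

```python
from typing import List, NamedTuple, Optional, Tuple


class Morph(NamedTuple):
    tag: str                 # e.g. "verb", "past", "3sgm"
    surface: Optional[str]   # surface segment after "=", if any


def parse_analysis(analysis: str) -> Tuple[str, List[Morph]]:
    """Split one analysis into its lemma and tagged morphs.

    Assumed format: fields separated by "|", the first field is the
    lemma, later fields start with "+" and may carry "=surface".
    """
    lemma, *parts = analysis.split("|")
    morphs = []
    for part in parts:
        tag, _, surface = part.lstrip("+").partition("=")
        morphs.append(Morph(tag, surface or None))
    return lemma, morphs


lemma, morphs = parse_analysis("வா|+verb|+fin|+sim|+strong|+past=(ந்)த்|+3sgm=ஆன்)")
print(lemma)
for morph in morphs:
    print(morph.tag, morph.surface)
```

The tags that carry a surface segment (here the tense and person-number-gender markers) are the ones that give the morph segmentation of the word.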
[4] fomafst.github.io/
ThamizhiUDp: Universal Dependency Parser 1/2
Hybrid approach
Multilingual Learning (with Hindi/Turkish/Telugu) for Parsing
Labelled Attachment Score (LAS) - 62.39
All the data, models and scripts are available
Step                   Tool           Dataset
Tokenisation           Stanza         Tamil UDT
Multi-word tokeniser   Stanza         Tamil UDT
Lemmatisation          Stanza         Tamil UDT
POS tagging            ThamizhiPOSt   Amrita Data
Morphological tagging  ThamizhiMorph  Rule-based
Dependency parsing     uuparser       UDT Hindi/Tamil
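The labelled score reported above is the standard parsing metric usually called the Labelled Attachment Score (LAS): the percentage of tokens whose predicted head and dependency label both match the gold annotation. The sketch below computes it, together with the unlabelled variant (UAS), on toy data; the function name and the example values are hypothetical and unrelated to the reported result.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) as percentages.

    gold and predicted are equal-length lists of (head, deprel)
    pairs, one pair per token.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("no tokens to score")
    total = len(gold)
    # UAS counts tokens with the correct head only.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS counts tokens with the correct head and the correct label.
    las_hits = sum(g == p for g, p in zip(gold, predicted))
    return 100 * uas_hits / total, 100 * las_hits / total


# Toy four-token sentence: one wrong label, one wrong head.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (3, "punct")]
uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.2f}, LAS = {las:.2f}")
```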
Home page:
http://nlp-tools.uom.lk/thamizhi-udp/
How to use:
./parse.sh "input-file"
Note: Input file should be in CoNLL-U format.
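For readers unfamiliar with CoNLL-U: each token is one line of ten tab-separated fields, comment lines start with "#", and a blank line ends a sentence. The reader below is a minimal sketch of that layout; it ignores details such as multi-word token ranges, and the sample input with unfilled ("_") columns is an assumption about what the parser accepts. `read_conllu` is a hypothetical helper, not part of ThamizhiUDp.

```python
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def read_conllu(text):
    """Parse CoNLL-U text into a list of sentences (lists of dicts)."""
    sentences, tokens = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                sentences.append(tokens)
                tokens = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment, e.g. "# text = ..."
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}")
        tokens.append(dict(zip(CONLLU_FIELDS, columns)))
    if tokens:
        sentences.append(tokens)
    return sentences


sample = (
    "# text = அவன் வந்தான்\n"
    "1\tஅவன்\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "2\tவந்தான்\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "\n"
)
sentences = read_conllu(sample)
print([token["form"] for token in sentences[0]])
```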
ThamizhiUDp: Universal Dependency Parser 2/2
Modern Written Tamil Treebank (MWTT):
https://github.com/UniversalDependencies/UD_Tamil-MWTT/tree/master
Joint work with Dr. K. Parameswari
ThamizhiLFG: Computational Grammar for Tamil
An initial version is available, covering 160 sentences (ParGram [5] + Grade-1 Tamil textbook)
Covers simple intransitive, transitive, and ditransitive sentences, and conjunctions
Limited vocabulary; ThamizhiMorph will be integrated
Hosted on the INESS site
How to use: https://clarino.uib.no/iness/xle-web
[5] https://pargram.w.uib.no/
What we need:
People with linguistic knowledge to review tools/annotated data
Benchmark data-sets for evaluation
Acknowledgement
Supervisors:
Prof. Gihan Dias, University of Moratuwa
Prof. Miriam Butt, University of Konstanz
Collaborators:
Dr. K. Parameswari, University of Hyderabad
Ms. S. Rajamathangi, Jawaharlal Nehru University
Scholars who have provided valuable inputs:
Prof. S. Rajendren, Prof. S. Ramesh, colleagues at NLPC
Most of this work was supported by the Accelerating Higher
Education Expansion and Development (AHEAD) Operation of the
Ministry of Higher Education, Sri Lanka, funded by the World Bank, and
by the DAAD (German Academic Exchange Service).