Thamizhi-Language Processing Tools
Kengatharaiyer Sarveswaran (Sarves)
sarves@cse.mrt.ac.lk
Department of Computer Science and Engineering
University of Moratuwa, Sri Lanka.
December 12, 2020
Sarves, NLPC, University of Moratuwa Thamizhi-LPTs December 12, 2020 1 / 10
Overview
Thamizhi-Preprocessor
ThamizhiPOSt: Tamil POS Tagger
ThamizhiMorph: Tamil Morphological Analyser/Generator
ThamizhiUDp: Tamil Universal Dependency Parser
ThamizhiLFG: Computational Grammar for Tamil using LFG
What we need
Acknowledgement
Thamizhi-Preprocessor
Validate words using Nanool grammar
Normalise Unicode points
க ,ெ ,ா, க, ் ,க ,ு -> க , ொ, க ,், க ,ு
Home page:
http://nlp-tools.uom.lk/thamizhi-preprocessor/
How to use:
Download the script from the site, then run:
python3 thamizhi-preprocessor.py -validate word-to-be-validated
python3 thamizhi-preprocessor.py -normalise file-to-be-normalised
ThamizhiPOSt: Tamil POS Tagger
Harmonised the BIS [1], Amrita [2] and UPOS [3] tagsets
Used Universal POS Tagset
Trained the POS tagger using Stanza
Trained using Amrita data (mapped to UPOS)
F1 score - 93.27 (Nov, 2020)
Trained models and POS tagged data are available for download
Home page:
http://nlp-tools.uom.lk/thamizhi-pos/
How to use:
python3 thamizhi-post.py "input-file"
[1] tdil-dc.in/tdildcMain/articles/134692Draft%20POS%20Tag%20standard.pdf
[2] www.amrita.edu/publication/tamil-pos-tagging-using-linear-programming
[3] universaldependencies.org/u/pos/
ThamizhiMorph: Morphological Analyser/Generator
Rule-based (Finite-State Transducer) implementation
Implemented using foma [4]
Handles Verbs, Nouns, and other particles
Generates all analyses
Can be used for morph segmentation
வந்தான் -> வா|+verb|+fin|+sim|+strong|+past=(ந்)த்|+3sgm=ஆன்)
All the models, data and scripts are available
Home page:
http://nlp-tools.uom.lk/thamizhi-morph/
How to use:
python3 thamizhi-morph.py "input-file"
[4] fomafst.github.io/
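To use an analysis for morph segmentation, the output string has to be split into a lemma, its tags and the surface morphs. The sketch below assumes the format seen in the single example on the slide: fields separated by "|", the lemma first, each tag prefixed with "+", and an optional "=surface" part. The real ThamizhiMorph output may differ, and the names Analysis and parse_analysis are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Analysis:
    lemma: str
    # Each entry is (tag, surface morph), with None when no morph is given.
    tags: List[Tuple[str, Optional[str]]]


def parse_analysis(line: str) -> Analysis:
    """Split one analysis string into a lemma and (tag, surface) pairs.

    Assumption: the "lemma|+tag|+tag=surface" layout is inferred from the
    slide's example only.
    """
    fields = line.strip().split("|")
    tags = []
    for field in fields[1:]:
        tag, sep, surface = field.partition("=")
        tags.append((tag.lstrip("+"), surface if sep else None))
    return Analysis(lemma=fields[0], tags=tags)


# Analysis string copied from the slide, including its trailing bracket.
example = "வா|+verb|+fin|+sim|+strong|+past=(ந்)த்|+3sgm=ஆன்)"
analysis = parse_analysis(example)

print(analysis.lemma)
for tag, surface in analysis.tags:
    print(tag, surface)
```

Tags that carry a surface morph (here the tense and person markers) mark the segment boundaries; the remaining tags are purely grammatical labels.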
ThamizhiUDp: Universal Dependency Parser 1/2
Hybrid approach
Multilingual Learning (with Hindi/Turkish/Telugu) for Parsing
Labelled Attachment Score (LAS): 62.39
All the data, models and scripts are available
Step                    Tool            Dataset
Tokenisation            Stanza          Tamil UDT
Multi-word tokeniser    Stanza          Tamil UDT
Lemmatisation           Stanza          Tamil UDT
POS tagging             ThamizhiPOSt    Amrita Data
Morphological tagging   ThamizhiMorph   Rule-based
Dependency parsing      uuparser        UDT Hindi/Tamil
Home page:
http://nlp-tools.uom.lk/thamizhi-udp/
How to use:
./parse.sh "input-file"
Note: Input file should be in CoNLL-U format.
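For reference, a CoNLL-U file has one token per line with ten tab-separated fields (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC) and a blank line after each sentence. The sketch below writes pre-tokenised sentences in that layout, leaving every unknown field as "_". Which fields parse.sh actually requires is not stated on the slide, so treat this as an assumption; to_conllu is a hypothetical helper, not part of ThamizhiUDp.

```python
from typing import Iterable, List

CONLLU_FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
                 "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def to_conllu(sentences: Iterable[List[str]]) -> str:
    """Render tokenised sentences as CoNLL-U text.

    Only ID and FORM are filled in; the other eight fields are set to
    the underscore placeholder used by CoNLL-U for unspecified values.
    """
    blocks = []
    for tokens in sentences:
        rows = []
        for index, form in enumerate(tokens, start=1):
            empty = ["_"] * (len(CONLLU_FIELDS) - 2)
            rows.append("\t".join([str(index), form] + empty))
        # Every sentence ends with a blank line.
        blocks.append("\n".join(rows) + "\n\n")
    return "".join(blocks)


# Illustrative two-word sentence plus final punctuation.
print(to_conllu([["அவன்", "வந்தான்", "."]]), end="")
```

The result can be written to a file and passed as the input-file argument, provided the parser accepts underscore placeholders for the unfilled columns.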
ThamizhiUDp: Universal Dependency Parser 2/2
Modern Written Tamil Treebank (MWTT):
https://github.com/UniversalDependencies/UD_Tamil-MWTT/tree/master
Joint work with Dr. K. Parameswari
ThamizhiLFG: Computational Grammar for Tamil
An initial version is available, covering 160 sentences (ParGram [5] + Grade-1 Tamil textbook)
Simple intransitive, transitive and ditransitive sentences and conjunctions are covered
Limited vocabulary; ThamizhiMorph will be integrated
Hosted on the INESS site
How to use: https://clarino.uib.no/iness/xle-web
[5] https://pargram.w.uib.no/
What we need:
People with linguistic knowledge to review tools/annotated data
Benchmark data-sets for evaluation
Acknowledgement
Supervisors:
Prof. Gihan Dias, University of Moratuwa
Prof. Miriam Butt, University of Konstanz
Collaborators:
Dr. K. Parameswari, University of Hyderabad
Ms. S. Rajamathangi, Jawaharlal Nehru University
Scholars who have provided valuable inputs:
Prof. S. Rajendren, Prof. S. Ramesh, colleagues at NLPC
Most of this work was supported by the Accelerating Higher
Education Expansion and Development (AHEAD) Operation of the
Ministry of Higher Education, Sri Lanka, funded by the World Bank, and
by the DAAD (German Academic Exchange Service).
Sarves, NLPC, University of Moratuwa Thamizhi-LPTs December 12, 2020 10 / 10

More Related Content

What's hot

Machine translation with statistical approach
Machine translation with statistical approachMachine translation with statistical approach
Machine translation with statistical approachvini89
 
Hindi –tamil text translation
Hindi –tamil text translationHindi –tamil text translation
Hindi –tamil text translationVaibhav Agarwal
 
Survey on Indian CLIR and MT systems in Marathi Language
Survey on Indian CLIR and MT systems in Marathi LanguageSurvey on Indian CLIR and MT systems in Marathi Language
Survey on Indian CLIR and MT systems in Marathi LanguageEditor IJCATR
 
Corpus-Based Vocabulary Learning in Technical English
Corpus-Based Vocabulary Learning in Technical EnglishCorpus-Based Vocabulary Learning in Technical English
Corpus-Based Vocabulary Learning in Technical EnglishCSCJournals
 
A performance of svm with modified lesk approach for word sense disambiguatio...
A performance of svm with modified lesk approach for word sense disambiguatio...A performance of svm with modified lesk approach for word sense disambiguatio...
A performance of svm with modified lesk approach for word sense disambiguatio...eSAT Journals
 
Phonetic Recognition In Words For Persian Text To Speech Systems
Phonetic Recognition In Words For Persian Text To Speech SystemsPhonetic Recognition In Words For Persian Text To Speech Systems
Phonetic Recognition In Words For Persian Text To Speech Systemspaperpublications3
 
Segmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On SilenceSegmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On Silencepaperpublications3
 
Machine translation from English to Hindi
Machine translation from English to HindiMachine translation from English to Hindi
Machine translation from English to HindiRajat Jain
 
ANALYSIS OF MWES IN HINDI TEXT USING NLTK
ANALYSIS OF MWES IN HINDI TEXT USING NLTKANALYSIS OF MWES IN HINDI TEXT USING NLTK
ANALYSIS OF MWES IN HINDI TEXT USING NLTKijnlc
 
Design of A Spell Corrector For Hausa Language
Design of A Spell Corrector For Hausa LanguageDesign of A Spell Corrector For Hausa Language
Design of A Spell Corrector For Hausa LanguageWaqas Tariq
 
Cross language information retrieval in indian
Cross language information retrieval in indianCross language information retrieval in indian
Cross language information retrieval in indianeSAT Publishing House
 
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...Syeful Islam
 
Introduction to development of lexical databases
Introduction to development of lexical databasesIntroduction to development of lexical databases
Introduction to development of lexical databasesMuhammad Shoaib Chaudhary
 
Marathi Text-To-Speech Synthesis using Natural Language Processing
Marathi Text-To-Speech Synthesis using Natural Language ProcessingMarathi Text-To-Speech Synthesis using Natural Language Processing
Marathi Text-To-Speech Synthesis using Natural Language Processingiosrjce
 

What's hot (20)

Machine translation with statistical approach
Machine translation with statistical approachMachine translation with statistical approach
Machine translation with statistical approach
 
Hindi –tamil text translation
Hindi –tamil text translationHindi –tamil text translation
Hindi –tamil text translation
 
E1 geetha2 karthikeyan
E1 geetha2 karthikeyanE1 geetha2 karthikeyan
E1 geetha2 karthikeyan
 
Survey on Indian CLIR and MT systems in Marathi Language
Survey on Indian CLIR and MT systems in Marathi LanguageSurvey on Indian CLIR and MT systems in Marathi Language
Survey on Indian CLIR and MT systems in Marathi Language
 
I1 geetha3 revathi
I1 geetha3 revathiI1 geetha3 revathi
I1 geetha3 revathi
 
Corpus-Based Vocabulary Learning in Technical English
Corpus-Based Vocabulary Learning in Technical EnglishCorpus-Based Vocabulary Learning in Technical English
Corpus-Based Vocabulary Learning in Technical English
 
A performance of svm with modified lesk approach for word sense disambiguatio...
A performance of svm with modified lesk approach for word sense disambiguatio...A performance of svm with modified lesk approach for word sense disambiguatio...
A performance of svm with modified lesk approach for word sense disambiguatio...
 
Phonetic Recognition In Words For Persian Text To Speech Systems
Phonetic Recognition In Words For Persian Text To Speech SystemsPhonetic Recognition In Words For Persian Text To Speech Systems
Phonetic Recognition In Words For Persian Text To Speech Systems
 
Segmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On SilenceSegmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On Silence
 
Machine translation from English to Hindi
Machine translation from English to HindiMachine translation from English to Hindi
Machine translation from English to Hindi
 
ANALYSIS OF MWES IN HINDI TEXT USING NLTK
ANALYSIS OF MWES IN HINDI TEXT USING NLTKANALYSIS OF MWES IN HINDI TEXT USING NLTK
ANALYSIS OF MWES IN HINDI TEXT USING NLTK
 
Design of A Spell Corrector For Hausa Language
Design of A Spell Corrector For Hausa LanguageDesign of A Spell Corrector For Hausa Language
Design of A Spell Corrector For Hausa Language
 
G1803013542
G1803013542G1803013542
G1803013542
 
Detecting Paraphrases in Marathi Language
Detecting Paraphrases in Marathi LanguageDetecting Paraphrases in Marathi Language
Detecting Paraphrases in Marathi Language
 
Cross language information retrieval in indian
Cross language information retrieval in indianCross language information retrieval in indian
Cross language information retrieval in indian
 
J1803015357
J1803015357J1803015357
J1803015357
 
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...
 
Introduction to development of lexical databases
Introduction to development of lexical databasesIntroduction to development of lexical databases
Introduction to development of lexical databases
 
Marathi Text-To-Speech Synthesis using Natural Language Processing
Marathi Text-To-Speech Synthesis using Natural Language ProcessingMarathi Text-To-Speech Synthesis using Natural Language Processing
Marathi Text-To-Speech Synthesis using Natural Language Processing
 
P1803018289
P1803018289P1803018289
P1803018289
 

Similar to Thamizhi Language Processing Tools

Large Scale Text Processing
Large Scale Text ProcessingLarge Scale Text Processing
Large Scale Text ProcessingSuneel Marthi
 
Large Scale Processing of Unstructured Text
Large Scale Processing of Unstructured TextLarge Scale Processing of Unstructured Text
Large Scale Processing of Unstructured TextDataWorks Summit
 
Improving a Lightweight Stemmer for Gujarati Language
Improving a Lightweight Stemmer for Gujarati LanguageImproving a Lightweight Stemmer for Gujarati Language
Improving a Lightweight Stemmer for Gujarati Languageijistjournal
 
PubhD talk: MT serving the society
PubhD talk: MT serving the societyPubhD talk: MT serving the society
PubhD talk: MT serving the societyLifeng (Aaron) Han
 
Ijartes v1-i1-002
Ijartes v1-i1-002Ijartes v1-i1-002
Ijartes v1-i1-002IJARTES
 
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.Lifeng (Aaron) Han
 
Worldranking universities final documentation
Worldranking universities final documentationWorldranking universities final documentation
Worldranking universities final documentationBhadra Gowdra
 
Integration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translationIntegration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translationChamani Shiranthika
 
Language Identifier for Languages of Pakistan Including Arabic and Persian
Language Identifier for Languages of Pakistan Including Arabic and PersianLanguage Identifier for Languages of Pakistan Including Arabic and Persian
Language Identifier for Languages of Pakistan Including Arabic and PersianWaqas Tariq
 
The NAIST Text-to-Speech System for Blizzard Challenge 2015
The NAIST Text-to-Speech System for Blizzard Challenge 2015The NAIST Text-to-Speech System for Blizzard Challenge 2015
The NAIST Text-to-Speech System for Blizzard Challenge 2015Shinnosuke Takamichi
 
Quality estimation of machine translation outputs through stemming
Quality estimation of machine translation outputs through stemmingQuality estimation of machine translation outputs through stemming
Quality estimation of machine translation outputs through stemmingijcsa
 
Speech To Speech Translation
Speech To Speech TranslationSpeech To Speech Translation
Speech To Speech TranslationIRJET Journal
 
Tutorial - Speech Synthesis System
Tutorial - Speech Synthesis SystemTutorial - Speech Synthesis System
Tutorial - Speech Synthesis SystemIJERA Editor
 
Design and Development of a Malayalam to English Translator- A Transfer Based...
Design and Development of a Malayalam to English Translator- A Transfer Based...Design and Development of a Malayalam to English Translator- A Transfer Based...
Design and Development of a Malayalam to English Translator- A Transfer Based...Waqas Tariq
 
A NEURAL MACHINE LANGUAGE TRANSLATION SYSTEM FROM GERMAN TO ENGLISH
A NEURAL MACHINE LANGUAGE TRANSLATION SYSTEM FROM GERMAN TO ENGLISHA NEURAL MACHINE LANGUAGE TRANSLATION SYSTEM FROM GERMAN TO ENGLISH
A NEURAL MACHINE LANGUAGE TRANSLATION SYSTEM FROM GERMAN TO ENGLISHIRJET Journal
 
Tamil Morphological Analysis
Tamil Morphological AnalysisTamil Morphological Analysis
Tamil Morphological AnalysisKarthik Sankar
 

Similar to Thamizhi Language Processing Tools (20)

Large Scale Text Processing
Large Scale Text ProcessingLarge Scale Text Processing
Large Scale Text Processing
 
Large Scale Processing of Unstructured Text
Large Scale Processing of Unstructured TextLarge Scale Processing of Unstructured Text
Large Scale Processing of Unstructured Text
 
Improving a Lightweight Stemmer for Gujarati Language
Improving a Lightweight Stemmer for Gujarati LanguageImproving a Lightweight Stemmer for Gujarati Language
Improving a Lightweight Stemmer for Gujarati Language
 
PubhD talk: MT serving the society
PubhD talk: MT serving the societyPubhD talk: MT serving the society
PubhD talk: MT serving the society
 
Ijetcas14 444
Ijetcas14 444Ijetcas14 444
Ijetcas14 444
 
D3 dhanalakshmi
D3 dhanalakshmiD3 dhanalakshmi
D3 dhanalakshmi
 
Ijartes v1-i1-002
Ijartes v1-i1-002Ijartes v1-i1-002
Ijartes v1-i1-002
 
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
 
ylchen
ylchenylchen
ylchen
 
Worldranking universities final documentation
Worldranking universities final documentationWorldranking universities final documentation
Worldranking universities final documentation
 
Integration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translationIntegration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translation
 
Language Identifier for Languages of Pakistan Including Arabic and Persian
Language Identifier for Languages of Pakistan Including Arabic and PersianLanguage Identifier for Languages of Pakistan Including Arabic and Persian
Language Identifier for Languages of Pakistan Including Arabic and Persian
 
The NAIST Text-to-Speech System for Blizzard Challenge 2015
The NAIST Text-to-Speech System for Blizzard Challenge 2015The NAIST Text-to-Speech System for Blizzard Challenge 2015
The NAIST Text-to-Speech System for Blizzard Challenge 2015
 
Quality estimation of machine translation outputs through stemming
Quality estimation of machine translation outputs through stemmingQuality estimation of machine translation outputs through stemming
Quality estimation of machine translation outputs through stemming
 
Speech To Speech Translation
Speech To Speech TranslationSpeech To Speech Translation
Speech To Speech Translation
 
Tutorial - Speech Synthesis System
Tutorial - Speech Synthesis SystemTutorial - Speech Synthesis System
Tutorial - Speech Synthesis System
 
Design and Development of a Malayalam to English Translator- A Transfer Based...
Design and Development of a Malayalam to English Translator- A Transfer Based...Design and Development of a Malayalam to English Translator- A Transfer Based...
Design and Development of a Malayalam to English Translator- A Transfer Based...
 
A NEURAL MACHINE LANGUAGE TRANSLATION SYSTEM FROM GERMAN TO ENGLISH
A NEURAL MACHINE LANGUAGE TRANSLATION SYSTEM FROM GERMAN TO ENGLISHA NEURAL MACHINE LANGUAGE TRANSLATION SYSTEM FROM GERMAN TO ENGLISH
A NEURAL MACHINE LANGUAGE TRANSLATION SYSTEM FROM GERMAN TO ENGLISH
 
Tamil Morphological Analysis
Tamil Morphological AnalysisTamil Morphological Analysis
Tamil Morphological Analysis
 
TAUS Moses Roundtable, Prague, 11 September 2013
TAUS Moses Roundtable, Prague, 11 September 2013TAUS Moses Roundtable, Prague, 11 September 2013
TAUS Moses Roundtable, Prague, 11 September 2013
 

More from Kengatharaiyer Sarveswaran

Natural Language Processing for Tamil and Sinhala
Natural Language Processing for Tamil and SinhalaNatural Language Processing for Tamil and Sinhala
Natural Language Processing for Tamil and SinhalaKengatharaiyer Sarveswaran
 
Department of Education - Northern Province - Grade 5 paper
Department of Education - Northern Province - Grade 5 paperDepartment of Education - Northern Province - Grade 5 paper
Department of Education - Northern Province - Grade 5 paperKengatharaiyer Sarveswaran
 
Concept paper for Educational Management Information System
Concept paper for Educational Management Information SystemConcept paper for Educational Management Information System
Concept paper for Educational Management Information SystemKengatharaiyer Sarveswaran
 
21ம் நூற்றாண்டில் இணையக் கல்வியின் முக்கியத்துவம்
21ம் நூற்றாண்டில் இணையக் கல்வியின் முக்கியத்துவம்21ம் நூற்றாண்டில் இணையக் கல்வியின் முக்கியத்துவம்
21ம் நூற்றாண்டில் இணையக் கல்வியின் முக்கியத்துவம்Kengatharaiyer Sarveswaran
 
Teaching and Learning in Northern Province, Sri Lanka
Teaching and Learning in Northern Province, Sri LankaTeaching and Learning in Northern Province, Sri Lanka
Teaching and Learning in Northern Province, Sri LankaKengatharaiyer Sarveswaran
 

More from Kengatharaiyer Sarveswaran (14)

Natural Language Processing for Tamil and Sinhala
Natural Language Processing for Tamil and SinhalaNatural Language Processing for Tamil and Sinhala
Natural Language Processing for Tamil and Sinhala
 
Department of Education - Northern Province - Grade 5 paper
Department of Education - Northern Province - Grade 5 paperDepartment of Education - Northern Province - Grade 5 paper
Department of Education - Northern Province - Grade 5 paper
 
Digital transformation and the SME sector
Digital transformation and the SME sectorDigital transformation and the SME sector
Digital transformation and the SME sector
 
IP and ICT - Intro
IP and ICT - IntroIP and ICT - Intro
IP and ICT - Intro
 
Concept paper for Educational Management Information System
Concept paper for Educational Management Information SystemConcept paper for Educational Management Information System
Concept paper for Educational Management Information System
 
Concept paper - DIY Innovation Center
Concept paper - DIY Innovation CenterConcept paper - DIY Innovation Center
Concept paper - DIY Innovation Center
 
Presentation - CTC
Presentation - CTCPresentation - CTC
Presentation - CTC
 
Being 21st century teacher and e-Learning
Being 21st century teacher and e-LearningBeing 21st century teacher and e-Learning
Being 21st century teacher and e-Learning
 
Using the Internet for Learning
Using the Internet for LearningUsing the Internet for Learning
Using the Internet for Learning
 
21ம் நூற்றாண்டில் இணையக் கல்வியின் முக்கியத்துவம்
21ம் நூற்றாண்டில் இணையக் கல்வியின் முக்கியத்துவம்21ம் நூற்றாண்டில் இணையக் கல்வியின் முக்கியத்துவம்
21ம் நூற்றாண்டில் இணையக் கல்வியின் முக்கியத்துவம்
 
Teaching and Learning in Northern Province, Sri Lanka
Teaching and Learning in Northern Province, Sri LankaTeaching and Learning in Northern Province, Sri Lanka
Teaching and Learning in Northern Province, Sri Lanka
 
Introduction to Electronic Learning
Introduction to Electronic LearningIntroduction to Electronic Learning
Introduction to Electronic Learning
 
Joomla Manual in Tamil
Joomla Manual in TamilJoomla Manual in Tamil
Joomla Manual in Tamil
 
Introduction to PHP
Introduction to PHPIntroduction to PHP
Introduction to PHP
 

Recently uploaded

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 

Recently uploaded (20)

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 

Thamizhi Language Processing Tools

  • 1. Thamizhi-Language Processing Tools Kengatharaiyer Sarveswaran (Sarves) sarves@cse.mrt.ac.lk Department of Computer Science and Engineering University of Moratuwa, Sri Lanka. December 12, 2020 Sarves, NLPC, University of Moratuwa Thamizhi-LPTs December 12, 2020 1 / 10
  • 2. Overview
    Thamizhi-Preprocessor
    ThamizhiPOSt: Tamil POS Tagger
    ThamizhiMorph: Tamil Morphological Analyser/Generator
    ThamizhiUDp: Tamil Universal Dependency Parser
    ThamizhiLFG: Computational Grammar for Tamil using LFG
    What we need
    Acknowledgement
  • 3. Thamizhi-Preprocessor
    Validate words using Nanool grammar
    Normalise Unicode points: க, ெ, ா, க, ், க, ு -> க, ொ, க, ், க, ு
    Home page: http://nlp-tools.uom.lk/thamizhi-preprocessor/
    How to use: download the script from the site, then run:
    python3 thamizhi-preprocessor.py -validate word-to-be-validated
    python3 thamizhi-preprocessor.py -normalise file-to-be-normalised
  • 4. ThamizhiPOSt: Tamil POS Tagger
    Harmonised the BIS [1], Amrita [2] and UPOS [3] tagsets
    Used the Universal POS tagset
    Trained the POS tagger using Stanza
    Trained using Amrita data (mapped to UPOS)
    F1 score: 93.27 (Nov 2020)
    Trained models and POS-tagged data are available for download
    Home page: http://nlp-tools.uom.lk/thamizhi-pos/
    How to use: python3 thamizhi-post.py "input-file"
    [1] tdil-dc.in/tdildcMain/articles/134692Draft%20POS%20Tag%20standard.pdf
    [2] www.amrita.edu/publication/tamil-pos-tagging-using-linear-programming
    [3] universaldependencies.org/u/pos/
  • 5. ThamizhiMorph: Morphological Analyser/Generator
    Rule-based (Finite-State Transducer) implementation
    Implemented using foma [4]
    Handles verbs, nouns, and other particles
    Generates all analyses
    Can be used for morph segmentation:
    வந்தான் வா|+verb|+fin|+sim|+strong|+past=(ந்)த்|+3sgm=ஆன்)
    All the models, data and scripts are available
    Home page: http://nlp-tools.uom.lk/thamizhi-morph/
    How to use: python3 thamizhi-morph.py "input-file"
    [4] fomafst.github.io/
  • 6. ThamizhiUDp: Universal Dependency Parser 1/2
    Hybrid approach
    Multilingual learning (with Hindi/Turkish/Telugu) for parsing
    Labelled Attachment Score (LAS): 62.39
    All the data, models and scripts are available
    Step                    Tool            Dataset
    Tokenisation            Stanza          Tamil UDT
    Multi-word tokeniser    Stanza          Tamil UDT
    Lemmatisation           Stanza          Tamil UDT
    POS tagging             ThamizhiPOSt    Amrita Data
    Morphological tagging   ThamizhiMorph   Rule-based
    Dependency parsing      uuparser        UDT Hindi/Tamil
    Home page: http://nlp-tools.uom.lk/thamizhi-udp/
    How to use: ./parse.sh "input-file"
    Note: the input file should be in CoNLL-U format.
  • 7. ThamizhiUDp: Universal Dependency Parser 2/2
    Tamil Modern Written Tamil Treebank: https://github.com/UniversalDependencies/UD_Tamil-MWTT/tree/master
    Joint work with Dr. K. Parameswari
  • 8. ThamizhiLFG: Computational Grammar for Tamil
    An initial version, covering 160 sentences (ParGram [5] + Grade-1 Tamil textbook), is available
    Covers simple intransitive, transitive and ditransitive sentences, and conjunctions
    Limited vocabulary; ThamizhiMorph will be integrated
    Hosted on the INESS site
    How to use: https://clarino.uib.no/iness/xle-web
    [5] https://pargram.w.uib.no/
  • 9. What we need:
    People with linguistic knowledge to review tools/annotated data
    Benchmark data-sets for evaluation
  • 10. Acknowledgement
    Supervisors: Prof. Gihan Dias, University of Moratuwa; Prof. Miriam Butt, University of Konstanz
    Collaborators: Dr. K. Parameswari, University of Hyderabad; Ms. S. Rajamathangi, Jawaharlal Nehru University
    Scholars who have provided valuable inputs: Prof. S. Rajendren, Prof. S. Ramesh, colleagues at NLPC
    Most of this work was supported by the Accelerating Higher Education Expansion and Development (AHEAD) Operation of the Ministry of Higher Education, Sri Lanka, funded by the World Bank, and by the DAAD (German Academic Exchange Service).
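
The Unicode normalisation step on the Thamizhi-Preprocessor slide can be illustrated with a short sketch. It assumes that standard Unicode NFC composition is a reasonable stand-in for this one case; it is not the code of thamizhi-preprocessor.py, whose rules may differ or go further. The function names `normalise_tamil` and `code_points` are hypothetical.

```python
import unicodedata


def normalise_tamil(text: str) -> str:
    """Return text in Unicode NFC form.

    Assumption: NFC stands in for the preprocessor's normalisation here.
    NFC composes the two-part vowel sign U+0BC6 + U+0BBE into U+0BCA.
    """
    return unicodedata.normalize("NFC", text)


def code_points(text: str) -> list:
    """List the code points of text as U+XXXX strings, for inspection."""
    return [f"U+{ord(ch):04X}" for ch in text]


# The slide's example: the vowel sign O typed as two separate code points.
raw = "\u0b95\u0bc6\u0bbe\u0b95\u0bcd\u0b95\u0bc1"
fixed = normalise_tamil(raw)

print(code_points(raw))    # seven code points
print(code_points(fixed))  # six code points, with U+0BCA in second position
```

Both strings usually render identically on screen, which is why such differences go unnoticed until string comparison or dictionary lookup fails.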
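
The ThamizhiUDp slide notes that the input file should be in CoNLL-U format. As a rough illustration of that format (ten tab-separated columns per token, `#` comment lines, a blank line after each sentence), here is a minimal reader. It is not part of ThamizhiUDp; the helper `read_conllu` and the sample sentence are hypothetical, and the sample's columns are mostly left as `_` rather than annotated.

```python
from typing import Dict, List

# The ten CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comment, e.g. "# text = ..."; skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


# Hand-written illustration, not taken from any treebank.
SAMPLE = (
    "# text = அவன் வந்தான்.\n"
    "1\tஅவன்\t_\tPRON\t_\t_\t_\t_\t_\t_\n"
    "2\tவந்தான்\t_\tVERB\t_\t_\t_\t_\t_\tSpaceAfter=No\n"
    "3\t.\t_\tPUNCT\t_\t_\t_\t_\t_\t_\n"
    "\n"
)

sentences = read_conllu(SAMPLE)
for token in sentences[0]:
    print(token["id"], token["form"], token["upos"])
```

Running a check like this before calling ./parse.sh can catch the most common problem, lines with spaces instead of tabs, early.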