An Efficient Rule-Based System for
Morphological Parsing of Tamil Language
தமிழ் உருபனியல் ஆய்வு (Tamil Morphological Analysis)
STUDENTS:
Karthik S 106106029
Praveen Kumar 106106045
Venkataraman GB 106106073
GUIDE:
Dr. V. Gopalakrishnan
Final Semester Project
Department of Computer Science and Engineering
National Institute of Technology, Tiruchirappalli
May 2010
Agenda
• Overview of the Project
• NLP Applications – The Stakeholders
• The problem at hand
• The proposed solution
◦ Rule – Based Morphological Analysis
◦ Machine Learning
• Where does it all fit in ?
• Need for Tamil Morphological Analysis
• Resources Obtained
• Implementation Details
• Demonstration
• Future Scope
30/01/15 National Institute of Technology, Tiruchirappalli
Overview of the Project
• Natural Language Processing
• Morphological Analysis
• Tamil Language
Morphing …
… And in Tamil
நடந்தான் நடந்தனர்
நடக்கின்றாள்
நடப்பான்
நடக்கின்றான்
NLP Applications – The Stakeholders
WHO ARE THE STAKEHOLDERS ?
Natural Language Processing Applications like:
Stemming
Machine Translation
Speech Recognition
Information Retrieval
WHY ARE THESE APPLICATIONS THE STAKEHOLDERS ?
The problem at hand
Morphological analysis of Tamil involves understanding the structure of a word and its
inflections.
AGGLUTINATION IN TAMIL
Agglutination is the morphological process of adding affixes to the base of a word.
A typical Tamil verb form carries a number of suffixes that mark person, number, mood,
tense and voice.
INFLECTIONS IN TAMIL
பால் - Gender
எண் - Number
திணை - Class
காலம் - Tense
இடம் - Person
Example: vāḻntukoṇṭiruntēṉ [வாழ்ந்துகொண்டிருந்தேன்]
vāḻ - வாழ்: root (live)
intu - ந்து: voice marker (past tense object voice)
koṇṭu - கொண்டு: tense marker (during)
irunta - இருந்த: aspect marker (past progressive)
ēṉ - ஏன்: person marker (first person, singular)
The proposed solution
A word is represented at two levels, the surface level and the lexical level. At the
surface level, the word appears in its original orthographic form. At the lexical level,
it is represented by denoting all of its functional components.
RULE – BASED MORPHOLOGICAL ANALYSIS
Analyzing word inflections using rules specified in Tamil Grammar
அன் ஆன் அள் ஆள் அர் ஆர் பம்மார்
அ ஆ குடுதுறு என் ஏன் அல் அன்
அம் ஆம் எம் ஏம் ஓமொ டும்மூர்
கடதற ஐ ஆய் இம்மின் இர் ஈர்
ஈயர் கயவு மென்பவும் பிறவும்
வினையின் விகுதி பெயரினும் சிலவே
SURFACE LEVEL / LEXICAL LEVEL
நன்னூல் (Nannool)
தொல்காப்பியம் (Tholkaapiyam)
The proposed solution
MACHINE LEARNING APPROACH
1. When a word is checked for suffixes, strict application of the rules may allow more
than one suffix, although only one of them is semantically possible.
விகுதி (suffix): படித்து – “உ”; படித்தது – “து” or “உ” ???
The machine learning approach lets the system “learn” the correct parse for the
word, so that when the same word is processed again, the wrong
possibilities are eliminated automatically.
2. Two words may share the same inflectional part.
நடக்கின்றான் படிக்கின்றான்
The system learns the inflectional part of every word it analyses. This saves work,
because the second word need not be analysed again from
scratch.
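The slides do not say how the learning is implemented. As a minimal sketch, assuming the simplest possible mechanism (a lookup table keyed by the inflectional part), the reuse described in point 2 could look like this. The class name InflectionMemory, its methods, and the tag "PRESENT TENSE" are hypothetical; only "3SM" appears elsewhere in this deck.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: remembers how an inflectional part was analysed once,
// so that a later word ending in the same part can reuse the result.
class InflectionMemory {
    // Inflectional part (as written) -> tags accepted for it.
    private final Map<String, List<String>> learnt = new HashMap<>();

    // Record the analysis that was finally accepted for an inflectional part.
    void learn(String inflection, List<String> tags) {
        learnt.put(inflection, List.copyOf(tags));
    }

    // Return the longest learnt inflection that the word ends with, if any.
    Optional<String> longestKnownInflection(String word) {
        String best = null;
        for (String inflection : learnt.keySet()) {
            if (word.endsWith(inflection)
                    && (best == null || inflection.length() > best.length())) {
                best = inflection;
            }
        }
        return Optional.ofNullable(best);
    }

    // Tags stored for an inflection, or an empty list if it was never learnt.
    List<String> tagsFor(String inflection) {
        return learnt.getOrDefault(inflection, List.of());
    }

    public static void main(String[] args) {
        InflectionMemory memory = new InflectionMemory();
        // Learnt while analysing நடக்கின்றான்; the tag names are assumptions.
        memory.learn("க்கின்றான்", List.of("PRESENT TENSE", "3SM"));
        // படிக்கின்றான் ends with the same part, so the stored tags are reused.
        memory.longestKnownInflection("படிக்கின்றான்")
              .ifPresent(part -> System.out.println(memory.tagsFor(part)));
    }
}
```

A plain string comparison is enough here because both example words end in the same written sequence. Suffixes that begin with a vowel would need the C-V segmented form instead.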
Where does it all fit in ?
Characters: ப டி த் தா ன்
Word – Tokenization: படித்தான்
Morphological Analysis: படி - த்த் - ஆன்
Sentence Syntax Analysis: அவன் புத்தகத்தைப் படித்தான்
Semantic Analysis: Meaning of the sentence ???
Need for Tamil Morphological Analysis
ENGLISH vs. TAMIL
I came: நான் வந்தேன்
You came: நீ வந்தாய்
They came: அவர்கள் வந்தனர்
He came: அவன் வந்தான்
She came: அவள் வந்தாள்
TRANSLATION AND SEMANTIC ANALYSIS
அவன் மதுரைக்கு வந்தாள் -- Semantically Wrong (the subject அவன், “he”, is masculine, but the verb ending -ஆள் is feminine)
To check the semantic correctness of a sentence, morphological analysis is needed.
How to translate the above sentence ??
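The agreement problem above can be illustrated with a small sketch. It assumes that the analyser yields a person-number-gender tag for the verb ending. Only the tag "3SM" appears in this deck; the other tag names, the class AgreementCheck, and the direct lookup of whole verb forms are assumptions made to keep the example short.

```java
import java.util.Map;

// Hypothetical sketch: subject-verb agreement check using
// person-number-gender (PNG) tags.
class AgreementCheck {
    // Pronoun -> PNG tag it requires on the verb.
    static final Map<String, String> PRONOUN_TAG = Map.of(
        "நான்", "1S",
        "நீ", "2S",
        "அவன்", "3SM",
        "அவள்", "3SF",
        "அவர்கள்", "3P");

    // Verb form -> PNG tag carried by its ending. A real analyser would derive
    // the tag from the suffix; here the five forms of "came" are listed directly.
    static final Map<String, String> VERB_TAG = Map.of(
        "வந்தேன்", "1S",
        "வந்தாய்", "2S",
        "வந்தான்", "3SM",
        "வந்தாள்", "3SF",
        "வந்தனர்", "3P");

    // True only when both words are known and their tags are identical.
    static boolean agrees(String pronoun, String verb) {
        String expected = PRONOUN_TAG.get(pronoun);
        return expected != null && expected.equals(VERB_TAG.get(verb));
    }

    public static void main(String[] args) {
        System.out.println(agrees("அவன்", "வந்தான்")); // tags match
        System.out.println(agrees("அவன்", "வந்தாள்")); // 3SM subject, 3SF verb
    }
}
```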
Resources Obtained
EMILLE – CIIL TAMIL MONOLINGUAL CORPUS
Enabling Minority Language Engineering
Collaborative Venture of
◦ Lancaster University, UK
◦ Central Institute of Indian Languages (CIIL), Mysore, India
Distributed by European Language Resources Association [ELRA]
TAMIL WORDNET
The database is a semantic dictionary that is designed as a lexical network
Developed by
◦ Department of Linguistics of Tamil University
◦ AU-KBC Research Centre, Chennai
Tamil Wordnet resembles a traditional dictionary. It also contains valuable
information about morphologically related words
Implementation Details - 1
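[Flowchart; boxes listed in the order they appear, branch arrows are not recoverable from this text]
Input Tamil Word
Check in DB (Yes / No)
C-V Segmentation
Root verb ? (Yes / No)
Backward Scanning of inflections
Classify and Remove Inflection
Output
Conflict Resolution
Machine Learning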
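The backward scanning step can be sketched as follows, assuming the word has already been split into consonant and vowel units. The two suffix entries and their tags are the ones from the worked example in this deck. The class BackwardScanner, its method, and the longest-match policy are assumptions; conflict resolution and the database check are not modelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of backward scanning over C-V segmented units.
class BackwardScanner {
    // Suffix (units joined by spaces) -> tag. A real table would be far larger.
    static final Map<String, String> SUFFIX_TAGS = Map.of(
        "ஆ ன்", "3SM",
        "த் த்", "PAST TENSE");

    // Strips known suffixes from the end of the unit list, longest match first.
    // Returns the tags in the order they were removed (outermost suffix first)
    // and leaves only the root units in `units`, which must be a mutable list.
    static List<String> strip(List<String> units) {
        List<String> tags = new ArrayList<>();
        boolean matched = true;
        while (matched) {
            matched = false;
            // Longer tails are tried first; at least one unit is kept as the root.
            for (int len = units.size() - 1; len >= 1; len--) {
                List<String> tail = units.subList(units.size() - len, units.size());
                String tag = SUFFIX_TAGS.get(String.join(" ", tail));
                if (tag != null) {
                    tags.add(tag);
                    tail.clear(); // removes the matched suffix from `units`
                    matched = true;
                    break;
                }
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        // Units of படித்தான் after C-V segmentation.
        List<String> units = new ArrayList<>(
            List.of("ப்", "அ", "ட்", "இ", "த்", "த்", "ஆ", "ன்"));
        System.out.println(strip(units)); // tags, outermost suffix first
        System.out.println(units);        // remaining root units
    }
}
```

Taking only the longest match sidesteps the case where several suffixes fit, which is exactly where the conflict resolution box would come in.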
Implementation Details - 2
படித்தான்
ப டி த் தா ன்
ப் - அ ட் - இ த் த் - ஆ ன்
ப் அ ட் இ த் த் ஆ ன்
படி <VERB_ROOT>
த்த் <PAST TENSE>
ஆன் <3SM>
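The C-V segmentation shown above can be sketched with the Tamil code points directly. This is a sketch under assumptions, not the project's code: the class CvSegmenter is hypothetical, the input is assumed to be in normalized form (two-part vowel signs stored as a single code point), and Grantha or other special characters are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: splits a Tamil word into pure consonants and vowels.
class CvSegmenter {
    private static final char PULLI = '\u0BCD';   // virama: marks a pure consonant
    private static final String VOWEL_A = "\u0B85"; // inherent vowel அ

    // Dependent vowel sign -> independent vowel letter.
    private static final Map<Character, Character> SIGN_TO_VOWEL = Map.ofEntries(
        Map.entry('\u0BBE', '\u0B86'), // ா -> ஆ
        Map.entry('\u0BBF', '\u0B87'), // ி -> இ
        Map.entry('\u0BC0', '\u0B88'), // ீ -> ஈ
        Map.entry('\u0BC1', '\u0B89'), // ு -> உ
        Map.entry('\u0BC2', '\u0B8A'), // ூ -> ஊ
        Map.entry('\u0BC6', '\u0B8E'), // ெ -> எ
        Map.entry('\u0BC7', '\u0B8F'), // ே -> ஏ
        Map.entry('\u0BC8', '\u0B90'), // ை -> ஐ
        Map.entry('\u0BCA', '\u0B92'), // ொ -> ஒ
        Map.entry('\u0BCB', '\u0B93'), // ோ -> ஓ
        Map.entry('\u0BCC', '\u0B94')); // ௌ -> ஔ

    private static boolean isConsonant(char c) {
        return c >= '\u0B95' && c <= '\u0BB9';
    }

    static List<String> segment(String word) {
        List<String> units = new ArrayList<>();
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (!isConsonant(c)) {
                units.add(String.valueOf(c)); // independent vowel or other: keep as is
                continue;
            }
            units.add("" + c + PULLI); // the pure consonant
            char next = i + 1 < word.length() ? word.charAt(i + 1) : '\0';
            if (next == PULLI) {
                i++; // consonant was already pure
            } else if (SIGN_TO_VOWEL.containsKey(next)) {
                units.add(String.valueOf(SIGN_TO_VOWEL.get(next)));
                i++; // consume the vowel sign
            } else {
                units.add(VOWEL_A); // bare consonant letter carries inherent அ
            }
        }
        return units;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", segment("படித்தான்")));
    }
}
```

Working on these units is what lets a vowel-initial suffix such as ஆன் be found at the end of படித்தான், where the written form ends in தான்.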
Implementation Details - 3
UNICODE SUPPORT FOR TAMIL
U+0B80 – U+0BFF
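As a small illustration of the range above, Java's standard library can tell whether a string consists only of code points from the Tamil block. The class and method names are hypothetical; Character.UnicodeBlock.TAMIL is part of the JDK.

```java
// Hypothetical helper: checks that input lies entirely in U+0B80..U+0BFF.
class TamilBlock {
    // True when the text is non-empty and every code point is in the Tamil block.
    static boolean isAllTamil(String text) {
        return !text.isEmpty() && text.codePoints()
            .allMatch(cp -> Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.TAMIL);
    }

    public static void main(String[] args) {
        System.out.println(isAllTamil("படி"));  // Tamil letters only
        System.out.println(isAllTamil("padi")); // Roman letters
    }
}
```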
GOOGLE TAMIL TRANSLITERATOR IME (Input Method)
Google Transliteration IME is an input method editor that allows users to
enter Tamil text using a Roman keyboard
PROGRAMMING LANGUAGE
Java
DATABASES
MySQL databases, accessed through JDBC
Implementation Details - 3
TRANSLITERATION MODULE
A simple transliterator module that converts Tamil script to its English (Roman) spelling
and vice versa
Example:
◦ அ - a
◦ ஆ - aa
◦ க - ka
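A table-driven transliterator of this kind might be sketched as below. Only the three pairs listed above are included, since the full mapping is not given in the slides. The class Transliterator, its method names, and the longest-match rule for the Roman-to-Tamil direction are assumptions.

```java
import java.util.Map;

// Hypothetical sketch of a table-driven transliterator.
class Transliterator {
    // Tamil letter -> Roman spelling (only the pairs shown on the slide).
    static final Map<String, String> TAMIL_TO_ROMAN = Map.of(
        "அ", "a",
        "ஆ", "aa",
        "க", "ka");

    // Replaces each known Tamil letter; unknown characters are copied unchanged.
    static String toRoman(String tamil) {
        StringBuilder out = new StringBuilder();
        tamil.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp));
            out.append(TAMIL_TO_ROMAN.getOrDefault(ch, ch));
        });
        return out.toString();
    }

    // At each position, picks the longest Roman spelling that matches.
    static String toTamil(String roman) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < roman.length()) {
            String tamil = null;
            int matched = 0;
            for (Map.Entry<String, String> e : TAMIL_TO_ROMAN.entrySet()) {
                String spelling = e.getValue();
                if (spelling.length() > matched && roman.startsWith(spelling, i)) {
                    tamil = e.getKey();
                    matched = spelling.length();
                }
            }
            if (tamil == null) {
                out.append(roman.charAt(i)); // unknown letter: copy unchanged
                i++;
            } else {
                out.append(tamil);
                i += matched;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toRoman("ஆக"));
        System.out.println(toTamil("aaka"));
    }
}
```

With such a scheme the two directions are not always inverses of each other: a sequence like "aaa" can be read in more than one way, so the longest-match rule is only one possible choice.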
HASH TABLE GENERATOR
The application uses two data files, containing lists of vigudhi (விகுதி, suffixes) and
idainilai (இடைநிலை, medial markers).
The Java hash generator code loads the data from these workbooks, adds
it to a hash table, serializes the table, and writes it to an external data
file, which can be loaded whenever the application requires access.
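The serialize-and-reload step described above can be sketched with standard Java object serialization. The class SuffixTableStore, the use of HashMap rather than Hashtable, the temporary file, and the single sample entry are all assumptions; reading the workbooks is not shown because their format is not given.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;

// Hypothetical sketch: writes a suffix table to a file and reads it back.
class SuffixTableStore {
    // Serializes the whole table into one external data file.
    static void save(HashMap<String, String> table, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(table);
        }
    }

    // Loads a table previously written by save().
    @SuppressWarnings("unchecked")
    static HashMap<String, String> load(Path file)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (HashMap<String, String>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        HashMap<String, String> vigudhi = new HashMap<>();
        vigudhi.put("ஆன்", "3SM"); // sample entry taken from the worked example
        Path file = Files.createTempFile("vigudhi", ".ser");
        save(vigudhi, file);
        System.out.println(load(file));
        Files.delete(file);
    }
}
```

Building the table once and loading the serialized file at start-up avoids parsing the source data on every run.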
Future Scope
• The algorithm can be extended to cover nouns and noun forms too.
• The algorithm can be improved to incorporate stricter rules so as to reduce
conflicts that arise in the output generated by the current system.
• The algorithm can be extended for other agglutinative languages.
• The various resources obtained as a part of this project, including the
EMILLE-CIIL ELRA Corpus, the Tamil Wordnet Database and other tools, can
be used for further study, research and development in the field of Natural
Language Processing at our college in the years to come.
References
• A Novel Approach to Morphological Analysis for Tamil Language
◦ Anand Kumar M, Dhanalakshmi V, Rajendran S, Soman K P
• Nannool and Tholkaapiyam
◦ Tamil grammar texts
• The Morphological Generator and Parsing Engine for Tamil Verb Forms.
◦ Ultimate Software Solution, Dindigul
• Morphological Analyzer for Tamil
◦ Anandan P, Ranjani Parthasarathy, Geetha T.V. [2002]
◦ ICON 2002, RCILTS-Tamil, Anna University, India.
• Morphology. A Handbook on Inflection and Word Formation
◦ Daelemans Walter, G. Booij, Ch. Lehmann, and J. Mugdan (eds.) [2004]
• Tamil Part-of-Speech tagger based on SVMTool
◦ Dhanalakshmi V, Anandkumar M, Vijaya M.S, Loganathan R, Soman K.P, Rajendran S [2008]
◦ Proceedings of the COLIPS International Conference on Asian Language Processing 2008 (IALP).
• Unsupervised Learning of the Morphology of a Natural Language.
◦ John Goldsmith. [2001]
◦ Computational Linguistics, 27(2):153–198.
• Computational morphology of verbal complex
◦ Rajendran, S., Arulmozi, S., Ramesh Kumar, Viswanathan, S. [2001]
◦ Paper read in Conference at Dravidian University, Kuppam, December 26-29, 2001.
Thank you