Abstract
Anusaaraka is an English-Hindi language accessing software. With insights from
Panini's Ashtadhyayi (grammar rules), Anusaaraka is a machine translation tool
being developed by the Chinmaya International Foundation (CIF), the International
Institute of Information Technology, Hyderabad (IIIT-H) and the University of
Hyderabad (Department of Sanskrit Studies). A fusion of traditional Indian shastras
and advanced modern technologies is what Anusaaraka is all about.
Anusaaraka allows users to access text in any Indian language after
translation from the source language (i.e. English or any other regional Indian
language). In today's Information Age, large volumes of information are available in
English – whether it be information for competitive exams or even general reading.
However, many educated people whose primary language is Hindi or a
regional Indian language are unable to access information in English. Anusaaraka
aims to bridge this language barrier by allowing a user to enter an English text
into Anusaaraka and get a translation of it in an Indian language. The
Anusaaraka being referred to here has English as the source language and Hindi as
the target language.
Anusaaraka derives its name from the Sanskrit word ‘Anusaran’, which
means ‘to follow’. It is so called because the translated Anusaaraka output appears in
layers – i.e. a sequence of steps that follow each other until the final translation is
displayed to the user.
International Institute of Information
Technology (IIIT), Hyderabad
The International Institute of Information Technology, Hyderabad (IIIT-H) is an
autonomous university founded in 1998. It was set up as a not-for-profit public
private partnership (NPPP) and is the first IIIT to be set up (under this model) in
India. The Government of Andhra Pradesh lent support to the institute by grant of
land and buildings. A Governing Council consisting of eminent people from
academia, industry and government presides over the governance of the institution.
IIIT-H was set up as a research university focused on the core areas of
Information Technology, such as Computer Science, Electronics and
Communications, and their applications in other domains. The institute has evolved
strong research programs in a host of areas, with computation or IT providing the
connecting thread, and with an emphasis on the development of technology and
applications that can be transferred to industry and society. This has
required carrying out basic research that can be used to solve real-life problems.
As a result, a synergistic relationship has come to exist at the Institute between
basic and applied research. Faculty members carry out a number of academic and industrial
projects, and a few companies have been incubated based on the research done at
the Institute.
IIIT-H is organized into research centers and labs, instead of the
conventional departments, to facilitate inter-disciplinary research and a seamless
flow of knowledge within the Institute. Faculty assigned to the centers and labs
conduct research as well as academic programs, which are owned by the Institute
and not by individual research centers.
Machine Translation
Machine Translation is an important technology for localization, and is
particularly relevant in a linguistically diverse country like India. Human
translation in India is a rich and ancient tradition. Works of philosophy, arts,
mythology, religion, science and folklore have been translated among the ancient
and modern Indian languages. Numerous classic works of art, ancient, medieval
and modern, have also been translated between European and Indian languages
since the 18th century. In the current era, human translation finds application
mainly in administration, media and education, and to a lesser extent in
business, the arts, and science and technology. India is a linguistically rich area: it
has 18 constitutional languages, which are written in 10 different scripts. Hindi is
the official language of the Union. English is very widely used in the media,
commerce, science and technology, and education. Many of the states have their
own regional language, which is either Hindi or one of the other constitutional
languages. Only about 5% of the population speaks English. In such a situation,
there is a big market for translation between English and the various Indian
languages. Currently, this translation is essentially manual, and the use of automation is
largely restricted to word processing. Two specific examples of high-volume
manual translation are the translation of news from English into local languages, and the
translation of annual reports of government departments and public sector units
among English, Hindi and the local language.
As is clear from the above, the market is largest for translation from English
into Indian languages, primarily Hindi. Hence, it is no surprise that a majority of
the Indian Machine Translation (MT) systems are for English-Hindi translation.
Natural language processing presents many challenges, of which the biggest is the
inherent ambiguity of natural language. MT systems have to deal with ambiguity
and various other NL phenomena. In addition, the linguistic divergence between the
source and target language makes MT a bigger challenge. This is particularly true
of widely divergent languages such as English and the Indian languages. The major
structural differences between English and Indian languages can be summarized as
follows. English is a highly positional language with rudimentary morphology
and a default SVO (subject-verb-object) sentence structure. Indian languages are highly inflectional, with a rich
morphology, relatively free word order, and a default SOV (subject-object-verb) sentence structure. In addition,
there are many stylistic differences. For example, it is common to see very long
sentences in English, using abstract concepts as the subjects of sentences and
stringing several clauses together (as in this sentence!). Such constructions are not
natural in Indian languages, and present major difficulties in producing good
translations.
As is recognized the world over, with the current state of the art in MT, it is
not possible to have Fully Automatic, High Quality, General-Purpose Machine
Translation. Practical systems need to handle ambiguity and the other complexities
of natural language processing by relaxing one or more of the above dimensions.
Thus, we can have automatic high-quality ‘sub-language’ systems for specific
domains, automatic general-purpose systems giving rough translations, or
interactive general-purpose systems with pre- or post-editing.
Why Machine Translation?
Today, technology has made it possible for individuals worldwide to access large
volumes of information at the click of a button. However, very often the
information sought may not be in a language that the individual is familiar with.
Machine Translation is thus an endeavor to minimize the language barrier by
making it possible to access a text in the language of one's choice. For technology
to be able to provide this facility, many aspects of language are involved.
To name a few:
• Script
• Spelling
• Vocabulary
• Morphology
• Syntax
Keeping the above in mind, machine translation systems need to be
equipped to translate a text within seconds and yet capture the information of the
text to the best possible extent.
Anusaaraka
The focus in Anusaaraka is not mainly on machine translation, but on Language
Access between Indian languages. Using principles of Paninian Grammar (PG), and
exploiting the close similarity of Indian languages, Anusaaraka essentially maps
local word groups between the source and target languages. Where there are
differences between the languages, the system introduces extra notation to
preserve the information of the source language. Thus, the user needs some
training to understand the output of the system. The project has developed
Language Accessors from many Indian languages into Hindi.
Anusaaraka maps constructions in the source language to the
corresponding constructions in the target language wherever possible. For
example, a noun or pronoun in the source language is mapped to an appropriate
noun or pronoun, respectively, in the target language as shown below:
@H: Apa pustaka paDha_raHA_[HE|thA]_kyA{23_ba.}?
!E: You book read_ing_[is|was] Q.?
E: Are/were you reading a book?
(Where the prefixes mean the following:
@H=anusaaraka Hindi, !E=English gloss, E=English.)
In the example above, the last word in the sentence is a verb and illustrates the
mapping morpheme by morpheme: the root is mapped to 'paDha' (read), and
similarly the tense-aspect-modality (TAM) label is mapped to 'raHA_[HE|thA]'
(is_*ing or was_*ing), which is followed by the 'A' suffix that gets mapped to 'kyA'
(what) as a question marker in Hindi. Gender, number, and person (GNP) information
is also shown separately in curly brackets ('{23_ba.}' for second or third person
and plural).
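The morpheme-by-morpheme mapping described above can be pictured as a simple table lookup over roots and TAM labels. The sketch below is purely illustrative and is not Anusaaraka's actual implementation; the table entries and names (ROOT_TABLE, TAM_TABLE, map_morpheme) are hypothetical, and unknown morphemes are passed through unchanged so that no source information is lost.

#include <stdio.h>
#include <string.h>

/* Hypothetical English-to-Hindi mapping entries: roots and
   tense-aspect-modality (TAM) labels are looked up independently,
   mirroring the morpheme-by-morpheme transfer described above. */
struct MapEntry { const char *src; const char *tgt; };

static const struct MapEntry ROOT_TABLE[] = {
    { "read", "paDha" },
    { "go",   "jA"    },
};

static const struct MapEntry TAM_TABLE[] = {
    { "is_*ing",  "raHA_HE"  },
    { "was_*ing", "raHA_thA" },
};

/* Look a morpheme up in a table; return the source form unchanged
   when no mapping is known. */
static const char *map_morpheme(const struct MapEntry *table, int n,
                                const char *src)
{
    int i;
    for (i = 0; i < n; i++)
        if (strcmp(table[i].src, src) == 0)
            return table[i].tgt;
    return src;
}

int main(void)
{
    const char *root = map_morpheme(ROOT_TABLE, 2, "read");
    const char *tam  = map_morpheme(TAM_TABLE, 2, "is_*ing");
    /* Prints "paDha_raHA_HE": the root and the TAM label mapped separately. */
    printf("%s_%s\n", root, tam);
    return 0;
}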
Sometimes, for a construction in the source language, the same
construction is not available in the target language. In such a case, the system
chooses another construction in the target language in which the same information
can be expressed. In the example below, the system chooses the complementizer
construction in Hindi (EsA) to express the same sense:
@H: hamArA_ ladakI_ko` nOkarI karanA_EsA nahIM_[hE|WA].
!E: Our daughter (dat.) job do_should_that not (fem.)
E: It is not the case that our daughter should get a job.
However, Anusaaraka shows an image of the source text and therefore uses the complementizer
(EsA). Sometimes there are slight differences between a construction in the source
language and a similar construction in the target language, because of which
information might not be preserved. In such a situation, additional notation is
introduced to express the information which would otherwise get lost. A simple
example of this is the lack of distinction in Hindi between the personal pronoun and the pronominal
adjective: vaha.
@H: vaha` pAThshAlA_ko` gayA.
!E: he school (dat.) went.
E: He went to school.
@H: vaha- pAThshAlA_ko` TrophI AyI.
!E: that school (dat.) trophy came
E: That school received the trophy.
When transferring from one language to the other, this distinction would have
disappeared if care were not taken. In Anusaaraka, the two forms are kept
distinct by introducing additional notation:
vaha` (he)
vaha- (that)
Salient Features of Anusaaraka
Faithful representation of text in source language:
Throughout the various layers of Anusaaraka output there is an effort to ensure
that the user is able to understand the information contained in the English
sentence. This is given greater importance than producing perfect sentences in Hindi,
for it would be pointless to have a translation that reads well but does not truly
capture the information of the source text.
The layered output is unique to Anusaaraka. Thus, the source language text
information and how the Hindi translation is finally arrived at can be accessed by
the user. The important feature of the layered output is that the information
transfer is done in a controlled manner at every step, thus making it possible to
revert without any loss of information. Also, any loss of information that
cannot be avoided in the translation process is incurred gradually.
Therefore, even if the translated sentence is not as 'perfect' as a human translation,
with some effort and orientation in reading Anusaaraka output, an individual can
understand what the source text is implying by looking at the layers and the context in
which the sentence appears.
Reversibility:
The gradual transfer of information from one layer to the next
gives Anusaaraka the additional advantage of reversibility in the
translation process – a feature which cannot be achieved by a conventional
machine translation system. A bilingual user of Anusaaraka can, at any point,
access the source language text in English, because of the transparency of the
output. Some amount of orientation on how to read the Anusaaraka output would be
required for this.
Transparency:
Display of the step-by-step translation layers gives an increased level of confidence to
the end-user, as he can trace back to the source and get clarity regarding the translated
text through analysis of the output layers and some reference to context.
Champollion
Champollion is a robust parallel text sentence aligner. Parallel text is a very
valuable resource for a number of natural language processing tasks, including
machine translation, cross-language information retrieval, and word sense
disambiguation. Parallel text provides the maximum utility when it is sentence
aligned. The sentence alignment process maps sentences in the source text to their
translations. The labour-intensive and time-consuming nature of manual sentence
alignment makes large parallel text corpus development difficult. Thus a number of
automatic sentence alignment approaches have been proposed and utilized; some
are purely length based, some are lexicon based, and some are a mixture
of the two approaches.
While existing approaches perform reasonably well on close language
pairs, their performance degrades quickly on remote language pairs such as English
and Chinese. Performance degradation is exacerbated by noise in the data.
Champollion was initially developed for aligning Chinese-English
parallel text. It was later ported to other language pairs, including Arabic-
English and Hindi-English.
Champollion differs from other sentence aligners in two ways. First, it
assumes a noisy input, i.e. that a large percentage of alignments will not be one-to-one
alignments, and that the number of deletions and insertions will be significant. The
assumption is against declaring a match in the absence of lexical evidence. Non-
lexical measures, such as sentence length information – which are often unreliable
when dealing with noisy data – can and should still be used, but they should only
play a supporting role when lexical evidence is present. Second, Champollion
differs from other lexicon-based approaches in assigning weights to translated
words. Translation lexicons usually help sentence aligners in the following way:
first, translated words are identified by using entries from a translation lexicon;
second, statistics of translated words are then used to identify sentence
correspondences.
In most existing sentence alignment algorithms, translated words are
treated equally, i.e. translated word pairs are assigned equal weight when deciding
sentence correspondences. Earlier work could also assume relatively clean input:
for example, 1-1 alignments constitute 89% of the UBS
English-French corpus, and 1-0 and 0-1 alignments constitute merely 1.3%.
However, when creating very large parallel corpora, the data can be very noisy.
For example, in a UN Chinese-English corpus, 6.4% of all alignments are either 1-0
or 0-1 alignments.
Some of the omissions and insertions were introduced during the
translation of the text. Most of the omissions and insertions, however, are
introduced during different stages of processing before sentence alignment is
carried out. The pre-processing steps include converting the raw data to plain text
format, removing tables, footnotes, endnotes, etc. Most of these steps introduce
noise. For instance, while a table in an English document can be completely
removed, this is not necessarily the case in the corresponding Chinese document. Because
of the sheer number of documents involved, manually examining each document
after pre-processing is impossible. A robust sentence aligner needs not only to
detect most categories of noise, but also to recover quickly if an error is made. It
has been shown that existing methods work very well on clean data, but their
performance degrades quickly as the data becomes noisy.
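As a rough illustration of how lexical evidence and length information can be combined when scoring a candidate 1-1 alignment, the sketch below counts lexicon-supported word pairs and down-weights the score when the two sentences have an implausible length ratio. This is only a hedged approximation of the general idea; it is not Champollion's actual scoring function, and the lexicon entries, names and thresholds are hypothetical.

#include <stdio.h>
#include <string.h>

/* Hypothetical one-word translation lexicon (English -> romanized Hindi). */
struct LexEntry { const char *en; const char *hi; };
static const struct LexEntry LEXICON[] = {
    { "book",   "pustaka"   },
    { "read",   "paDha"     },
    { "school", "pAThshAlA" },
};
#define LEXICON_SIZE (sizeof(LEXICON) / sizeof(LEXICON[0]))

/* Score a candidate 1-1 alignment: count word pairs supported by the
   lexicon, then halve the score if the sentence lengths (in words) are
   wildly different. Lexical evidence dominates; length only supports. */
static double alignment_score(char *en_words[], int n_en,
                              char *hi_words[], int n_hi)
{
    int i, j;
    size_t k;
    double score = 0.0;
    for (i = 0; i < n_en; i++)
        for (j = 0; j < n_hi; j++)
            for (k = 0; k < LEXICON_SIZE; k++)
                if (strcmp(en_words[i], LEXICON[k].en) == 0 &&
                    strcmp(hi_words[j], LEXICON[k].hi) == 0)
                    score += 1.0;
    /* Penalize implausible length ratios (hypothetical threshold of 2.0). */
    if (n_en > 2 * n_hi || n_hi > 2 * n_en)
        score *= 0.5;
    return score;
}

int main(void)
{
    char *en[] = { "you", "read", "a", "book" };
    char *hi[] = { "apa", "pustaka", "paDha" };
    printf("score = %.1f\n", alignment_score(en, 4, hi, 3));
    return 0;
}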
CODES
Code for extracting regular text from an XML file:
#include<stdio.h>
#include<string.h>
#include<stdlib.h>
#include<ctype.h> //REQUIRED FOR isdigit()
//MAXIMUM NUMBER OF PAGES ALLOWED
#define MAX 200
//EXTENSION OF THE FILES BEING CREATED FOR EACH PAGE
#define EXTENSION ".xml"
//LENGTH OF THE EXTENSION OF THE FILE
#define EXTENSION_LENGTH strlen(EXTENSION)
char temp[MAX];
//EXACT NUMBER OF PAGES IN THE SOURCE XML FILE
int totalPages;
//CONTAINS THE CURRENT PAGE NUMBER CONVERTED TO ITS CORRESPONDING FILENAME
char pageNumber[20];
//FILE POINTERS FOR READING THE PAGE FILE AND WRITING TO FINAL TEXT FILES
//TWO TEXT FILES ARE CREATED
//ONE FOR NON-SORTED AND THE OTHER FOR SORTED DATA ACCORDING TO CO-ORDINATES OF THE TEXT ON THE PAGE
FILE *fr,*fw;
//STRUCTURE FOR THE CONTENTS OF A SINGLE LINE OF THE XML FILE
struct Line
{
int top;
int left;
int width;
int height;
int font;
char text[10000];
};
//STRUCTURE FOR THE CONTENTS OF A SINGLE PAGE OF XML FILE
struct Page
{
struct Line line[MAX];
int lines;
};
//STRUCTURE FOR THE PAGE HEADER
struct Header
{
int fontId;
char fontSize[10];
char color[10];
struct Header *link;
};
typedef struct Header* HEADER;
struct Page pages[MAX];
HEADER head;
//CONTAINS THE FONTS FOR WHICH THE TEXT IS TO BE EXTRACTED
int fonts[MAX];
//CONTAINS TOTAL NUMBER OF FONTS
int totalFonts;
HEADER getHeader()
{
return((HEADER)malloc(1*sizeof(struct Header)));
}
void generatePages(char arg[50])
{
char arr[100]="./genPages.out ";
strcat(arr,arg);
printf("Creating Pagesn");
system("cc genPages.c -o genPages.out");
system(arr);
printf("Pages createdn");
}
void convertToText(int page)
{
int l,i,j;
char rev[20];
i=0;
while(page!=0)
{
rev[i++]=(page%10)+48;
page=page/10;
}
l=i;
i--;
for(j=0;j<l;j++)
{
pageNumber[j]=rev[i--];
}
for(i=0;i<EXTENSION_LENGTH;i++)
pageNumber[i+l]=EXTENSION[i];
pageNumber[i+l]='\0';
}
void fetchHeader()
{
int c,i;
HEADER t,cur;
while(1)
{
c=getc(fr);
if(c=='<')
{
c=getc(fr);
if(c=='f')
{
t=getHeader();
t->link=NULL;
while(!isdigit(c=getc(fr)));
i=0;
while(isdigit(c))
{
temp[i++]=c;
c=getc(fr);
}
temp[i]='\0';
t->fontId=atoi(temp);
while(!isdigit(c=getc(fr)));
i=0;
while(isdigit(c))
{
t->fontSize[i++]=c;
c=getc(fr);
}
t->fontSize[i]='\0';
while((c=getc(fr))!='#');
c=getc(fr);
i=0;
while(c!='"')
{
t->color[i++]=c;
c=getc(fr);
}
t->color[i]='\0';
if(head==NULL)
{
head=t;
}
else
{
cur=head;
while(cur->link!=NULL)
cur=cur->link;
cur->link=t;
}
while(getc(fr)!='>');
}
else
break;
}
}
}
int checkLineEnd()
{
int c,i;
i=0;
while((c=getc(fr))!='>')
temp[i++]=c;
temp[i]='\0';
if(strcmp(temp,"/text")==0)
return(1);
return(0);
}
void fetchText(int pgNo)
{
int c,i;
i=0;
while(1)
{
c=getc(fr);
if(c=='<')
{
if(checkLineEnd())
break;
else
continue;
}
pages[pgNo].line[pages[pgNo].lines].text[i++]=c;
}
pages[pgNo].line[pages[pgNo].lines].text[i]='\0';
}
void fetchPageInfo(int pgNo)
{
int c,i;
c=getc(fr);
while(c!=EOF)
{
while(!isdigit(c=getc(fr)));
i=0;
while(isdigit(c))
{
temp[i++]=c;
c=getc(fr);
}
temp[i]='\0';
pages[pgNo].line[pages[pgNo].lines].top=atoi(temp);
while(!isdigit(c=getc(fr)));
i=0;
while(isdigit(c))
{
temp[i++]=c;
c=getc(fr);
}
temp[i]='\0';
pages[pgNo].line[pages[pgNo].lines].left=atoi(temp);
while(!isdigit(c=getc(fr)));
i=0;
while(isdigit(c))
{
temp[i++]=c;
c=getc(fr);
}
temp[i]='\0';
pages[pgNo].line[pages[pgNo].lines].width=atoi(temp);
while(!isdigit(c=getc(fr)));
i=0;
while(isdigit(c))
{
temp[i++]=c;
c=getc(fr);
}
temp[i]='\0';
pages[pgNo].line[pages[pgNo].lines].height=atoi(temp);
while(!isdigit(c=getc(fr)));
i=0;
while(isdigit(c))
{
temp[i++]=c;
c=getc(fr);
}
temp[i]='\0';
pages[pgNo].line[pages[pgNo].lines].font=atoi(temp);
printf("Fetching text for line %dn",pages[pgNo].lines);
c=getc(fr);
fetchText(pgNo);
pages[pgNo].lines++;
c=getc(fr);
c=getc(fr);
}
}
void fetchFontId(int argc,char *argv[])
{
int i;
HEADER cur;
for(i=3;i<argc-1;i=i+2)
{
cur=head;
while(cur!=NULL)
{
if((strcmp(argv[i],cur->fontSize)==0)&&(strcmp(argv[i+1],cur->color)==0))
{
fonts[totalFonts++]=cur->fontId;
}
cur=cur->link;
}
}
}
void createPages()
{
int i;
for(i=1;i<=totalPages;i++)
{
convertToText(i);
fr=fopen(pageNumber,"r");
if(fr==NULL)
{
printf("Cannot open the file %snExittingn",pageNumber);
exit(0);
}
printf("Fetching information of Page %dn",i);
fetchHeader();
pages[i].lines=0;
fetchPageInfo(i);
printf("Information of Page %d fetchedn",i);
fclose(fr);
}
}
int checkFont(int fnt)
{
int i;
for(i=0;i<totalFonts;i++)
{
if(fnt==fonts[i])
return(1);
}
return(0);
}
void sortPage(int pgNo)
{
struct Line temp;
int i,j;
for(i=0;i<pages[pgNo].lines-1;i++)
{
for(j=i+1;j<pages[pgNo].lines;j++)
{
if(pages[pgNo].line[i].top>=pages[pgNo].line[j].top)
{
if(pages[pgNo].line[i].top==pages[pgNo].line[j].top)
{
if(pages[pgNo].line[i].left>pages[pgNo].line[j].left)
{
temp=pages[pgNo].line[i];
pages[pgNo].line[i]=pages[pgNo].line[j];
pages[pgNo].line[j]=temp;
}
}
else
{
temp=pages[pgNo].line[i];
pages[pgNo].line[i]=pages[pgNo].line[j];
pages[pgNo].line[j]=temp;
}
}
}
}
}
void writeText(int pgNo)
{
int i;
for(i=0;i<pages[pgNo].lines;i++)
{
if(checkFont(pages[pgNo].line[i].font))
{
fputs(pages[pgNo].line[i].text,fw);
putc('\n',fw);
}
}
}
void createTextFile(char arg[MAX])
{
int i;
for(i=1;i<=totalPages;i++)
{
writeText(i);
}
fclose(fw);
strcat(arg,"_sorted");
fw=fopen(arg,"w");
if(fw==NULL)
{
printf("Cannot create the file %snEXITINGn",arg);
return;
}
for(i=1;i<=totalPages;i++)
{
sortPage(i);
writeText(i);
}
}
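/* USAGE (AS IMPLIED BY HOW THE ARGUMENTS ARE READ BELOW; THE EXACT
   INVOCATION IS AN ASSUMPTION, NOT DOCUMENTED IN THE ORIGINAL REPORT):
     ./a.out <source.xml> <totalPages> <fontSize1> <color1> [<fontSize2> <color2> ...] <outputFile>
   EXTRACTED TEXT IS WRITTEN TO <outputFile>, AND A POSITION-SORTED COPY
   TO <outputFile>_sorted */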
int main(int argc,char *argv[])
{
totalPages=atoi(argv[2]);
generatePages(argv[1]);
head=NULL;
createPages();
totalFonts=0;
fetchFontId(argc,argv);
fw=fopen(argv[argc-1],"w");
if(fw==NULL)
{
printf("Cannot create the file %snEXITINGn",argv[argc-1]);
return(0);
}
createTextFile(argv[argc-1]);
fclose(fw);
return(0);
}
Code for dividing the XML file into pages in accordance with the .pdf file used:
#include<stdio.h>
#include<string.h>
#define MAX 2000
#define START_PATTERN "<page"
#define START_PATTERN_LENGTH strlen(START_PATTERN)
#define END_PATTERN "</page>"
#define END_PATTERN_LENGTH strlen(END_PATTERN)
#define EXTENSION ".xml"
#define EXTENSION_LENGTH strlen(EXTENSION)
FILE *fr,*fw;
char temp[MAX];
char pageNumber[20];
void convertToText(int page)
{
int l,i,j;
char rev[20];
i=0;
while(page!=0)
{
rev[i++]=(page%10)+48;
page=page/10;
}
l=i;
i--;
for(j=0;j<l;j++)
{
pageNumber[j]=rev[i--];
}
for(i=0;i<EXTENSION_LENGTH;i++)
pageNumber[i+l]=EXTENSION[i];
pageNumber[i+l]='\0';
}
int skip()
{
int i,c;
for(i=0;i<START_PATTERN_LENGTH;i++)
{
c=getc(fr);
if(c==EOF)
return(EOF);
temp[i]=c;
}
temp[i]='\0';
do
{
if(strcmp(temp,START_PATTERN)==0)
{
c=getc(fr);
while(c!='>')
c=getc(fr);
c=getc(fr);
return(1);
}
c=getc(fr);
if(c==EOF)
return(EOF);
for(i=0;i<START_PATTERN_LENGTH-1;i++)
temp[i]=temp[i+1];
temp[i++]=c;
temp[i]='\0';
}while(1);
}
int checkPageEnd()
{
int i;
for(i=0;i<END_PATTERN_LENGTH;i++)
{
temp[i]=getc(fr);
}
temp[i]='\0';
if(strcmp(temp,END_PATTERN)==0)
return(1);
return(0);
}
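/* USAGE (AN ASSUMPTION BASED ON HOW argv IS READ BELOW; NOT DOCUMENTED
   IN THE ORIGINAL REPORT):
     ./genPages.out <source.xml>
   ONE FILE PER PAGE, NAMED 1.xml, 2.xml, ..., IS CREATED IN THE
   CURRENT DIRECTORY */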
int main(int argc,char *argv[])
{
int c,r,i,page;
page=1;
fr=fopen(argv[1],"r");
if(fr==NULL)
{
printf("Cannot open %sn",argv[1]);
return(0);
}
do
{
if(skip()==EOF)
break;
convertToText(page);
fw=fopen(pageNumber,"w");
if(fw==NULL)
{
printf("Cannot create file %sn",pageNumber);
return(0);
}
else
printf("File for Page Number %s createdn",pageNumber);
do
{
c=getc(fr);
if(c=='<')
{
ungetc(c,fr);
r=checkPageEnd();
if(r)
break;
//putc('<',fw);
for(i=0;temp[i]!='\0';i++)
putc(temp[i],fw);
}
else
{
putc(c,fw);
}
}while(1);
fclose(fw);
page++;
}while(c!=EOF);
fclose(fr);
return(0);
}
Word-Sense Disambiguation
(WSD)
In computational linguistics, word-sense disambiguation (WSD) is an open problem
of natural language processing, which governs the process of identifying which
sense of a word (i.e. meaning) is used in a sentence when the word has multiple
meanings. The solution to this problem impacts other computer-related tasks,
such as discourse analysis, improving the relevance of search engines, anaphora resolution and
coherence. A disambiguation process requires two things: a dictionary to
specify the senses which are to be disambiguated and a corpus of language data to
be disambiguated (in some methods, a training corpus of language examples is also
required). The WSD task has two variants: the "lexical sample" and the "all words" task. The
former comprises disambiguating the occurrences of a small sample of target words
which were previously selected, while in the latter all the words in a piece of
running text need to be disambiguated. The latter is deemed a more realistic form
of evaluation, but the corpus is more expensive to produce because human
annotators have to read the definitions for each word in the sequence every time
they need to make a tagging judgement, rather than once for a block of instances
for the same target word.
To give a hint of how all this works, consider two examples of the distinct
senses that exist for the (written) word "bass":
• a type of fish
• tones of low frequency
and the sentences:
• I went fishing for some sea bass.
• The bass line of the song is too weak.
To a human, it is obvious that the first sentence is using the word "bass
(fish)", as in the former sense above, and that the second sentence is using the word "bass
(instrument)", as in the latter sense. Developing algorithms to
replicate this human ability can often be a difficult task, as is further exemplified
by the implicit equivocation between "bass (sound)" and "bass (musical
instrument)".
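One classical way to attack this algorithmically is the Lesk-style idea of comparing the words surrounding the ambiguous term with the words in each sense's dictionary gloss, and picking the sense with the larger overlap. The sketch below is a heavily simplified, hypothetical illustration of that idea (the glosses and names such as pick_sense are made up for the example); it is not the disambiguation method used in this project.

#include <stdio.h>
#include <string.h>

/* Two hypothetical dictionary glosses for "bass". */
static const char *GLOSS_FISH  = "a type of fish found in the sea";
static const char *GLOSS_SOUND = "tones of low frequency in music or song";

/* Count how many context words occur in a gloss (a substring match is
   enough for this sketch; a real system would match whole words). */
static int overlap(const char *gloss, const char *context[], int n)
{
    int i, count = 0;
    for (i = 0; i < n; i++)
        if (strstr(gloss, context[i]) != NULL)
            count++;
    return count;
}

/* Pick the sense whose gloss shares more words with the context. */
static const char *pick_sense(const char *context[], int n)
{
    return overlap(GLOSS_FISH, context, n) >= overlap(GLOSS_SOUND, context, n)
           ? "bass (fish)" : "bass (sound)";
}

int main(void)
{
    const char *ctx1[] = { "fishing", "sea" };          /* "...sea bass."    */
    const char *ctx2[] = { "line", "song", "music" };   /* "...bass line..." */
    printf("%s\n", pick_sense(ctx1, 2));   /* prints: bass (fish)  */
    printf("%s\n", pick_sense(ctx2, 3));   /* prints: bass (sound) */
    return 0;
}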
C Language Integrated Production System:
CLIPS is an expert system tool originally developed by the Software
Technology Branch (STB), NASA/Lyndon B. Johnson Space Center. Since its first
release in 1986, CLIPS has undergone continual refinement and improvement. It is
now used by thousands of people around the world. CLIPS is designed to facilitate
the development of software to model human knowledge or expertise. There are
three ways to represent knowledge in CLIPS:
• Rules, which are primarily intended for heuristic knowledge based on experience.
• Deffunctions and generic functions, which are primarily intended for procedural
knowledge.
• Object-oriented programming, also primarily intended for procedural knowledge.
The five generally accepted features of object-oriented programming are
supported: classes, message-handlers, abstraction, encapsulation, inheritance,
polymorphism. Rules may pattern-match on objects and facts.
We can develop software using only rules, only objects, or a mixture of
objects and rules. CLIPS has also been designed for integration with other
languages such as C and Java. Rules and objects form an integrated system, too,
since rules can pattern-match on facts and objects. In addition to being used as a
stand-alone tool, CLIPS can be called from a procedural language, perform its
function, and then return control back to the calling program. Likewise, procedural
code can be defined as external functions and called from CLIPS. When the
external code completes execution, control returns to CLIPS. CLIPS is an excellent
tool for word-sense disambiguation.
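As a sketch of the embedding described above, the following minimal C program (adapted from the pattern shown in the CLIPS Advanced Programming Guide; the file name wsd-rules.clp is a placeholder) loads a rule file, resets working memory and runs the rules before returning control to the caller. Building it requires linking against the CLIPS library, and header and call names can vary across CLIPS versions, so this should be read as an assumption-laden illustration rather than project code.

#include "clips.h"   /* CLIPS embedded API (assumed available on the include path) */

int main(void)
{
    /* Initialize the CLIPS engine inside this C program. */
    InitializeEnvironment();

    /* Load a (hypothetical) rule file containing WSD rules,
       reset working memory, and fire the rules. */
    Load("wsd-rules.clp");
    Reset();
    Run(-1L);          /* -1 means run until no rules remain activated */

    /* Control returns here once the rules have finished firing. */
    return 0;
}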
Conclusion
MT is relatively new in India – about a decade old. In comparison with MT efforts
in Europe and Japan, which are at least three decades old, it would seem that Indian
MT has a long way to go. However, this can also be an advantage, because Indian
researchers can learn from the experience of their global counterparts. There are
close to a dozen projects now, with about six of them in advanced prototype or
technology transfer stage, and the rest having been newly initiated.
The Indian NLP/MT scene so far has been characterized by an acute
scarcity of basic lexical resources such as corpora, MRDs, lexicons, thesauri and
terminology banks. Also, the various MT groups have used different formalisms
best suited to their specific applications, and hence there has been little sharing of
resources among them. These issues are being addressed now. There are
governmental as well as voluntary efforts under way to develop common lexical
resources, and to create forums for consolidating and coordinating NLP and MT
efforts. It appears that the exploratory phase of Indian MT is over, and the
consolidation phase is about to begin, with the focus moving from proof-of-
concept prototypes to productionization, deployment, collaborative resource
sharing and evaluation.
The core Anusaaraka output is in a language close to the target
language, and can be understood by the human reader after some training. The
question is how much training is necessary to get a very high degree of
comprehension. Our experience of working among Indian languages shows that this
training is likely to be small. The reason for this is that India forms a linguistic area:
Indian languages share vocabulary and grammatical constructions, and there are also
shared pragmatics and culture. A similar approach can be applied to build an English-to-
Hindi Anusaaraka. A study can be conducted on the training required to read
such an output. The expectation is that a usable English-to-Hindi system can be built,
except that it will require longer training.
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in NoidaBuds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
 

Summer Research Project (Anusaaraka) Report

and modern Indian languages. Numerous classic works of art, ancient, medieval and modern, have also been translated between European and Indian languages since the 18th century. In the current era, human translation finds application mainly in administration, media and education, and to a lesser extent in business, the arts, and science and technology.
India is a linguistically rich area: it has 18 constitutional languages, written in 10 different scripts. Hindi is the official language of the Union. English is very widely used in the media, commerce, science and technology, and education. Many of the states have their own regional language, which is either Hindi or one of the other constitutional languages. Only about 5% of the population speaks English. In such a situation, there is a big market for translation between English and the various Indian languages. Currently, this translation is essentially manual, with the use of automation largely restricted to word processing. Two specific examples of high-volume manual translation are the translation of news from English into local languages, and the translation of annual reports of government departments and public sector units among English, Hindi and the local language. As is clear from the above, the market is largest for translation from English into Indian languages, primarily Hindi. Hence, it is no surprise that a majority of Indian Machine Translation (MT) systems are for English-Hindi translation.
Natural language processing presents many challenges, the biggest of which is the inherent ambiguity of natural language. MT systems have to deal with this ambiguity, along with various other natural-language phenomena. In addition, the linguistic divergence between the source and target languages makes MT a bigger challenge. This is particularly true of widely divergent languages such as English and the Indian languages. The major structural difference between English and Indian languages can be summarized as
follows: English is a highly positional language with rudimentary morphology and a default SVO sentence structure, while Indian languages are highly inflectional, with a rich morphology, relatively free word order, and a default SOV sentence structure. In addition, there are many stylistic differences. For example, it is common to see very long sentences in English, using abstract concepts as the subjects of sentences and stringing several clauses together (as in this sentence!). Such constructions are not natural in Indian languages, and present major difficulties in producing good translations.
As is recognized the world over, with the current state of the art in MT it is not possible to have Fully Automatic, High-Quality, General-Purpose Machine Translation. Practical systems need to handle ambiguity and the other complexities of natural language processing by relaxing one or more of the above dimensions. Thus, we can have automatic high-quality ‘sub-language’ systems for specific domains, automatic general-purpose systems giving rough translations, or interactive general-purpose systems with pre- or post-editing.

Why Machine Translation?
Today, technology has made it possible for individuals worldwide to access large volumes of information at the click of a button. However, very often the information sought is not in a language that the individual is familiar with. Machine Translation is thus an endeavor to minimize the language barrier by making it possible to access a text in the language of one's choice. For technology to provide this facility, many aspects of language are involved. To name a few:
• Script
• Spelling
• Vocabulary
• Morphology
• Syntax
Keeping the above in mind, machine translation systems need to be equipped to translate a text within seconds and yet capture the information of the text to the best possible extent.
6
Anusaaraka
The focus in Anusaaraka is not mainly on machine translation, but on Language Access between Indian languages. Using principles of Paninian Grammar (PG), and exploiting the close similarity of Indian languages, Anusaaraka essentially maps local word groups between the source and target languages. Where there are differences between the languages, the system introduces extra notation to preserve the information of the source language; the user therefore needs some training to understand the output of the system. The project has developed Language Accessors from many Indian languages into Hindi.
Anusaaraka maps constructions in the source language to the corresponding constructions in the target language wherever possible. For example, a noun or pronoun in the source language is mapped to an appropriate noun or pronoun, respectively, in the target language, as shown below:
@H: Apa pustaka paDha_raHA_[HE|thA]_kyA{23_ba.}?
!E: You book read_ing_[is|was] Q.?
E: Are/were you reading a book?
(The prefixes mean the following: @H = Anusaaraka Hindi, !E = English gloss, E = English.)
In the example above, the last word in the sentence is a verb and illustrates the morpheme-by-morpheme mapping: the root is mapped to 'paDha' (read), the tense-aspect-modality (TAM) label is mapped to 'raHA_[HE|thA]' (is_*ing or was_*ing), and this is followed by a suffix mapped to 'kyA' (what), which serves as the question marker in Hindi. Gender, number and person (GNP) information is also shown separately in curly brackets ('{23_ba.}' for second or third person, plural).
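As a rough illustration of this morpheme-wise mapping, the small C sketch below looks up a verb root and a TAM label in two tiny tables and prints a layered gloss. The tables, the lookup helper and the example verb group are invented for this illustration; this is not the Anusaaraka dictionary or its actual transfer code.

#include <stdio.h>
#include <string.h>

struct entry { const char *src; const char *tgt; };

/* hypothetical root dictionary (invented entries) */
static const struct entry roots[] = {
    { "read", "paDha" },
    { "go",   "jA"    },
};

/* hypothetical TAM-label table; the Hindi side keeps both alternatives in
   brackets so that no source information is lost */
static const struct entry tams[] = {
    { "ing_[is|was]", "raHA_[HE|thA]" },
    { "ed",           "yA"            },
};

static const char *lookup(const struct entry *t, size_t n, const char *key)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(t[i].src, key) == 0)
            return t[i].tgt;
    return "?";  /* unknown morpheme: marked rather than silently guessed */
}

int main(void)
{
    const char *root = "read";          /* root of the English verb group   */
    const char *tam  = "ing_[is|was]";  /* its tense-aspect-modality label  */

    /* the pieces stay separable in the output, which is what makes the
       layered gloss traceable back to the English morphemes */
    printf("!E: %s_%s Q.?\n", root, tam);
    printf("@H: %s_%s_kyA\n",
           lookup(roots, sizeof roots / sizeof roots[0], root),
           lookup(tams,  sizeof tams  / sizeof tams[0],  tam));
    return 0;
}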
Sometimes the same construction is not available in the target language for a construction in the source language. In such a case, the system chooses another construction in the target language in which the same information can be expressed. In the example below, the system chooses the complementizer construction in Hindi (EsA) to express the same sense:
@H: hamArA_ ladakI_ko` nOkarI karanA_EsA nahIM_[hE|WA].
!E: Our daughter (dat.) job do_should_that not (fem.)
E: It is not the case that our daughter should get a job.
Anusaaraka presents a faithful image of the source text, and therefore uses the complementizer construction (EsA) here.
Sometimes there are slight differences between a construction in the source language and a similar construction in the target language, because of which information might not be preserved. In such a situation, additional notation is introduced to express the information which would otherwise be lost. A simple example is the lack of distinction in Hindi between the personal pronoun and the pronominal adjective, both rendered as vaha:
@H: vaha` pAThshAlA_ko` gayA.
!E: he school (dat.) went.
E: He went to school.
@H: vaha- pAThshAlA_ko` TrophI AyI.
!E: that school (dat.) trophy came
E: That school received the trophy.
When transferring from one language to the other, this distinction would disappear if care were not taken. In Anusaaraka, the two forms are kept distinct by introducing additional notation:
vaha` (he)
vaha- (that)
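The extra notation can be pictured with an equally small sketch: the English side tells us whether the word behind 'vaha' was a personal pronoun or a demonstrative, and the annotation records that distinction so it is not lost. The two-word rule below is an invented placeholder, not the actual Anusaaraka analysis.

#include <stdio.h>
#include <string.h>

/* The distinction is read off the English source, where it is explicit:
   the personal pronoun "he" is rendered vaha`, while the demonstrative
   "that" is rendered vaha-.  (Invented rule, for illustration only.) */
static const char *render_vaha(const char *english_word)
{
    if (strcmp(english_word, "he") == 0)   return "vaha`";
    if (strcmp(english_word, "that") == 0) return "vaha-";
    return "vaha";
}

int main(void)
{
    printf("He went to school.              -> %s ...\n", render_vaha("he"));
    printf("That school received the trophy. -> %s ...\n", render_vaha("that"));
    return 0;
}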
8
Salient Features of Anusaaraka
Faithful representation of the source text: Throughout the various layers of Anusaaraka output, there is an effort to ensure that the user is able to understand the information contained in the English sentence. This is given greater importance than producing perfect sentences in Hindi, for it would be pointless to have a translation that reads well but does not truly capture the information of the source text. The layered output is unique to Anusaaraka: the user can access both the source language information and the way the Hindi translation is finally arrived at. The important feature of the layered output is that information transfer is done in a controlled manner at every step, making it possible to revert without any loss of information. Any loss of information that cannot be avoided in the translation process happens gradually. Therefore, even if the translated sentence is not as 'perfect' as a human translation, an individual with some orientation in reading Anusaaraka output can understand what the source text implies by looking at the layers and the context in which the sentence appears.
Reversibility: The gradual transfer of information from one layer to the next gives Anusaaraka the additional advantage of reversibility in the translation process, a feature which cannot be achieved by a conventional machine translation system. Because of the transparency of the output, a bilingual user of Anusaaraka can, at any point, access the source language text in English. Some orientation in how to read the Anusaaraka output is required for this.
Transparency: The display of step-by-step translation layers gives an increased level of confidence to the end user, who can trace back to the source and get clarity regarding the translated text by analysing the output layers with some reference to context.
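A minimal sketch of the layered, traceable output described in this section: each stage keeps the string it received, so the final output can be stepped back toward the source. The layer names and strings are placeholders taken from the earlier example, not real Anusaaraka stages.

#include <stdio.h>

#define MAX_LAYERS 4

struct layer { const char *name; const char *text; };

int main(void)
{
    /* each stage records the string it produced; nothing is overwritten,
       so any layer can be inspected or stepped back to */
    struct layer layers[MAX_LAYERS];
    int n = 0;

    layers[n++] = (struct layer){ "English source",
        "Are/were you reading a book?" };
    layers[n++] = (struct layer){ "English gloss",
        "You book read_ing_[is|was] Q.?" };
    layers[n++] = (struct layer){ "Anusaaraka Hindi",
        "Apa pustaka paDha_raHA_[HE|thA]_kyA{23_ba.}?" };

    /* printing from the last layer back to the first shows how the output
       can be traced to the source without loss of information */
    for (int i = n - 1; i >= 0; i--)
        printf("layer %d (%s): %s\n", i, layers[i].name, layers[i].text);
    return 0;
}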
10
Champollion
Champollion is a robust parallel-text sentence aligner. Parallel text is a very valuable resource for a number of natural language processing tasks, including machine translation, cross-language information retrieval and word disambiguation. Parallel text provides the maximum utility when it is sentence aligned. The sentence alignment process maps sentences in the source text to their translations. The labour-intensive and time-consuming nature of manual sentence alignment makes large parallel-text corpus development difficult. Thus a number of automatic sentence alignment approaches have been proposed and utilized; some are purely length-based, some are lexicon-based, and some mix the two approaches. While existing approaches perform reasonably well on close language pairs, their performance degrades quickly on remote language pairs such as English and Chinese. This degradation is exacerbated by noise in the data.
Champollion was initially developed for aligning Chinese-English parallel text. It was later ported to other language pairs, including Arabic-English and Hindi-English. Champollion differs from other sentence aligners in two ways. First, it assumes a noisy input, i.e. that a large percentage of alignments will not be one-to-one, and that the number of deletions and insertions will be significant. This assumption argues against declaring a match in the absence of lexical evidence. Non-lexical measures such as sentence length, which are often unreliable when dealing with noisy data, can and should still be used, but they should only play a supporting role when lexical evidence is present. Second, Champollion differs from other lexicon-based approaches in assigning weights to translated words. Translation lexicons usually help sentence aligners in the following way: first, translated words are identified using entries from a translation lexicon;
second, statistics of the translated words are then used to identify sentence correspondences. In most existing sentence alignment algorithms, translated words are treated equally, i.e. translated word pairs are assigned equal weight when deciding sentence correspondences; Champollion, by contrast, assigns different weights to different translated word pairs.
Earlier approaches also tended to assume relatively clean data: for example, 1-1 alignments constitute 89% of the UBS English-French corpus, while 1-0 and 0-1 alignments constitute merely 1.3%. However, when creating very large parallel corpora, the data can be very noisy. For example, in a UN Chinese-English corpus, 6.4% of all alignments are either 1-0 or 0-1 alignments. Some of the omissions and insertions were introduced during the translation of the text. Most of them, however, are introduced during the different stages of processing carried out before sentence alignment. These pre-processing steps include converting the raw data to plain-text format and removing tables, footnotes, endnotes, etc., and most of them introduce noise. For instance, while a table in an English document may be completely removed, this is not necessarily the case in the corresponding Chinese document. Because of the sheer number of documents involved, manually examining each document after pre-processing is impossible. A robust sentence aligner therefore needs not only to detect most categories of noise, but also to recover quickly when an error is made. Existing methods have been shown to work very well on clean data, but their performance drops quickly as the data becomes noisy.
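The weighting idea can be illustrated with a toy scoring function in C: translated word pairs found in a small lexicon contribute according to a weight, so frequent function words count for little, while sentence length acts only as a mild correction. The mini-lexicon, the weights and the example sentence pair are all invented; this is not Champollion's actual scoring formula.

#include <stdio.h>
#include <string.h>
#include <math.h>

struct lexentry { const char *en; const char *fr; double weight; };

/* invented English-French mini-lexicon with idf-like weights: content
   words carry far more weight than function words */
static const struct lexentry lex[] = {
    { "government", "gouvernement", 2.9 },
    { "signed",     "signe",        2.7 },
    { "treaty",     "traite",       3.2 },
    { "the",        "le",           0.2 },
};

/* score one candidate sentence pair: lexical matches dominate, and
   length information plays only a supporting role */
static double pair_score(const char *en[], int n_en, const char *fr[], int n_fr)
{
    double lexical = 0.0;
    for (int i = 0; i < n_en; i++)
        for (int j = 0; j < n_fr; j++)
            for (size_t k = 0; k < sizeof lex / sizeof lex[0]; k++)
                if (strcmp(en[i], lex[k].en) == 0 &&
                    strcmp(fr[j], lex[k].fr) == 0)
                    lexical += lex[k].weight;

    double length_penalty = 0.1 * fabs((double)n_en - (double)n_fr);
    return lexical - length_penalty;
}

int main(void)
{
    const char *en[] = { "the", "government", "signed", "the", "treaty" };
    const char *fr[] = { "le", "gouvernement", "signe", "le", "traite" };

    printf("alignment score = %.2f\n", pair_score(en, 5, fr, 5));
    return 0;
}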
12
CODES
Code for extracting regular text from the XML file:

#include<stdio.h>
#include<string.h>
#include<stdlib.h>
#include<ctype.h>

//MAXIMUM NUMBER OF PAGES ALLOWED
#define MAX 200
//EXTENSION OF THE FILES BEING CREATED FOR EACH PAGE
#define EXTENSION ".xml"
//LENGTH OF THE EXTENSION OF THE FILE
#define EXTENSION_LENGTH strlen(EXTENSION)

char temp[MAX];
//EXACT NUMBER OF PAGES IN THE SOURCE XML FILE
int totalPages;
//CONTAINS THE CURRENT PAGE NUMBER CONVERTED TO ITS CORRESPONDING FILENAME
char pageNumber[20];
//FILE POINTERS FOR READING THE PAGE FILE AND WRITING TO FINAL TEXT FILES
//TWO TEXT FILES ARE CREATED
//ONE FOR NON SORTED AND THE OTHER FOR SORTED DATA ACCORDING TO CO-ORDINATES OF THE TEXT ON THE PAGE
FILE *fr,*fw;

//STRUCTURE FOR THE CONTENTS OF A SINGLE LINE OF THE XML FILE
struct Line {
    int top;
    int left;
    int width;
    int height;
    int font;
    char text[10000];
};

//STRUCTURE FOR THE CONTENTS OF A SINGLE PAGE OF XML FILE
struct Page {
    struct Line line[MAX];
    int lines;
};

//STRUCTURE FOR THE PAGE HEADER
struct Header {
    int fontId;
    char fontSize[10];
    char color[10];
    struct Header *link;
};
typedef struct Header* HEADER;

struct Page pages[MAX];
HEADER head;
//CONTAINS THE FONTS FOR WHICH THE TEXT IS TO BE EXTRACTED
int fonts[MAX];
//CONTAINS TOTAL NUMBER OF FONTS
int totalFonts;

HEADER getHeader() {
    return((HEADER)malloc(1*sizeof(struct Header)));
}

void generatePages(char arg[50]) {
    char arr[100]="./genPages.out ";
    strcat(arr,arg);
    printf("Creating Pages\n");
    system("cc genPages.c -o genPages.out");
    system(arr);
    printf("Pages created\n");
}

void convertToText(int page) {
    int l,i,j;
    char rev[20];
    i=0;
    while(page!=0) {
        rev[i++]=(page%10)+48;
        page=page/10;
    }
    l=i;
    i--;
    for(j=0;j<l;j++) {
        pageNumber[j]=rev[i--];
    }
    for(i=0;i<EXTENSION_LENGTH;i++)
        pageNumber[i+l]=EXTENSION[i];
    pageNumber[i+l]='\0';
}

void fetchHeader() {
    int c,i;
    HEADER t,cur;
    while(1) {
        c=getc(fr);
        if(c=='<') {
            c=getc(fr);
            if(c=='f') {
                t=getHeader();
                t->link=NULL;
                while(!isdigit(c=getc(fr)));
                i=0;
                while(isdigit(c)) {
                    temp[i++]=c;
                    c=getc(fr);
                }
                temp[i]='\0';
                t->fontId=atoi(temp);
                while(!isdigit(c=getc(fr)));
                i=0;
                while(isdigit(c)) {
                    /* ... */

            /* ... */
            continue;
        }
        pages[pgNo].line[pages[pgNo].lines].text[i++]=c;
    }
    pages[pgNo].line[pages[pgNo].lines].text[i]='\0';
}

void fetchPageInfo(int pgNo) {
    int c,i;
    c=getc(fr);
    while(c!=EOF) {
        while(!isdigit(c=getc(fr)));
        i=0;
        while(isdigit(c)) {
            temp[i++]=c;
            c=getc(fr);
        }
        temp[i]='\0';
        pages[pgNo].line[pages[pgNo].lines].top=atoi(temp);
        while(!isdigit(c=getc(fr)));
        i=0;
        while(isdigit(c)) {
            temp[i++]=c;
            c=getc(fr);
        }
        temp[i]='\0';
        pages[pgNo].line[pages[pgNo].lines].left=atoi(temp);
        while(!isdigit(c=getc(fr)));
        i=0;
        while(isdigit(c)) {
            temp[i++]=c;
            c=getc(fr);
        }
        temp[i]='\0';
        pages[pgNo].line[pages[pgNo].lines].width=atoi(temp);
        while(!isdigit(c=getc(fr)));
        i=0;
        while(isdigit(c)) {
            temp[i++]=c;
            c=getc(fr);
        }
        temp[i]='\0';
        pages[pgNo].line[pages[pgNo].lines].height=atoi(temp);
        while(!isdigit(c=getc(fr)));
        i=0;
        while(isdigit(c)) {
            temp[i++]=c;
            c=getc(fr);
        }
        temp[i]='\0';
        pages[pgNo].line[pages[pgNo].lines].font=atoi(temp);
        printf("Fetching text for line %d\n",pages[pgNo].lines);
        c=getc(fr);
        fetchText(pgNo);
        pages[pgNo].lines++;
        c=getc(fr);
        c=getc(fr);
    }
}

void fetchFontId(int argc,char *argv[]) {
    int i;
    HEADER cur;
    for(i=3;i<argc-1;i=i+2) {
        cur=head;
        while(cur!=NULL) {
            if((strcmp(argv[i],cur->fontSize)==0)&&(strcmp(argv[i+1],cur->color)==0)) {
                fonts[totalFonts++]=cur->fontId;
            }
            cur=cur->link;
        }
    }
}

void createPages() {
    int i;
    for(i=1;i<=totalPages;i++) {
        convertToText(i);
        fr=fopen(pageNumber,"r");
        if(fr==NULL) {
            printf("Cannot open the file %s\nExitting\n",pageNumber);
            exit(0);
        }
        printf("Fetching information of Page %d\n",i);
        fetchHeader();
        pages[i].lines=0;
        fetchPageInfo(i);
        printf("Information of Page %d fetched\n",i);
        fclose(fr);
    }
}

int checkFont(int fnt) {
    int i;
    for(i=0;i<totalFonts;i++) {
        if(fnt==fonts[i])
            return(1);
    }
    return(0);
}

void sortPage(int pgNo) {
    struct Line temp;
    int i,j;
    for(i=0;i<pages[pgNo].lines-1;i++) {
        for(j=i+1;j<pages[pgNo].lines;j++) {
            if(pages[pgNo].line[i].top>=pages[pgNo].line[j].top) {
                if(pages[pgNo].line[i].top==pages[pgNo].line[j].top) {
                    if(pages[pgNo].line[i].left>pages[pgNo].line[j].left) {
                        temp=pages[pgNo].line[i];
                        pages[pgNo].line[i]=pages[pgNo].line[j];
                        pages[pgNo].line[j]=temp;
                    }
                }
                else {
                    temp=pages[pgNo].line[i];
                    pages[pgNo].line[i]=pages[pgNo].line[j];
                    pages[pgNo].line[j]=temp;
                }
            }
        }
    }
}

void writeText(int pgNo) {
    int i;
    for(i=0;i<pages[pgNo].lines;i++) {
        if(checkFont(pages[pgNo].line[i].font)) {
            fputs(pages[pgNo].line[i].text,fw);
            putc('\n',fw);
        }
    }
}

void createTextFile(char arg[MAX]) {
    int i;
    for(i=1;i<=totalPages;i++) {
        writeText(i);
    }
    fclose(fw);
    strcat(arg,"_sorted");
    fw=fopen(arg,"w");
    if(fw==NULL) {
        printf("Cannot create the file %s\nEXITING\n",arg);
        return;
    }
    for(i=1;i<=totalPages;i++) {
        sortPage(i);
        writeText(i);
    }
}

int main(int argc,char *argv[]) {
    totalPages=atoi(argv[2]);
    generatePages(argv[1]);
    head=NULL;
    createPages();
    totalFonts=0;
    fetchFontId(argc,argv);
    fw=fopen(argv[argc-1],"w");
    if(fw==NULL) {
        printf("Cannot create the file %s\nEXITING\n",argv[argc-1]);
        return(0);
    }
    createTextFile(argv[argc-1]);
    fclose(fw);
    return(0);
}

Code for dividing the XML file into pages in accordance with the PDF file used:

#include<stdio.h>
#include<string.h>

#define MAX 2000
#define START_PATTERN "<page"
#define START_PATTERN_LENGTH strlen(START_PATTERN)
#define END_PATTERN "</page>"
#define END_PATTERN_LENGTH strlen(END_PATTERN)
#define EXTENSION ".xml"
#define EXTENSION_LENGTH strlen(EXTENSION)

FILE *fr,*fw;
char temp[MAX];
char pageNumber[20];

void convertToText(int page) {
    int l,i,j;
    char rev[20];
    i=0;
    while(page!=0) {
        rev[i++]=(page%10)+48;
        page=page/10;
    }
    /* ... */

    /* ... */
    }
    do {
        if(skip()==EOF)
            break;
        convertToText(page);
        fw=fopen(pageNumber,"w");
        if(fw==NULL) {
            printf("Cannot create file %s\n",pageNumber);
            return(0);
        }
        else
            printf("File for Page Number %s created\n",pageNumber);
        do {
            c=getc(fr);
            if(c=='<') {
                ungetc(c,fr);
                r=checkPageEnd();
                if(r)
                    break;
                //putc('<',fw);
                for(i=0;temp[i]!='\0';i++)
                    putc(temp[i],fw);
            }
            else {
                putc(c,fw);
            }
        }while(1);
        fclose(fw);
        page++;
    }while(c!=EOF);
    fclose(fr);
    return(0);
}
21
Word-Sense Disambiguation (WSD)
In computational linguistics, word-sense disambiguation (WSD) is an open problem of natural language processing: the task of identifying which sense (i.e. meaning) of a word is used in a sentence when the word has multiple meanings. A solution to this problem has an impact on other language-processing tasks, such as discourse analysis, improving the relevance of search engines, anaphora resolution and coherence.
A disambiguation process strictly requires two things: a dictionary to specify the senses which are to be disambiguated, and a corpus of language data to be disambiguated (in some methods, a training corpus of language examples is also required). The WSD task has two variants: the "lexical sample" task and the "all words" task. The former involves disambiguating the occurrences of a small sample of previously selected target words, while in the latter all the words in a piece of running text need to be disambiguated. The latter is deemed a more realistic form of evaluation, but the corpus is more expensive to produce, because human annotators have to read the definitions for each word in the sequence every time they need to make a tagging judgement, rather than once for a block of instances of the same target word.
To give a hint of how all this works, consider two of the distinct senses that exist for the (written) word "bass":
• a type of fish
• tones of low frequency
and the sentences:
• I went fishing for some sea bass.
• The bass line of the song is too weak.
To a human it is obvious that the first sentence uses "bass (fish)", as in the former sense above, while the second sentence uses "bass (sound)", as in the latter sense. Developing algorithms to replicate this human ability can often be a difficult task, as is further exemplified by the implicit equivocation between "bass (sound)" and "bass (musical instrument)".

C Language Integrated Production System:
CLIPS is an expert system tool originally developed by the Software Technology Branch (STB) at the NASA/Lyndon B. Johnson Space Center. Since its first release in 1986, CLIPS has undergone continual refinement and improvement, and it is now used by thousands of people around the world. CLIPS is designed to facilitate the development of software that models human knowledge or expertise. There are three ways to represent knowledge in CLIPS:
• Rules, which are primarily intended for heuristic knowledge based on experience.
• Deffunctions and generic functions, which are primarily intended for procedural knowledge.
• Object-oriented programming, also primarily intended for procedural knowledge. The five generally accepted features of object-oriented programming are supported: classes, message-handlers, abstraction, encapsulation, inheritance and polymorphism. Rules may pattern-match on objects and facts.
We can develop software using only rules, only objects, or a mixture of objects and rules. CLIPS has also been designed for integration with other languages such as C and Java. Rules and objects form an integrated system, since rules can pattern-match on facts and objects. In addition to being used as a stand-alone tool, CLIPS can be called from a procedural language, perform its function, and then return control back to the calling program. Likewise, procedural code can be defined as external functions and called from CLIPS; when the external code completes execution, control returns to CLIPS. CLIPS is an excellent tool for word-sense disambiguation.
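To make the overlap idea behind dictionary-based WSD concrete, the toy C program below disambiguates "bass" by counting how many words of each sense gloss also occur in the sentence, and picking the sense with the larger overlap. The glosses are invented for this sketch; a real WSD component, whether rule-based (for example in CLIPS) or statistical, would use a proper sense inventory and far richer evidence.

#include <stdio.h>
#include <string.h>
#include <ctype.h>

struct sense { const char *label; const char *gloss; };

/* invented sense glosses, standing in for dictionary definitions */
static const struct sense senses[] = {
    { "bass (fish)",  "a type of fish caught in sea or fresh water" },
    { "bass (sound)", "tones of low frequency in music or a song"   },
};

/* lowercase a string in place */
static void lower(char *s)
{
    for (; *s; s++) *s = (char)tolower((unsigned char)*s);
}

/* count how many words of the gloss also occur (as substrings) in the
   sentence; crude, but enough to illustrate the overlap idea */
static int overlap(const char *gloss, const char *sentence)
{
    char g[256], sent[256], word[64];
    strncpy(g, gloss, sizeof g - 1);          g[sizeof g - 1] = '\0';
    strncpy(sent, sentence, sizeof sent - 1); sent[sizeof sent - 1] = '\0';
    lower(g);
    lower(sent);

    int count = 0, consumed = 0;
    const char *p = g;
    while (sscanf(p, "%63s%n", word, &consumed) == 1) {
        if (strlen(word) > 3 && strstr(sent, word) != NULL)
            count++;
        p += consumed;
    }
    return count;
}

static const char *disambiguate(const char *sentence)
{
    int best = 0, best_score = -1;
    for (int s = 0; s < 2; s++) {
        int score = overlap(senses[s].gloss, sentence);
        if (score > best_score) { best_score = score; best = s; }
    }
    return senses[best].label;
}

int main(void)
{
    const char *s1 = "I went fishing for some sea bass.";
    const char *s2 = "The bass line of the song is too weak.";
    printf("%s -> %s\n", s1, disambiguate(s1));
    printf("%s -> %s\n", s2, disambiguate(s2));
    return 0;
}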
23
Conclusion
MT is relatively new in India, about a decade old. In comparison with MT efforts in Europe and Japan, which are at least three decades old, it would seem that Indian MT has a long way to go. However, this can also be an advantage, because Indian researchers can learn from the experience of their global counterparts. There are close to a dozen projects now, with about six of them in the advanced prototype or technology-transfer stage and the rest newly initiated. The Indian NLP/MT scene so far has been characterized by an acute scarcity of basic lexical resources such as corpora, MRDs, lexicons, thesauri and terminology banks. Also, the various MT groups have used different formalisms best suited to their specific applications, and hence there has been little sharing of resources among them. These issues are being addressed now: there are governmental as well as voluntary efforts under way to develop common lexical resources and to create forums for consolidating and coordinating NLP and MT efforts. It appears that the exploratory phase of Indian MT is over and the consolidation phase is about to begin, with the focus moving from proof-of-concept prototypes to productization, deployment, collaborative resource sharing and evaluation.
The core Anusaaraka output is in a language close to the target language, and can be understood by the human reader after some training. The question is how much training is necessary to get a very high degree of comprehension. Our experience of working among Indian languages shows that this training is likely to be small. The reason for this is that India forms a linguistic area: Indian languages share vocabulary and grammatical constructions, as well as pragmatics and culture. A similar approach can be applied to build an English-to-Hindi Anusaaraka, and a study can be conducted on the training required to read such an output. The expectation is that a usable English-to-Hindi system can be built, except that it will require longer training.