Summer Research Project (Anusaaraka) Report
Abstract
Anusaaraka is an English-Hindi language accessing software. With insights from
Panini's Ashtadhyayi (grammar rules), Anusaaraka is a machine translation tool
being developed by the Chinmaya International Foundation (CIF), the International
Institute of Information Technology, Hyderabad (IIIT-H) and the University of
Hyderabad (Department of Sanskrit Studies). The fusion of traditional Indian shastras
and advanced modern technologies is what Anusaaraka is all about.
Anusaaraka allows users to access text in any Indian language after
translation from the source language (i.e. English or any other regional Indian
language). In today's Information Age, large volumes of information are available in
English, whether it be information for competitive exams or even general reading.
However, many of the educated masses whose primary language is Hindi or a
regional Indian language are unable to access information in English. Anusaaraka
aims to bridge this language barrier by allowing a user to enter an English text
into Anusaaraka and get a translation of it in an Indian language. The
Anusaaraka being referred to here has English as the source language and Hindi as
the target language.
Anusaaraka derives its name from the Sanskrit word 'Anusaran', which
means 'to follow'. It is so called because the translated Anusaaraka output appears in
layers, i.e. a sequence of steps that follow each other until the final translation is
displayed to the user.
International Institute of Information
Technology (IIIT), Hyderabad
The International Institute of Information Technology, Hyderabad (IIIT-H) is an
autonomous university founded in 1998. It was set up as a not-for-profit public-private
partnership (NPPP) and is the first IIIT to be set up (under this model) in
India. The Government of Andhra Pradesh lent support to the institute by a grant of
land and buildings. A Governing Council consisting of eminent people from
academia, industry and government presides over the governance of the institution.
IIIT-H was set up as a research university focused on the core areas of
Information Technology, such as Computer Science, Electronics and
Communications, and their applications in other domains. The institute has evolved
strong research programs in a host of areas, with computation or IT providing the
connecting thread, and with an emphasis on the development of technology and
applications that can be transferred for use to industry and society. This has
required carrying out basic research that can be used to solve real-life problems.
As a result, a synergistic relationship has come to exist at the Institute between
basic and applied research. Faculty members carry out a number of academic and
industrial projects, and a few companies have been incubated based on the research
done at the Institute.
IIIT-H is organized into research centers and labs, instead of the
conventional departments, to facilitate inter-disciplinary research and a seamless
flow of knowledge within the Institute. Faculty assigned to the centers and labs
conduct research, as well as academic programs, which are owned by the Institute
and not by individual research centers.
Machine Translation
Machine Translation is an important technology for localization, and is
particularly relevant in a linguistically diverse country like India. Human
translation in India is a rich and ancient tradition. Works of philosophy, arts,
mythology, religion, science and folklore have been translated among the ancient
and modern Indian languages. Numerous classic works of art, ancient, medieval
and modern, have also been translated between European and Indian languages
since the 18th century. In the current era, human translation finds application
mainly in administration, media and education, and to a lesser extent in
business, the arts, and science and technology. India is a linguistically rich area: it
has 18 constitutional languages, which are written in 10 different scripts. Hindi is
the official language of the Union. English is very widely used in the media,
commerce, science and technology, and education. Many of the states have their
own regional language, which is either Hindi or one of the other constitutional
languages. Only about 5% of the population speaks English. In such a situation,
there is a big market for translation between English and the various Indian
languages. Currently, this translation is essentially manual, and the use of automation is
largely restricted to word processing. Two specific examples of high-volume
manual translation are the translation of news from English into local languages,
and the translation of annual reports of government departments and public sector units
among English, Hindi and the local language.
As is clear from the above, the market is largest for translation from English
into Indian languages, primarily Hindi. Hence, it is no surprise that a majority of
Indian Machine Translation (MT) systems are for English-Hindi translation.
Natural language processing presents many challenges, of which the biggest is the
inherent ambiguity of natural language. MT systems have to deal with ambiguity,
among various other NL phenomena. In addition, the linguistic divergence between the
source and target languages makes MT a bigger challenge. This is particularly true
of widely divergent languages such as English and the Indian languages. The major
structural difference between English and Indian languages can be summarized as
follows. English is a highly positional language with rudimentary morphology
and a default SVO (subject-verb-object) sentence structure. Indian languages are highly
inflectional, with a rich morphology, relatively free word order, and a default SOV
(subject-object-verb) sentence structure. In addition,
there are many stylistic differences. For example, it is common to see very long
sentences in English, using abstract concepts as the subjects of sentences, and
stringing several clauses together (as in this sentence!). Such constructions are not
natural in Indian languages, and present major difficulties in producing good
translations.
As is recognized the world over, with the current state of the art in MT, it is
not possible to have Fully Automatic, High Quality, General-Purpose Machine
Translation. Practical systems need to handle ambiguity and the other complexities
of natural language processing by relaxing one or more of the above dimensions.
Thus, we can have automatic high-quality 'sub-language' systems for specific
domains, or automatic general-purpose systems giving rough translations, or
interactive general-purpose systems with pre- or post-editing.
Why Machine Translation?
Today, technology has made it possible for individuals worldwide to access large
volumes of information at the click of a button. However, very often the
information sought may not be in a language that the individual is familiar with.
Thus, Machine Translation is an endeavor to minimize the language barrier by
making it possible to access a text in the language of one's choice. For technology
to be able to provide the above facility, many aspects of language are involved.
To name a few:
• Script
• Spelling
• Vocabulary
• Morphology
• Syntax
Keeping the above in mind, machine translation systems need to be
equipped to translate a text within seconds and yet capture the information of the
text to the best possible extent.
Anusaaraka
The focus in Anusaaraka is not mainly on machine translation, but on Language
Access between Indian languages. Using principles of Paninian Grammar (PG), and
exploiting the close similarity of Indian languages, Anusaaraka essentially maps
local word groups between the source and target languages. Where there are
differences between the languages, the system introduces extra notation to
preserve the information of the source language. Thus, the user needs some
training to understand the output of the system. The project has developed
Language Accessors from many Indian languages into Hindi.
Anusaaraka maps constructions in the source language to the
corresponding constructions in the target language wherever possible. For
example, a noun or pronoun in the source language is mapped to an appropriate
noun or pronoun, respectively, in the target language as shown below:
@H: Apa pustaka paDha_raHA_[HE|thA]_kyA{23_ba.}?
!E: You book read_ing_[is|was] Q.?
E: Are/were you reading a book?
(Where the prefixes mean the following:
@H = Anusaaraka Hindi, !E = English gloss, E = English.)
In the example above, the last word in the sentence is a verb and illustrates the
mapping morpheme by morpheme: the root is mapped to 'paDha' (read), and
similarly the tense-aspect-modality (TAM) label is mapped to 'raHA_[HE|thA]'
(is_*ing or was_*ing), which is followed by a suffix that gets mapped to 'kyA'
(what), serving as a question marker in Hindi. Gender, number and person (GNP) information
is also shown separately in curly brackets ('{23_ba.}' for second or third person,
plural).
Sometimes, for a construction in the source language, the same
construction is not available in the target language. In such a case, the system
chooses another construction in the target language in which the same information
can be expressed. In the example below, the system chooses the complementizer
construction in Hindi (EsA) to express the same sense:
@H: hamArA_ ladakI_ko` nOkarI karanA_EsA nahIM_[hE|WA].
!E: Our daughter (dat.) job do_should_that not (fem.)
E: It is not the case that our daughter should get a job.
However, Anusaaraka preserves the image of the source construction, and therefore
it uses the complementizer (EsA). Sometimes there are slight differences
between a construction in the source language and a similar construction in the
target language, because of which information might not be preserved. In such a
situation, additional notation is introduced to express the information which would
otherwise get lost. A simple example of this is the lack of distinction between the
personal pronoun and the pronominal adjective in Hindi: vaha.
@H: vaha` pAThshAlA_ko` gayA.
!E: he school (dat.) went.
E: He went to school.
@H: vaha- pAThshAlA_ko` TrophI AyI.
!E: that school (dat.) trophy came
E: That school received the trophy.
When transferring from one language to the other, this distinction would have
disappeared if care were not taken. In Anusaaraka, the two forms are made
distinct by introducing additional notation:
vaha` (he)
vaha- (that)
Salient Features of Anusaaraka
Faithful representation of text in source language:
Throughout the various layers of Anusaaraka output, there is an effort to ensure
that the user is able to understand the information contained in the English
sentence. This is given greater importance than producing perfect sentences in Hindi,
for it would be pointless to have a translation that reads well but does not truly
capture the information of the source text.
The layered output is unique to Anusaaraka. Thus, the source language text
information, and how the Hindi translation is finally arrived at, can be accessed by
the user. The important feature of the layered output is that the information
transfer is done in a controlled manner at every step, making it possible to
revert without any loss of information. Also, any loss of information that
cannot be avoided in the translation process happens in a gradual way.
Therefore, even if the translated sentence is not as 'perfect' as a human translation,
with some effort and orientation in reading Anusaaraka output, an individual can
understand what the source text is implying by looking at the layers and the context in
which the sentence appears.
Reversibility:
The feature of gradual transference of information from one layer to the next
gives Anusaaraka the additional advantage of bringing reversibility into the
translation process, a feature which cannot be achieved by a conventional
machine translation system. A bilingual user of Anusaaraka can, at any point,
access the source language text in English, because of the transparency of the
output. Some amount of orientation on how to read the Anusaaraka output would be
required for this.
Transparency:
Display of the step-by-step translation layers gives an increased level of confidence to
the end-user, who can trace back to the source and get clarity regarding the translated
text by analysis of the output layers and some reference to context.
Champollion
Champollion is a robust parallel text sentence aligner. Parallel text is a very
valuable resource for a number of natural language processing tasks, including
machine translation, cross-language information retrieval, and word-sense
disambiguation. Parallel text provides the maximum utility when it is sentence
aligned. The sentence alignment process maps sentences in the source text to their
translations. The labour-intensive and time-consuming nature of manual sentence
alignment makes large parallel text corpus development difficult. Thus, a number of
automatic sentence alignment approaches have been proposed and utilized; some
are purely length-based approaches, some are lexicon-based, and some are a mixture
of the two.
While existing approaches perform reasonably well on close language
pairs, their performance degrades quickly on remote language pairs such as English
and Chinese. Performance degradation is exacerbated by noise in the data.
Champollion was initially developed for aligning Chinese-English
parallel text. It was later ported to other language pairs, including Arabic-English
and Hindi-English.
Champollion differs from other sentence aligners in two ways. First, it
assumes a noisy input, i.e. that a large percentage of alignments will not be one-to-one
alignments, and that the number of deletions and insertions will be significant. It is
therefore biased against declaring a match in the absence of lexical evidence. Non-lexical
measures, such as sentence length information, which are often unreliable
when dealing with noisy data, can and should still be used, but they should only
play a supporting role when lexical evidence is present. Second, Champollion
differs from other lexicon-based approaches in assigning weights to translated
words. Translation lexicons usually help sentence aligners in the following way:
first, translated words are identified by using entries from a translation lexicon;
second, statistics of translated words are then used to identify sentence
correspondences.
In most existing sentence alignment algorithms, translated words are
treated equally, i.e. translated word pairs are assigned equal weight when deciding
sentence correspondences. Existing algorithms also tend to assume fairly clean data:
for example, 1-1 alignments constitute 89% of the UBS English-French corpus, while
1-0 and 0-1 alignments constitute merely 1.3%. However, when creating very large
parallel corpora, the data can be very noisy. For example, in a UN Chinese-English
corpus, 6.4% of all alignments are either 1-0 or 0-1 alignments.
Some of the omissions and insertions were introduced during the
translation of the text. Most of the omissions and insertions, however, are
introduced during the different stages of processing before sentence alignment is
carried out. The pre-processing steps include converting the raw data to plain text
format, removing tables, footnotes, endnotes, etc. Most of these steps introduce
noise. For instance, while a table in an English document can be completely
removed, this is not necessarily the case in the corresponding Chinese document. Because
of the sheer number of documents involved, manually examining each document
after pre-processing is impossible. A robust sentence aligner needs not only to
detect most categories of noise, but also to recover quickly if an error is made.
Existing methods have been shown to work very well on clean data, but their
performance degrades quickly as the data becomes noisy.
CODES
Code for extracting regular text from an XML file:
#include<stdio.h>
#include<string.h>
#include<stdlib.h>
//MAXIMUM NUMBER OF PAGES ALLOWED
#define MAX 200
//EXTENSION OF THE FILES BEING CREATED FOR EACH PAGE
#define EXTENSION ".xml"
//LENGTH OF THE EXTENSION OF THE FILE
#define EXTENSION_LENGTH strlen(EXTENSION)
char temp[MAX];
//EXACT NUMBER OF PAGES IN THE SOURCE XML FILE
int totalPages;
//CONTAINS THE CURRENT PAGE NUMBER CONVERTED TO ITS CORRESPONDING FILENAME
char pageNumber[20];
//FILE POINTERS FOR READING THE PAGE FILE AND WRITING TO FINAL TEXT FILES
//TWO TEXT FILES ARE CREATED
//ONE FOR NON-SORTED AND THE OTHER FOR SORTED DATA, ACCORDING TO COORDINATES OF THE TEXT ON THE PAGE
FILE *fr,*fw;
//STRUCTURE FOR THE CONTENTS OF A SINGLE LINE OF THE XML FILE
struct Line
{
int top;
int left;
int width;
int height;
int font;
char text[10000];
};
//STRUCTURE FOR THE CONTENTS OF A SINGLE PAGE OF XML FILE
struct Page
{
struct Line line[MAX];
int lines;
};
//STRUCTURE FOR THE PAGE HEADER
struct Header
{
int fontId;
char fontSize[10];
char color[10];
struct Header *link;
};
typedef struct Header* HEADER;
struct Page pages[MAX];
HEADER head;
//CONTAINS THE FONTS FOR WHICH THE TEXT IS TO BE EXTRACTED
int fonts[MAX];
//CONTAINS TOTAL NUMBER OF FONTS
int totalFonts;
HEADER getHeader()
{
return((HEADER)malloc(1*sizeof(struct Header)));
}
Word-Sense Disambiguation
(WSD)
In computational linguistics, word-sense disambiguation (WSD) is an open problem
of natural language processing: the process of identifying which sense of a word
(i.e. meaning) is used in a sentence when the word has multiple meanings. The
solution to this problem impacts other computer-based language processing, such
as discourse analysis, improving the relevance of search engines, anaphora
resolution and coherence. A disambiguation process requires two strict inputs: a
dictionary to specify the senses which are to be disambiguated, and a corpus of
language data to be disambiguated (in some methods, a training corpus of language
examples is also required). The WSD task has two variants: the "lexical sample"
task and the "all words" task. The former comprises disambiguating the occurrences
of a small sample of previously selected target words, while in the latter all the
words in a piece of running text need to be disambiguated. The latter is deemed a
more realistic form of evaluation, but the corpus is more expensive to produce,
because human annotators have to read the definitions for each word in the
sequence every time they need to make a tagging judgement, rather than once for a
block of instances of the same target word.
To give a hint of how all this works, consider two of the distinct
senses that exist for the (written) word "bass":
a type of fish
tones of low frequency
and the sentences:
I went fishing for some sea bass.
The bass line of the song is too weak.
To a human, it is obvious that the first sentence is using the word "bass"
in the former sense (a fish), while in the second sentence the word is being used in
the latter sense (low-frequency tones). Developing algorithms to replicate this human
ability can often be a difficult task, as is further exemplified by the implicit
equivocation between "bass (sound)" and "bass (musical instrument)".
C Language Integrated Production System:
CLIPS is an expert system tool originally developed by the Software
Technology Branch (STB) at the NASA/Lyndon B. Johnson Space Center. Since its first
release in 1986, CLIPS has undergone continual refinement and improvement. It is
now used by thousands of people around the world. CLIPS is designed to facilitate
the development of software to model human knowledge or expertise. There are
three ways to represent knowledge in CLIPS:
• Rules, which are primarily intended for heuristic knowledge based on experience.
• Deffunctions and generic functions, which are primarily intended for procedural
knowledge.
• Object-oriented programming, also primarily intended for procedural knowledge.
The five generally accepted features of object-oriented programming are
supported: classes, message-handlers, abstraction, encapsulation, inheritance and
polymorphism. Rules may pattern-match on objects and facts.
We can develop software using only rules, only objects, or a mixture of
objects and rules. CLIPS has also been designed for integration with other
languages such as C and Java. Rules and objects form an integrated system,
since rules can pattern-match on facts and objects. In addition to being used as a
stand-alone tool, CLIPS can be called from a procedural language, perform its
function, and then return control back to the calling program. Likewise, procedural
code can be defined as external functions and called from CLIPS. When the
external code completes execution, control returns to CLIPS. CLIPS is an excellent
tool for word-sense disambiguation.
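As an illustration of the rule syntax, a hypothetical CLIPS rule for the "bass" example might assert the fish sense whenever a fishing-related word occurs in the same sentence. The `word` and `sense` templates here are assumptions for illustration, not part of any existing Anusaaraka module:

```clips
(deftemplate word (slot text) (slot sentence))
(deftemplate sense (slot word) (slot meaning))

(defrule bass-is-a-fish
   (word (text "bass") (sentence ?s))
   (word (text "fishing") (sentence ?s))
   =>
   (assert (sense (word "bass") (meaning "fish"))))
```

When facts for "bass" and "fishing" share the same sentence number, the rule fires and asserts the fish reading; a companion rule keyed on musical context words would assert the other sense.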
Conclusion
MT is relatively new in India – about a decade old. In comparison with MT efforts
in Europe and Japan, which are at least 3 decades old, it would seem that Indian
MT has a long way to go. However, this can also be an advantage, because Indian
researchers can learn from the experience of their global counterparts. There are
close to a dozen projects now, with about 6 of them being in advanced prototype or
technology transfer stage, and the rest having been newly initiated.
The Indian NLP/MT scene so far has been characterized by an acute
scarcity of basic lexical resources such as corpora, MRDs, lexicons, thesauri and
terminology banks. Also, the various MT groups have used different formalisms
best suited to their specific applications, and hence there has been little sharing of
resources among them. These issues are being addressed now. There are
governmental as well as voluntary efforts under way to develop common lexical
resources, and to create forums for consolidating and coordinating NLP and MT
efforts. It appears that the exploratory phase of Indian MT is over, and the
consolidation phase is about to begin, with the focus moving from proof-of-concept
prototypes to productionization, deployment, collaborative resource
sharing and evaluation.
The core Anusaaraka output is in a language close to the target
language, and can be understood by the human reader after some training. The
question is how much training is necessary to get a very high degree of
comprehension. Our experience of working among Indian languages shows that this
training is likely to be small. The reason for this is that India forms a linguistic area:
Indian languages share vocabulary and grammatical constructions, as well as
pragmatics and culture. A similar approach can be applied to build an English to
Hindi Anusaaraka. A study can be conducted on the training required to read
such an output. The expectation is that a usable English to Hindi system can be built,
except that it will require longer training.