This document provides an overview of basic probability concepts and statistical methods. It discusses probability as it relates to outcomes and events, and as the tool statistics uses to make inferences about a population from a random sample. It then covers n-gram language models, which use the previous n-1 words to predict the next word, and collocations. Finally, it summarizes part-of-speech tagging methods, including rule-based, supervised stochastic, and unsupervised approaches, and lists freely available POS taggers for various languages.
1. Basic concepts of Probability
and Statistics
Thennarasu Sakkan
Department of Linguistics
Central University of Kerala
2. A probability provides a quantitative description of the chances or likelihoods associated with various outcomes.
Probability is the tool that statistical methods use in order to make inferences about the characteristics of a population given a random sample of data.
Understanding probability is therefore key to understanding statistics.
3. The probability of an event A:
P(A) = N_A / N
where N is the number of possible outcomes of the random experiment and N_A is the number of outcomes favourable to the event A.
For example, a 6-sided die has 6 possible outcomes, 3 of which are even, and thus
P(even) = 3/6 = 1/2
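As a small illustrative sketch (not part of the original slides), the same calculation can be written in Python; the outcome set and the event are the only assumptions:

from fractions import Fraction

# Classical probability: P(A) = N_A / N
outcomes = [1, 2, 3, 4, 5, 6]                      # all outcomes of a fair six-sided die
favourable = [o for o in outcomes if o % 2 == 0]   # outcomes favourable to the event "even"
p_even = Fraction(len(favourable), len(outcomes))
print(p_even)                                      # 1/2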
4. Probability theory is a formal way of representing probabilistic concepts and describing uncertain events.
A probability is a mapping from the set of events, or sample space, into the interval [0, 1].
Intuitively, the probability of a particular event or set of events is the fraction of the time that the event or set of events occurs.
Thus, a probability mapping goes from the set of all possible events to their respective probabilities of occurring.
5. Probability’s empirical counterparts are proportions
(between 0 and 1) and percentages (between 0 and
100).
Since something must always occur, probabilities
always add up to 1 (as long as all possible events are
included in the sum).
Since no one event can happen less than 0% of the
time or more than 100% of the time, an individual
probability must be between 0 and 1.
6. LANGUAGE MODEL
Language modelling refers to the task of modelling a language using probabilities.
A language model is one of the important components of statistical machine translation.
This component accounts for the fluency of the given language, i.e. it quantifies how probable a given sentence is, assigning high probability to plausible sentences.
7. A language model gives no guarantee about the syntax or semantics of the language being modelled.
An n-gram is a contiguous sequence of n items from a given sequence of text.
Let us start with word prediction using simple n-grams.
Our goal is to calculate the probability of a word w given some history h, or mathematically Pr(w|h).
The n-gram model is a widely used language-modelling tool, found crucial in applications such as speech recognition, spelling correction, word prediction, POS tagging, natural language generation and word similarity.
8. An n-gram model
An n-gram model is a type of probabilistic model for predicting the next item in a text sequence.
n-grams are used in various areas of statistical natural language processing and genetic sequence analysis.
The model uses the previous n-1 words in a sequence to predict the next word.
The items in question can be phonemes, syllables, letters, words or base pairs according to the application.
9. N-gram models can be imagined as placing a small
window over a sentence or a text, in which only n
words are visible at the same time.
The simplest n-gram model is therefore a so-called
unigram model.
This is a model in which we only look at one word at
a time.
An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size 3 is a "trigram"; and size 4 or more is simply called an "n-gram".
14. Collocations
The notion of collocation has been used in lexicography since the 19th century.
What is a collocation?
A collocation is a pair or group of words that are often used together.
These combinations sound natural to native speakers, but learners of the language have to make a special effort to learn them because they are often difficult to guess.
15. A straightforward application of bigrams is the identification of so-called collocations.
Recall that bigram language models exploit the observation that words do not simply combine in any random order; that is, word order is constrained by grammatical structure (e.g. phrases).
However, some combinations of words are subject to an additional kind of constraint.
16. Such combinations are commonly known as collocations.
– Examples of collocations are:
• United States
• vice president
• chief executive, chief officer, etc.
Corpus linguists study such collocations to answer interesting questions about the combinatory properties of words.
Collocations are a feature of natural languages that is not well addressed by current language teaching or by the current models used for NLP.
17. According to Benson et al., there are two types of collocations: (i) lexical and (ii) grammatical collocations.
(i) Lexical collocations, such as
noun + noun,
adjective + noun;
(ii) grammatical collocations, such as
noun + suffixes, etc.
29. How do we generate collocations from a corpus text?
The goal is to extract a list of collocations in current use, as sketched below.
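One possible sketch uses NLTK's collocation finder; the corpus and the frequency threshold here are assumptions chosen only for illustration:

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# nltk.download("genesis")  # one-time download of the sample corpus
words = nltk.corpus.genesis.words("english-web.txt")

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(3)            # ignore bigrams seen fewer than 3 times
print(finder.nbest(measures.pmi, 10))  # 10 bigrams with the highest pointwise mutual information

Any tokenised word list can be passed to from_words, so the same few lines work for one's own corpus.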
30. POS tagging and approaches
Part-of-speech (POS) tagging is the process of assigning a part-of-speech category to each word in a text.
POS tagging is considered to be an important process in speech recognition, natural language parsing, morphological parsing, information retrieval and machine translation.
An automatic part-of-speech tagger can also help in building automatic word-sense disambiguation algorithms.
31. Part-of-speech tags are very often used for shallow parsing of texts, or for finding noun phrases and other phrases for information extraction applications.
Corpora that have been marked for part of speech are very useful for linguistic research, for example, to find the frequencies of a particular word or of particular sentence constructions in large corpora.
Apart from these, many Natural Language Processing (NLP) activities such as summarization, Natural Language Understanding (NLU) and Question Answering (QA) systems depend on part-of-speech tagging.
32. Approaches to POS Tagging
POS taggers are broadly classified into three categories: rule-based, empirical and hybrid.
In the rule-based approach, hand-written rules are used to resolve tag ambiguity.
Empirical POS taggers are further classified into example-based and stochastic taggers.
33. Stochastic taggers are either HMM-based, choosing the tag sequence which maximizes the product of word likelihood and tag sequence probability, or cue-based, using decision trees or maximum entropy models to combine probabilistic features.
Stochastic taggers are further classified into supervised and unsupervised taggers.
Each of these supervised and unsupervised taggers is categorized into different groups as below:
35. Classification of POS tagging models
Supervised
– Rule-based (e.g. Brill)
– Stochastic (N-gram based, Maximum Likelihood, Hidden Markov Model)
– Neural
Unsupervised
– Rule-based (e.g. Brill)
– Stochastic (Baum-Welch algorithm, Viterbi algorithm)
– Neural
36. Rule-based taggers generally involve a large database of hand-written disambiguation rules, for example, a rule specifying that an ambiguous word is a noun rather than a verb if it follows a determiner.
Among rule-based part-of-speech taggers, the one built by Brill has the advantage of learning tagging rules automatically.
Stochastic taggers generally resolve tagging ambiguities by using a training corpus to compute the probability of a given word having a given tag in a given context.
37. Supervised POS tagging
The supervised POS tagging models require pre-tagged corpora, which are used during training to learn rule sets, information about the tagset, word-tag frequencies, etc.
The learning tool generates trained models along with the statistical information.
The performance of the models generally increases with the size of the pre-tagged corpus.
38. Unsupervised POS tagging
Unlike the supervised models, the unsupervised POS tagging models do not require a pre-tagged corpus.
Instead, they use advanced computational methods like the Baum-Welch algorithm to automatically induce tagsets, transformation rules, etc.
Based on this information, they either calculate the probabilistic information needed by stochastic taggers or induce the contextual rules needed by rule-based or transformation-based systems.
39. Rule-based POS tagging
The rule-based POS tagging models apply a set of hand-written rules and use contextual information to assign POS tags to words in a sentence.
These rules are often known as context frame rules. For example, a context frame rule might say something like:
"If an ambiguous/unknown word X is preceded by a Determiner and followed by a Noun, tag it as an Adjective."
The transformation-based approaches, on the other hand, use a pre-defined set of handcrafted rules as well as automatically induced rules that are generated during training.
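The context frame rule quoted above can be sketched in a few lines of Python; the rule, the tag names and the tiny lexicon are illustrative only, not a real tagger:

# Toy context-frame rule: if an unknown word is preceded by a determiner
# and followed by a noun, tag it as an adjective.
LEXICON = {"the": "DET", "a": "DET", "dog": "NOUN", "house": "NOUN"}

def tag(words):
    tags = [LEXICON.get(w, "UNK") for w in words]
    for i, t in enumerate(tags):
        if (t == "UNK" and 0 < i < len(tags) - 1
                and tags[i - 1] == "DET" and tags[i + 1] == "NOUN"):
            tags[i] = "ADJ"   # the context-frame rule fires
    return list(zip(words, tags))

print(tag(["the", "shaggy", "dog"]))
# [('the', 'DET'), ('shaggy', 'ADJ'), ('dog', 'NOUN')]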
40. Some models also use information about capitalization and punctuation, the usefulness of which is largely dependent on the language being tagged.
The earliest algorithms for automatically assigning part of speech were based on a two-stage architecture [Harris Z. S., 1962].
The first stage used a dictionary to assign each word a list of potential parts of speech.
The second stage used large lists of hand-written disambiguation rules to narrow this list down to a single part of speech for each word.
41. The ENGTWOL [Voutilainen Atro, 1995] tagger is based on the same two-stage architecture, although both the lexicon and the disambiguation rules are much more sophisticated than in the early algorithms.
The ENGTWOL lexicon is based on two-level morphology.
It has about 56,000 entries for English word stems, counting a word with multiple parts of speech (e.g. nominal and verbal senses of hit) as separate entries, and of course not counting inflected and many derived forms.
Each entry is annotated with a set of morphological and syntactic features. In the first stage of the tagger, each word is run through the two-level lexicon transducer and the entries for all possible parts of speech are returned.
43. Stochastic POS tagging
A stochastic approach makes use of frequencies, probabilities or statistics. The simplest stochastic approach finds the most frequently used tag for a specific word in the annotated training data and uses this information to tag that word in unannotated text.
The problem with this approach is that it can produce sequences of tags for sentences that are not acceptable according to the grammar rules of a language.
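NLTK's UnigramTagger implements exactly this most-frequent-tag baseline; a minimal sketch, assuming nltk and its Penn Treebank sample are installed:

import nltk
from nltk.corpus import treebank

# nltk.download("treebank")  # one-time download of the tagged sample
tagged_sents = treebank.tagged_sents()
train, test = tagged_sents[:3000], tagged_sents[3000:]

tagger = nltk.UnigramTagger(train)        # most frequent tag per word in the training data
print(tagger.tag(["the", "cat", "sat"]))  # words unseen in training receive the tag None
print(tagger.evaluate(test))              # per-word accuracy of the baseline on held-out data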
44. An alternative to the word frequency approach is the n-gram approach, which calculates the probability of a given sequence of tags.
It determines the best tag for a word by calculating the probability that it occurs with the n previous tags, where the value of n is set to 1, 2 or 3 for practical purposes. These are known as the unigram, bigram and trigram models.
The most common algorithm for implementing an n-gram approach for tagging new text is the Viterbi algorithm, a search algorithm that avoids the polynomial expansion of a breadth-first search by trimming the search tree at each level, using the best m Maximum Likelihood Estimates (MLE), where m represents the number of tags of the following word.
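A compact sketch of Viterbi decoding for a bigram HMM tagger follows; the tiny hand-made transition and emission tables are illustrative, not trained from a corpus:

# Viterbi decoding for a bigram HMM: states are tags, observations are words.
tags = ["DET", "NOUN", "VERB"]
trans = {("<s>", "DET"): 0.6, ("<s>", "NOUN"): 0.3, ("<s>", "VERB"): 0.1,
         ("DET", "NOUN"): 0.9, ("DET", "DET"): 0.05, ("DET", "VERB"): 0.05,
         ("NOUN", "VERB"): 0.6, ("NOUN", "NOUN"): 0.3, ("NOUN", "DET"): 0.1,
         ("VERB", "DET"): 0.5, ("VERB", "NOUN"): 0.4, ("VERB", "VERB"): 0.1}
emit = {("DET", "the"): 0.7, ("NOUN", "dog"): 0.4,
        ("VERB", "barks"): 0.3, ("NOUN", "barks"): 0.01}

def viterbi(words):
    # best[t] = (probability of the best tag path ending in tag t, that path)
    best = {t: (trans.get(("<s>", t), 0) * emit.get((t, words[0]), 0), [t])
            for t in tags}
    for w in words[1:]:
        # Keep, for each tag, only the highest-probability extension of a kept path.
        best = {t: max(((p * trans.get((prev, t), 0) * emit.get((t, w), 0),
                         path + [t])
                        for prev, (p, path) in best.items()),
                       key=lambda x: x[0])
                for t in tags}
    return max(best.values(), key=lambda x: x[0])

print(viterbi(["the", "dog", "barks"]))
# (about 0.0272, ['DET', 'NOUN', 'VERB'])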
45. Advantages of the statistical approach
• Very robust: can process any input string
• Training is automatic and very fast
• Can be retrained for different corpora/tagsets without much effort
• Language independent
• Minimizes human effort and human error
http://www.comp.leeds.ac.uk/roger/HiddenMarkovModels/html_dev/viterbi_algorithm/s1_pg1.html
46. Apart from these, quite a few other approaches to tagging have been developed.
Support Vector Machines: a powerful machine learning method used for various applications in NLP and other areas like bioinformatics, data mining, etc.
Neural Networks: potential candidates for the classification task, since they learn abstractions from examples [Schmid H, 1994].
Decision Trees: a decision tree is a decision support tool that uses a tree-like graph; it is one way to represent an algorithm.
47. Decision trees are classification devices based on hierarchical clusters of questions. They have been used for natural language processing tasks such as POS tagging [Schmid H, 1994].
The software "Weka" can be used to classify the ambiguous words.
48. Maximum Entropy Models: These avoid certain
problems of statistical interdependence and have
proven successful for tasks such as parsing and
POS tagging.
Example-Based Techniques: These techniques find
the training instance that is most similar to the
current problem instance and assume the same
class for the new problem instance as for the
similar one.
49. Freely downloadable part-of-speech taggers for English and other languages
Stanford POS tagger
A log-linear tagger in Java (by Kristina Toutanova).
hunpos
An HMM tagger with models available for English and Hungarian; a reimplementation of TnT in OCaml, with pre-compiled models. Runs on Linux, Mac OS X, and Windows.
MBT: Memory-based Tagger
Based on TiMBL.
TreeTagger
http://nlp.stanford.edu/links/statnlp.html
50. • A decision tree based tagger from the University of Stuttgart. It is language independent, but comes complete with parameter files for English, German, Italian, Dutch, French, Old French, Spanish, Bulgarian, and Russian. (Linux, Sparc-Solaris, Windows, and Mac OS X versions; binary distribution only.) The page has links to sites where one can run it online.
51. SVMTool
A POS tagger based on SVMs (uses SVMlight). LGPL.
ACOPOST (formerly ICOPOST)
Open-source C taggers originally written by Ingo Schröder. Implements maximum entropy, HMM trigram, and transformation-based learning. C source available under the GNU public license.
MXPOST
Adwait Ratnaparkhi's maximum entropy part-of-speech tagger, a Java POS tagger.
A sentence boundary detector (MXTERMINATOR) is also included. The original version was only for JDK 1.1; a later version works with JDK 1.3+. Class files, not source.
52. fnTBL
A fast and flexible implementation of Transformation-Based Learning in C++. Includes a POS tagger, but also NP chunking and general chunking models.
mu-TBL
An implementation of a Transformation-Based Learner (a la Brill) by Torbjörn Lager, usable for POS tagging and other things. A web demo is also available.
YamCha
An SVM-based NP chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won the CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)
53. QTAG part-of-speech tagger
An HMM-based Java POS tagger from Birmingham U. (Oliver Mason). English and German parameter files. [Java class files, not source.]
The TOSCA/LOB tagger
Currently available for MS-DOS only, but the decision to make this famous system available is very interesting from a historical perspective, and for software sharing in academia more generally. LOB tag set.
Brill's transformation-based learning tagger
A symbolic tagger, written in C. It is no longer available from a canonical location, but one may find a version via the Wikipedia page, or one can try a reimplementation such as fnTBL.
54. • Original Xerox Tagger
A Common Lisp HMM tagger available by FTP.
Lingua-EN-Tagger
A Perl POS tagger by Maciej Ceglowski and Aaron Coburn. Version 0.11. (A bigram HMM tagger.)
55. Development of POS Annotated Corpora
Corpus linguistics seeks to further the understanding of language through the analysis of large quantities of naturally occurring data.
Text corpora are used in a number of different ways.
Traditionally, corpora have been used for the study and analysis of language at different levels of linguistic description.
Corpora have been constructed for the specific purpose of acquiring knowledge for information extraction systems, knowledge-based systems and e-business systems.
Corpora have also been used for studying child language development, and speech corpora play a vital role in the specification, design and implementation of telephonic communication systems and for the broadcast media.
56. There is a long tradition of corpus linguistic studies in Europe. The needs that a corpus serves for a language are multifarious.
Starting from the preparation of a dictionary or lexicon through to machine translation, the corpus has become an inevitable resource for the technological development of languages.
A corpus is a huge body of text incorporating various types of textual material, including newspapers, weeklies, fiction, scientific writing, literary writing, and so on.
A corpus represents all the styles of a language. A corpus must be very large, as it is going to be used for many language applications, such as the preparation of lexicons of different sizes, purposes and types, NLP tools, machine translation programs and so on.
57. Corpora can be distinguished as tagged corpora, parallel
corpora and aligned corpora.
A tagged corpus is one annotated for part-of-speech,
morphology, lemma, phrases, etc.
A parallel corpus contains texts and their translations in each
of the languages involved. It allows wider scope for double-
checking translation equivalents.
An aligned corpus is a kind of bilingual corpus where text
samples of one language and their translations into another
language are aligned, sentence by sentence, phrase by
phrase, word by word, or even character by character.
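To make these distinctions concrete, here is a minimal, hypothetical
Python sketch of how a tagged corpus and a sentence-aligned parallel
corpus might be represented in memory; the sentences, tags and
translations are illustrative only, not drawn from any actual corpus.

# Hypothetical in-memory representations of two of the corpus types above.

# Tagged corpus: each sentence is a list of (word, POS-tag) pairs.
tagged_corpus = [
    [("pUkkaLai", "NN"), ("paRiththAn", "VF")],   # illustrative Tamil sentence
]

# Sentence-aligned parallel corpus: (source, target) sentence pairs.
aligned_corpus = [
    ("avan pUkkaLai paRiththAn", "he plucked the flowers"),  # illustrative pair
]

for src, tgt in aligned_corpus:
    print(src, "<->", tgt)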
58. Applications of POS tagged corpus
A POS-tagged corpus is used in the following tasks:
– Chunking
– Parsing
– Information extraction and retrieval
– Tree bank creation
– Document classification
– Question answering
59. Applications of POS tagged corpus cont…
– Automatic dialogue system
– Speech processing
– Summarization
– Statistical training of Language models
– Machine Translation using multilingual corpora
– Text checkers for evaluating spelling and grammar
– Computer Lexicography
– Educational applications like Computer-Assisted
Language Learning
60. Complexity in Dravidian POS tagging
As the Dravidian languages are agglutinative, nouns are
inflected for number and case, while verbs are inflected for
tense, person, number and gender.
Verbs are adjectivalized and adverbialized, and verbs
and adjectives are nominalized by means of certain
nominalizers. Adjectives and adverbs do not inflect.
Many postpositions in Tamil [Arden 1942; Rajendran S,
2007] derive from nominal and verbal sources, so one
often has to depend on syntactic function or context to
decide whether a given word is a noun, adjective, adverb
or postposition.
61. This makes POS tagging of Tamil complex.
Root ambiguity
A root word can be ambiguous: it can have more than one
sense, and sometimes a root belongs to more than one POS
category.
Though the POS can often be disambiguated using contextual
information such as co-occurring morphemes, this is not
always possible.
These issues should be taken care of when POS taggers are
built for the Tamil language.
For example, Tamil root words like adi, padi, isai, mudi and
kudi can take both the noun and the verb category, which
leads to the root ambiguity problem in POS tagging.
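As a toy illustration of root ambiguity, the Python sketch below keeps
a small lexicon of ambiguous roots and resolves them with a naive
contextual rule; the lexicon, the suffix set and the rule are all
invented for illustration and do not constitute a real Tamil tagger.

# Toy illustration of root ambiguity; all data and rules are invented.
AMBIGUOUS_ROOTS = {
    "adi": {"NN", "VB"}, "padi": {"NN", "VB"}, "isai": {"NN", "VB"},
    "mudi": {"NN", "VB"}, "kudi": {"NN", "VB"},
}

def candidate_tags(root):
    """Return every POS category the root may take."""
    return AMBIGUOUS_ROOTS.get(root, {"NN"})

def disambiguate(root, next_morpheme):
    """Naive rule: a tense-like morpheme after the root suggests a verb."""
    tags = candidate_tags(root)
    if "VB" in tags and next_morpheme in {"thth", "nth", "kkir"}:  # toy tense markers
        return "VB"
    return "NN" if "NN" in tags else tags.pop()

print(disambiguate("adi", "thth"))  # VB: verbal reading in this context
print(disambiguate("adi", None))    # NN: default nominal reading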
62. Noun complexity
Nouns are words which denote a person, place, thing,
time, etc. In Tamil, nouns are inflected for number and
case at the morphological level.
Morphological-level inflection:
Noun (+ number) (+ case)
Example: pUk-kaL-ai <NN>
flower-plural-accusative case suffix
Noun (+ number) (+ oblique) (+ euphonic) (+ case)
Example: pUk-kaL-in-Al <NN>
flower-plural-oblique/euphonic suffix-instrumental case suffix
(a minimal segmentation sketch follows this slide).
Nouns further need to be annotated as common noun,
compound noun, proper noun, compound proper noun,
pronoun, cardinal and ordinal.
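The noun templates above invite a simple right-to-left suffix-stripping
segmentation. The Python sketch below works only with tiny, invented
suffix inventories; a real analyzer would need a full suffix lexicon
and sandhi handling.

# Minimal suffix-stripping sketch for the templates
# Noun (+ number) (+ oblique/euphonic) (+ case); suffix lists are illustrative.
PLURAL  = ["kaL"]
OBLIQUE = ["in"]                     # oblique/euphonic increment
CASE    = ["ai", "Al", "ku", "il"]

def strip_one(word, suffixes, label):
    for s in suffixes:
        if word.endswith(s) and len(word) > len(s):
            return word[: -len(s)], (s, label)
    return word, None

def segment_noun(word):
    """Peel suffixes off right to left: case, then oblique, then plural."""
    parts = []
    for suffixes, label in [(CASE, "case"), (OBLIQUE, "oblique"), (PLURAL, "plural")]:
        word, found = strip_one(word, suffixes, label)
        if found:
            parts.append(found)
    return word, list(reversed(parts))

print(segment_noun("pUkkaLai"))    # ('pUk', [('kaL', 'plural'), ('ai', 'case')])
print(segment_noun("pUkkaLinAl"))  # ('pUk', [('kaL', 'plural'), ('in', 'oblique'), ('Al', 'case')])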
63. Pronouns need to be further annotated for personal
pronouns.
Complexity arises between common noun and compound
noun, and also between proper noun and compound proper
noun. A common noun can also occur as a compound noun,
for example
UrAdci <NNC> thalaivar <NNC>
When UrAdci and thalaivar come together they form a
compound noun (<NNC>), but when UrAdci and thalaivar
occur separately in a sentence each should be tagged as a
common noun (<NN>). Such complexity also occurs with the
proper noun (<NNP>) and compound proper noun (<NNPC>).
Moreover, complexity arises between noun and adverb, and
between pronoun and emphasis, at the syntactic level.
64. Verb complexity
Verbal forms are complex in Tamil. A finite verb
shows the following morphological structure:
Verb stem + tense + person-number-gender
Example: nada + nth + En <VF>
'I walked'
(a toy generation sketch of this template follows the next slide).
A number of non-finite forms are also possible: adverbial
forms, adjectival forms, infinitive forms and conditional forms.
Verb stem + adverbial participle
Example: cey + thu = ceythu <VNAV>
'having done'
65. Verb stem + relative participle
Example: cey + tha = ceytha <VNAJ>
'who did'
Verb stem + infinitive suffix
Example: azu + a = aza <VINT>
'to weep'
Verb stem + conditional suffix
Example: kEL + d + Al = kEddAl <CVB>
'if asked'
A distinction needs to be made between a main verb followed
by another main verb and a main verb followed by an auxiliary
verb.
A main verb followed by an auxiliary verb needs to be
interpreted together with it, whereas a main verb followed by
a main verb needs to be interpreted separately. This leads to
functional ambiguity.
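As a concrete illustration of the finite-verb template above
(verb stem + tense + person-number-gender suffix), here is a minimal
Python generation sketch; the suffix tables are tiny and invented, and
the morphophonemic (sandhi) changes that occur at real morpheme
boundaries are ignored.

# Toy generator for the finite-verb template: stem + tense + PNG suffix.
# Suffix tables are illustrative only; sandhi rules are ignored.
TENSE = {"past": "nth", "present": "kkir", "future": "pp"}
PNG   = {"1sg": "En", "3sg.m": "An", "3sg.f": "AL"}

def finite_verb(stem, tense, png):
    """Concatenate stem + tense suffix + PNG suffix (no sandhi)."""
    return stem + TENSE[tense] + PNG[png]

print(finite_verb("nada", "past", "1sg"))    # nadanthEn  'I walked'
print(finite_verb("nada", "past", "3sg.m"))  # nadanthAn  'he walked'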
66. Developing part-of-speech taggers for
Indian languages
For Bengali, Sandipan et al. (2007) developed a
corpus-based semi-supervised learning algorithm for POS
tagging based on HMMs.
Their system uses a small tagged corpus (500 sentences) and a
large unannotated corpus, along with a Bengali morphological
analyzer. When tested on a corpus of 100 sentences (1,003
words), their system obtained an accuracy of 95%.
67. Smriti Singh et al. (2006) proposed a tagger for Hindi that
uses the affix information stored in a word and assigns a
POS tag using no contextual information. By considering
the previous and the next word in the verb group (VG), it
correctly identifies the main verb and the auxiliaries.
Lexicon lookup was used for identifying the other POS
categories.
In the NLPAI ML contest, Dalal et al. (2006) achieved
accuracies of 82.22% and 82.4% for Hindi POS tagging and
chunking respectively, using maximum entropy models.
Karthik et al. (2006) obtained 81.59% accuracy for Telugu POS
tagging using HMMs.
Sivaji et al. (2006) presented a rule-based chunker for Bengali
which gave an accuracy of 81.64%. The training data for all
three languages contained approximately 20,000 words and the
testing data approximately 5,000 words.
68. For Telugu, three POS taggers have been proposed using
different POS tagging approaches, viz. (1) a rule-based
approach, (2) the transformation-based learning (TBL)
approach of Eric Brill, and (3) a maximum entropy model, a
machine learning technique [Ramasree, R.J. and Kusuma
Kumari, P., 2007].
A Hidden Markov Model (HMM) based tagger for Hindi was
proposed by Manish Shrivastava and Pushpak Bhattacharyya
(2008). The authors attempted to utilize the morphological
richness of the language without resorting to complex and
expensive analysis. The core idea of their approach was to
'explode' the input in order to increase the length of the input
and to reduce the number of unique types encountered during
learning. This in turn increases the probability score of the
correct choice while simultaneously decreasing the ambiguity
of the choices at each stage.
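A rough sketch of this 'exploding' idea is given below; the suffix list
and the splitting rule are invented for illustration and are not the
actual method of Shrivastava and Bhattacharyya (2008). The point is
only that splitting words into stem and suffix tokens shrinks the set
of unique types the learner must estimate.

# Hedged sketch of "exploding" input for a morphologically rich language.
SUFFIXES = ["kaLai", "kaL", "ai", "Al", "il"]   # illustrative suffix list

def explode(word):
    """Split off a known suffix so stem and suffix become separate tokens."""
    for s in SUFFIXES:
        if word.endswith(s) and len(word) > len(s):
            return [word[: -len(s)], "+" + s]
    return [word]

sentence = ["avan", "pUkkaLai", "paRiththAn"]
exploded = [token for word in sentence for token in explode(word)]
print(exploded)  # ['avan', 'pUk', '+kaLai', 'paRiththAn']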
69. A stochastic Hidden Markov Model (HMM) based part-of-
speech tagger has been proposed for Malayalam. To perform
POS tagging using a stochastic approach, an annotated
corpus is needed. Due to the non-availability of an annotated
corpus, a morphological analyzer was also developed to
generate a tagged corpus from the training set [Manju K. et
al., 2009].
Various methodologies have been developed for POS tagging
of Tamil. A rule-based POS tagger for Tamil was
developed and tested [Arulmozhi et al., 2004]. This system
gives only the major tags; the sub-tags are overlooked
during evaluation. A hybrid POS tagger for Tamil using the
HMM technique together with a rule-based system was also
developed [Arulmozhi P and Sobha L, 2006].
70. Lakshmana Pandian S and Geetha T V (2008) developed a
morpheme-based language model for Tamil part-of-speech
tagging. The language model categorizes a word's part of
speech based on information about its stem type, its last
morpheme and the morpheme before the last. For estimating
the contribution factors of the model, they followed the
generalized iterative scaling technique.
Dhanalakshmi et al. (2008) proposed an SVM-based tagger
using linear programming and developed their own POS tagset
for Tamil, which has 32 tags. They used this tagset to annotate
their corpus, trained their model, and reported an accuracy of
95.63%. Dhanalakshmi et al. (2009) also proposed another
tagger in which machine learning techniques were used to
extract linguistic information, which was then used to train an
SVM-based tagger. They used their own 32-tag tagset for
annotating the corpus and reported an accuracy of 95.64%.
71. Considerable effort has also been put into developing POS
taggers for other Indian languages. For Malayalam, an
HMM-based tagger was proposed by Manju et al.; since
they did not have an annotated corpus, they used a
morphological analyzer to generate the corpus, which was
then used for training the HMM algorithm. Another tagger
for Malayalam was developed by Anthony et al. [2009], who
used Support Vector Machines (SVMs). For tagging they used
SVMTool, which was developed by Giménez and Màrquez.
For developing this tagger, Anthony et al. first proposed a
tagset which they claim is suitable for Malayalam, and then
created an annotated corpus using this tagset. Their tagger
reported 94% accuracy with this tagset.
72. Word Sense Disambiguation
• Word sense disambiguation (WSD) is the ability to
identify the meaning of words in context in a
computational manner. WSD is considered an AI-
complete problem, that is, a task whose solution is at least
as hard as the most difficult problems in artificial
intelligence.
A striking feature of Natural Language is that many words
and sentences have more than one meaning (i.e. are
semantically ambiguous), and which meaning is correct
depends on the context. This problem arises at several
levels.
73. There are problems at the level of individual words. Consider
this example:
The man went to the bank.
What kind of 'bank'? A river bank, a source of money, or a
blood bank? Here we have three distinct English words with
the same spelling and pronunciation.
Word sense disambiguation (WSD) is the problem of
determining in which sense a word having a number of distinct
senses is used in a given sentence. So WSD is the task of
removing the ambiguity of a word in context.
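A classical baseline for WSD is the simplified Lesk algorithm: choose
the sense whose dictionary gloss shares the most words with the
sentence containing the ambiguous word. Below is a minimal Python
sketch; the glosses for 'bank' are invented for illustration rather
than taken from a real dictionary.

# Simplified Lesk-style WSD sketch; glosses are invented for illustration.
SENSES = {
    "bank/finance": "an institution where people deposit and borrow money",
    "bank/river":   "the sloping land alongside a river or stream",
    "bank/blood":   "a place where blood is stored for transfusion",
}

def lesk(context_sentence):
    """Pick the sense whose gloss overlaps most with the context words."""
    context = set(context_sentence.lower().split())
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES.items():
        overlap = len(context & set(gloss.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(lesk("he sat on the bank of the river and watched the stream"))
# -> bank/river ('the', 'river' and 'stream' overlap with its gloss)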