1. Zipf’s Law
Dr. Babasaheb Ambedkar Technological University , Lonere,
Mangaon Dist. Raigad (M.S.) -402103
by
Mayur K. Pakhale
( Roll No. 20170783 )
2. Introduction
• In Natural Language Processing, Zipf's law is a law about the
frequency distribution of words in a language.
• It applies to a collection of text that is large enough to be
representative of the language.
• Zipf's law is an empirical law formulated using mathematical
statistics that refers to the fact that many types of data studied in
the physical and social sciences can be approximated with a Zipfian
distribution.
3. • Zipf's law was originally formulated in terms of quantitative
linguistics, stating that, given some corpus of natural language
utterances, the frequency of any word is inversely proportional to its
rank in the frequency table.
• The rank-frequency distribution is therefore an inverse relation.
4. • Count the frequency of each word type in a large corpus.
• List the word types in decreasing order of their frequency.
• Zipf’s Law:
• A relationship between the frequency of a word (f) and its
position in the list (its rank r).
f ∝ 1/r
or, equivalently, there is a constant k such that f · r = k
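
The counting and ranking steps above can be sketched in Python. This is a minimal illustration, not the author's code: the function name `rank_frequency`, the simple regular-expression tokenizer, and the toy sentence are all assumptions made for the example, and a text this small is far too short to show Zipf's law itself.

```python
import re
from collections import Counter

def rank_frequency(text):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    rows = []
    # most_common() lists word types in decreasing order of frequency.
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, freq, rank * freq))
    return rows

# Toy text for illustration only.
sample = "the cat and the dog and the bird saw the cat"
for row in rank_frequency(sample):
    print(row)
```

On a large corpus, the last column (f · r) is the quantity that Zipf's law predicts to stay roughly constant across ranks.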
5. Let us take the text of “Tom Sawyer” from the nltk package of Python as the corpus.
• By Zipf's law, the 50th most common word should occur with 3 times the
frequency of the 150th most common word.
• Let p_r denote the probability of the word of rank r, and let N denote
the total number of word occurrences.
• p_r = f / N = A / r
• The value of A is found to be close to 0.1 for the corpus.
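
The relation p_r = A / r can be checked with a few lines of Python. This is only a sketch: the function names `zipf_probability` and `estimate_A` are hypothetical, and the frequency list is made-up data that follows f · r = 600 exactly, not counts from any real corpus.

```python
def zipf_probability(rank, A=0.1):
    """Predicted probability of the word at a given rank under p_r = A / r."""
    return A / rank

def estimate_A(freqs_by_rank):
    """Average of p_r * r over all ranks; freqs_by_rank[0] is the rank-1 count."""
    N = sum(freqs_by_rank)  # total number of word occurrences
    products = [(f / N) * r for r, f in enumerate(freqs_by_rank, start=1)]
    return sum(products) / len(products)

# Rank 50 versus rank 150: the ratio of predicted probabilities is 150 / 50.
ratio = zipf_probability(50) / zipf_probability(150)
print(round(ratio, 6))

# Made-up counts for five word types (hypothetical data).
toy_counts = [600, 300, 200, 150, 120]
print(estimate_A(toy_counts))
```

The estimate from the toy counts is much larger than 0.1 because the toy vocabulary has only five word types; the value of A depends on the size and nature of the corpus.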
8. • Correlation: number of meanings and word frequency.
The number of meanings m of a word obeys the law:
m ∝ √f
Given Zipf's law (f ∝ 1/r), this becomes
m ∝ 1/√r
• Empirical Support
• Rank ≈ 10000, average 2.1 meanings.
• Rank ≈ 5000, average 3 meanings.
• Rank ≈ 2000, average 4.6 meanings
Zipf’s Other Law
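
The empirical figures on the slide above can be checked against m ∝ 1/√r with a short calculation. If the law holds, the product m · √r should be about the same at every rank. The function name `meaning_constants` is a hypothetical helper introduced for this sketch; the three data points are the ones quoted on the slide.

```python
import math

def meaning_constants(observations):
    """Return m * sqrt(r) for each (rank, average meanings) pair."""
    return [meanings * math.sqrt(rank) for rank, meanings in observations]

# (rank, average number of meanings) pairs quoted on the slide
observations = [(10000, 2.1), (5000, 3.0), (2000, 4.6)]

for (rank, meanings), c in zip(observations, meaning_constants(observations)):
    print(rank, meanings, round(c, 1))
```

The three products come out at roughly 210, 212 and 206, which is close to constant and so consistent with the stated relation.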
9. • Correlation: word length and word frequency.
The frequency of a word is inversely proportional to its length.
• The good part:
Stopwords account for a large fraction of text, so eliminating them greatly
reduces the number of tokens in a text.
• The bad part:
Most words are extremely rare, so gathering sufficient data for meaningful
statistical analysis is difficult for most of them.
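
Both consequences can be measured on any token list. The sketch below is an illustration under stated assumptions: `STOPWORDS` is a tiny hand-picked set (real stopword lists are much longer), `token_stats` is a hypothetical helper, and the sentence is toy data.

```python
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "it"}

def token_stats(tokens):
    """Share of tokens that are stopwords, and share of word types seen only once."""
    counts = Counter(tokens)
    stop_tokens = sum(c for w, c in counts.items() if w in STOPWORDS)
    once_only = sum(1 for c in counts.values() if c == 1)
    return {
        "stopword_share": stop_tokens / len(tokens),
        "once_only_share": once_only / len(counts),
    }

tokens = "the cat sat in the hat and the dog ran to the log".split()
print(token_stats(tokens))
```

In this toy sentence, 7 of the 13 tokens are stopwords, and 9 of the 10 distinct word types occur only once.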