Zipf’s Law
Dr. Babasaheb Ambedkar Technological University , Lonere,
Mangaon Dist. Raigad (M.S.) -402103
by
Mayur K. Pakhale
( Roll No. 20170783 )
Introduction
• In natural language processing, Zipf's law describes the frequency
distribution of words in a language, or in a collection that is large
enough to be representative of that language.
• Zipf's law is an empirical law formulated using mathematical
statistics. It refers to the fact that many types of data studied in
the physical and social sciences can be approximated with a Zipfian
distribution.
• Zipf's law was originally formulated in terms of quantitative
linguistics: given some corpus of natural language utterances, the
frequency of any word is inversely proportional to its rank in
the frequency table.
• The rank-frequency distribution is therefore an inverse relation.
Zipf’s Law
• Count the frequency of each word type in a large corpus.
• List the word types in decreasing order of their frequency.
• Zipf’s law states a relationship between the frequency of a word (f)
and its position in the list (its rank r):
f ∝ 1/r
or, equivalently, there is a constant k such that f · r = k
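The counting and ranking steps above can be sketched in Python as follows. This is a minimal illustration, not the author's code: the function name `rank_frequency_table`, the simple lowercase regex tokenizer, and the toy sentence are all assumptions. On a text this small the product f · r will not be constant; the relation is only expected to appear on a large corpus.

```python
import re
from collections import Counter


def rank_frequency_table(text):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    # Assumption: a simple lowercase regex tokenizer; real corpora may need more care.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    rows = []
    # most_common() lists word types in decreasing order of frequency.
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, freq, rank * freq))
    return rows


# Toy text for illustration only.
sample = "the cat and the dog and the bird saw the cat"
table = rank_frequency_table(sample)
for rank, word, freq, k in table[:3]:
    print(rank, word, freq, k)
```

The last column is the product f · r, which Zipf's law predicts to be roughly the same constant k for every row of a sufficiently large corpus.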
Let us take the “Tom Sawyer” corpus from the nltk package of Python.
• Zipf's law predicts, for example, that the 50th most common word should
occur with 3 times the frequency of the 150th most common word.
• Let p_r denote the probability of the word of rank r, and N the total
number of word occurrences.
• p_r = f / N = A / r
• The value of A is found to be close to 0.1 for the corpus.
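The relation p_r = A / r can be turned into a small calculation. The sketch below assumes A = 0.1, the value quoted above; the function names and the corpus size N are hypothetical and chosen only for illustration. The helper `estimate_A` uses a plain average of p_r · r, which is a rough check rather than a proper statistical fit.

```python
def zipf_probability(rank, A=0.1):
    """Probability of the word at a given rank under p_r = A / r."""
    return A / rank


def zipf_expected_count(rank, total_tokens, A=0.1):
    """Expected frequency f = p_r * N for the word at a given rank."""
    return zipf_probability(rank, A) * total_tokens


def estimate_A(frequencies):
    """Average of p_r * r over frequencies sorted in decreasing order."""
    total = sum(frequencies)
    products = [(f / total) * r for r, f in enumerate(frequencies, start=1)]
    return sum(products) / len(products)


# Hypothetical corpus size, chosen only to make the arithmetic easy to read.
N = 100_000
ratio = zipf_expected_count(50, N) / zipf_expected_count(150, N)
# Ratio of the predicted rank-50 frequency to the predicted rank-150 frequency.
print(round(ratio, 6))
```

Because A and N cancel in the ratio, the predicted factor of 3 between ranks 50 and 150 does not depend on the particular corpus.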
Empirical Evaluation from Tom Sawyer
Observation
Zipf’s Other Law
• Correlation: Number of meanings and word frequency.
The number of meanings m of a word obeys the law:
m ∝ √f
Given Zipf’s law (f ∝ 1/r), this becomes
m ∝ 1/√r
• Empirical Support
• Rank ≈ 10000, average 2.1 meanings.
• Rank ≈ 5000, average 3 meanings.
• Rank ≈ 2000, average 4.6 meanings.
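If the number of meanings falls off as the inverse square root of the rank, then the product m · √r should be roughly the same at every rank. The following sketch checks this on the three data points quoted above; the variable names are illustrative assumptions.

```python
import math

# (rank, average number of meanings) pairs quoted in the slide above.
observations = [(10000, 2.1), (5000, 3.0), (2000, 4.6)]

# Under m proportional to 1/sqrt(r), the product m * sqrt(r) should be roughly constant.
products = [m * math.sqrt(r) for r, m in observations]
for (r, m), c in zip(observations, products):
    print(f"rank={r:>5}  meanings={m}  m*sqrt(r)={c:.1f}")

# Relative gap between the largest and smallest product.
spread = (max(products) - min(products)) / min(products)
print(f"relative spread: {spread:.1%}")
```

The three products land within a few percent of each other, which is what the inverse-square-root relation predicts.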
Zipf’s Other Law
• Correlation: Word length and word frequency.
The frequency of a word is inversely proportional to its length.
• The Good part :
Stopwords account for a large fraction of text, so eliminating them greatly
reduces the number of tokens in a text.
• The Bad part :
Most words are extremely rare, so gathering sufficient data for meaningful
statistical analysis is difficult for most words.
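Both consequences can be measured directly on a token list. The sketch below is an illustration under stated assumptions: it treats the k most frequent word types as a stand-in for a stopword list, treats words that occur exactly once as the "extremely rare" words, and uses a toy sentence; the function names are hypothetical.

```python
from collections import Counter


def top_k_coverage(tokens, k):
    """Fraction of all tokens accounted for by the k most frequent word types."""
    counts = Counter(tokens)
    covered = sum(freq for _, freq in counts.most_common(k))
    return covered / len(tokens)


def singleton_fraction(tokens):
    """Fraction of word types that occur exactly once."""
    counts = Counter(tokens)
    singletons = sum(1 for freq in counts.values() if freq == 1)
    return singletons / len(counts)


# Toy token list for illustration only; a real study would use a large corpus.
tokens = "the cat and the dog and the bird saw the cat".split()
print(top_k_coverage(tokens, 2))
print(singleton_fraction(tokens))
```

On a large corpus, a high `top_k_coverage` for small k corresponds to the good part (few words cover much of the text), and a high `singleton_fraction` corresponds to the bad part (many words have too little data).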
Thank You!

Zipf's Law and its Applications in NLP
