Statistical Tools for Linguists

        Cohan Sujay Carlos
           Aiaioo Labs
           Bangalore
Text Analysis and Statistical Methods

 • Motivation
 • Statistics and Probabilities
 • Application to Corpus Linguistics
Motivation
• Human Development is all about Tools
  – Describe the world
  – Explain the world
  – Solve problems in the world
• Some of these tools
  – Language
  – Algorithms
  – Statistics and Probabilities
Motivation – Algorithms for Education Policy

 • 300 to 400 million people are illiterate
 • If we took 1000 teachers, 100 students per
   class, and 3 years of teaching per student

   –12000 years
 • If we had 100,000 teachers

   –120 years
Motivation – Algorithms for Education Policy

 • 300 to 400 million people are illiterate
 • If we took 1 teacher, 10 students per class,
   and 3 years of teaching per student.
 • Then each student teaches 10 more students.

   – about 30 years
 • We could turn the whole world literate in

   – about 34 years
Motivation – Algorithms for Education Policy


 Difference:

 Policy 1 is O(n) time
 Policy 2 is O(log n) time
Motivation – Statistics for Linguists

 We have shown that, using a tool from computer
 science, we can solve a problem in quite another area.

                   SIMILARLY

 Linguists will find statistics to be a handy tool
 to better understand languages.
Applications of Statistics to Linguistics


     • How can statistics be useful?
     • Can probabilities be useful?
Introduction to Aiaioo Labs
• Focus on Text Analysis, NLP, ML, AI
• Applications to business problems
• Team consists of
  – Researchers
     • Cohan
     • Madhulika
     • Sumukh
  – Linguists
  – Engineers
  – Marketing
Applications to Corpus Linguistics
•   What to annotate
•   How to develop insights
•   How to annotate
•   How much data to annotate
•   How to avoid mistakes in using the corpus
Approach to corpus construction
• The problem: ‘word semantics’
• What is better?
  – Wordnet
  – Google terabyte corpus (with annotations?)
Approach to corpus construction
• The problem: ‘word semantics’
• What is better?
  – Wordnet (set of rules about the real world)
  – Google terabyte corpus (real world)
Approach to corpus construction
• The problem: ‘word semantics’
• What is better?
    – Wordnet (not countable)
    – Google terabyte corpus (countable)



For training machine learning algorithms, the latter might be more valuable,
simply because it is possible to tally up evidence in such a corpus.

Of course I am simplifying things a lot, and I don’t mean that the former is not
valuable at all.
Approach to corpus construction
 So if you are constructing a corpus on
 which machine learning methods might
 be applied, construct your corpus so that
 you retain as many examples of surface
 forms as possible.
Applications to Corpus Linguistics
•   What to annotate
•   How to develop insights
•   How to annotate
•   How much data to annotate
•   How to avoid mistakes in using the corpus
Problem : Spelling

1.   Field
2.   Wield
3.   Shield
4.   Deceive
5.   Receive
6.   Ceiling

                       Courtesy of http://norvig.com/chomsky.html
Rule-based Approach


    “I before E except after C”

-- an example of a linguistic insight




                 Courtesy of http://norvig.com/chomsky.html
Probabilistic Statistical Model:
• Count the occurrences of ‘ie’ and ‘ei’ and ‘cie’
  and ‘cei’ in a large corpus

P(IE) = 0.0177
P(EI) = 0.0046
P(CIE) = 0.0014
P(CEI) = 0.0005

                         Courtesy of http://norvig.com/chomsky.html
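
A minimal sketch of how such counts could be gathered; the file `corpus.txt` and the normalization (lowercasing, sliding character windows) are assumptions for illustration, and the exact probabilities will depend on the corpus used:

```python
# Count letter-sequence occurrences and estimate their relative frequencies.
from collections import Counter

with open("corpus.txt", encoding="utf-8") as f:   # hypothetical corpus file
    text = f.read().lower()

bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
trigrams = Counter(text[i:i + 3] for i in range(len(text) - 2))

print("P(IE)  =", bigrams["ie"] / sum(bigrams.values()))
print("P(EI)  =", bigrams["ei"] / sum(bigrams.values()))
print("P(CIE) =", trigrams["cie"] / sum(trigrams.values()))
print("P(CEI) =", trigrams["cei"] / sum(trigrams.values()))
```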
Words where ie occurs after c
•   science
•   society
•   ancient
•   species




                   Courtesy of http://norvig.com/chomsky.html
But you can go back to a Rule-based
             Approach


  “I before E except after C only if C is not
               preceded by an S”

    -- an example of a linguistic insight


                      Courtesy of http://norvig.com/chomsky.html
What is a probability?

• A number between 0 and 1
• The sum of the probabilities on all outcomes is 1

Heads                  Tails




• P(heads) = 0.5
• P(tails) = 0.5
Estimation of P(IE)



P(“IE”) = C(“IE”) / C(all two letter sequences in my corpus)
What is Estimation?



P(“UN”) = C(“UN”) / C(all words in my corpus)
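
The same kind of relative-frequency estimate for a word, as a minimal sketch; the corpus file and whitespace tokenization are assumptions:

```python
# Relative-frequency estimate of a word probability.
from collections import Counter

with open("corpus.txt", encoding="utf-8") as f:   # hypothetical corpus file
    words = f.read().split()                      # crude whitespace tokenization

counts = Counter(words)
print(counts["UN"] / len(words))                  # P("UN") = C("UN") / C(all words)
```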
Applications to Corpus Linguistics
•   What to annotate
•   How to develop insights
•   How to annotate
•   How much data to annotate
•   How to avoid mistakes in using the corpus
How do you annotate?
• The problem: ‘named entity classification’
• What is better?
  – Per, Org, Loc, Prod, Time
  – Right, Wrong
How do you annotate?
• The problem: ‘named entity classification’
• What is better?
  – Per, Org, Loc, Prod, Time
  – Right, Wrong



      It depends on whether you care about
      precision or recall or both.
What are Precision and Recall?



 Classification metrics used to compare ML
 algorithms.
Classification Metrics

 Politics: “The UN Security Council adopts its first clear condemnation of …”

 Sports: “Warwickshire's Clarke equalled the first-class record of seven …”

    How do you compare two ML algorithms?
Classification Quality Metrics
               Point of view = Politics

                            Gold - Politics       Gold - Sports

Observed - Politics   TP (True Positive)      FP (False Positive)


Observed - Sports     FN (False Negative)     TN (True Negative)
Classification Quality Metrics
               Point of view = Sports

                           Gold - Politics       Gold - Sports

Observed - Politics   TN (True Negative)     FN (False Negative)


Observed - Sports     FP (False Positive)    TP (True Positive)
Classification Quality Metric - Accuracy
                   Point of view = Sports

                               Gold - Politics      Gold – Sports

    Observed - Politics   TN (True Negative)     FN (False Negative)


    Observed - Sports     FP (False Positive)    TP (True Positive)

    Accuracy = (TP + TN) / (TP + TN + FP + FN)
Metrics for Measuring Classification Quality
                      Point of View – Class 1

                              Gold Class 1               Gold Class 2

   Observed Class 1     TP                          FP


   Observed Class 2     FN                          TN




                Great metrics for highly unbalanced corpora!
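
Precision and Recall can be read straight off such a confusion matrix; here is a minimal sketch of the standard definitions (the example counts are made up):

```python
def precision(tp, fp):
    """Of everything the classifier labelled positive, how much really was positive?"""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of everything that really was positive, how much did the classifier find?"""
    return tp / (tp + fn)

# Made-up counts from the Class 1 point of view:
tp, fp, fn = 40, 10, 20
print(precision(tp, fp))   # 0.8
print(recall(tp, fn))      # 0.666...
```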
Metrics for Measuring Classification Quality




  F-Score = the harmonic mean of Precision and Recall
F-Score Generalized


 F = 1 / ( α · (1/P) + (1 − α) · (1/R) )

 (with α = 0.5 this is the harmonic mean of Precision and Recall, i.e. the F1 score)
Precision, Recall, Average, F-Score

                 Precision          Recall         Average           F-Score

 Classifier 1   50%              50%             50%                50%


 Classifier 2   30%              70%             50%                42%


 Classifier 3   10%              90%             50%                18%




                 Which sort of classifier fares worst?
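
A quick check of the F-Score column, using the formula above with α = 0.5 (i.e. the plain harmonic mean):

```python
def f1(p, r):
    return 2 * p * r / (p + r)   # harmonic mean of precision and recall

print(round(f1(0.50, 0.50), 2))  # 0.5
print(round(f1(0.30, 0.70), 2))  # 0.42
print(round(f1(0.10, 0.90), 2))  # 0.18
```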
How do you annotate?
So if you are constructing a corpus for a
machine learning tool where only
precision matters, all you need is a corpus
of presumed positives that you mark as
right or wrong (or as the label versus ‘other’).

If you need to get good recall as well, you
will need a corpus annotated with all the
relevant labels.
Applications to Corpus Linguistics
•   What to annotate
•   How to develop insights
•   How to annotate
•   How much data to annotate
•   How to avoid mistakes in using the corpus
How much data should you annotate?
  • The problem: ‘named entity classification’
  • What is better?
    – 2000 words per category (each of Per, Org,
      Loc, Prod, Time)
    – 5000 words per category (each of Per, Org,
      Loc, Prod, Time)
Small Corpus – 4 Fold Cross-Validation

          Split          Train Folds         Test Fold

       First Run    • 1, 2, 3          • 4


       Second Run   • 2, 3, 4          • 1


       Third Run    • 3, 4, 1          • 2


       Fourth Run   • 4, 1, 2          • 3
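
A minimal sketch of that fold rotation; the placeholder documents and the round-robin split are assumptions:

```python
# 4-fold cross-validation: each fold serves as the test set in exactly one run.
def cross_validation_runs(items, k=4):
    folds = [items[i::k] for i in range(k)]               # round-robin split into k folds
    for test_index in range(k):
        test = folds[test_index]
        train = [x for i, fold in enumerate(folds) if i != test_index for x in fold]
        yield train, test

documents = [f"doc{i}" for i in range(20)]                # placeholder corpus
for run, (train, test) in enumerate(cross_validation_runs(documents), start=1):
    print(f"Run {run}: {len(train)} training docs, {len(test)} test docs")
```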
Statistical significance in a paper


     [Figure: statistical significance depends on the estimate and its
     variance (the error margins around it)]

    Remember to take Inter-Annotator Agreement into account
How much do you annotate?
So you increase the corpus size until the
error margins drop to a value that the
experimenter considers sufficient.

The smaller the error margins, the finer
the comparisons the experimenter can
make between algorithms.
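
One simple way to see how error margins shrink with corpus size is the binomial standard error of an accuracy estimate; this sketch, and the 80% accuracy figure in it, are illustrative assumptions rather than something from the slides:

```python
import math

def accuracy_margin(accuracy, n, z=1.96):
    """Approximate 95% margin of error for an accuracy measured on n test items."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n)

for n in (100, 1000, 10000):
    print(n, round(accuracy_margin(0.80, n), 3))
# 100   -> 0.078  (too wide to tell 80% from 82%)
# 1000  -> 0.025
# 10000 -> 0.008  (fine-grained comparisons between algorithms become possible)
```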
Applications to Corpus Linguistics
•   What to annotate
•   How to develop insights
•   How to annotate
•   How much data to annotate
•   How to avoid mistakes in using the
    corpus
Avoid Mistakes
• The problem: ‘train a classifier’
• What is better?
  – Train with all the data that you have, and
    then test on all the data that you have?
  – Train on half and test on the other half?
Avoid Mistakes
• Training a classifier on the full corpus and
  then running tests on the same corpus
  is a bad idea because it is a bit like
  revealing the exam questions
  before the exam.
• A simple algorithm that can game such a
  test is a plain memorization algorithm
  that memorizes all the possible inputs
  and the corresponding outputs.
Corpus Splits

        Split            Percentage

Training        • 60%


Validation      • 20%


Testing         • 20%


Total           • 100%
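
A minimal sketch of producing that 60/20/20 split, shuffling first so the sections do not follow document order; the seed and the placeholder data are assumptions:

```python
import random

def split_corpus(items, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle and split a corpus into training / validation / testing sections."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train_frac)
    n_valid = int(len(items) * valid_frac)
    return (items[:n_train],                    # training   (60%)
            items[n_train:n_train + n_valid],   # validation (20%)
            items[n_train + n_valid:])          # testing    (20%)

train, valid, test = split_corpus(range(1000))
print(len(train), len(valid), len(test))        # 600 200 200
```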
How do you avoid mistakes?
Do not train a machine learning algorithm on the
‘testing’ section of the corpus.

During the development/tuning of the algorithm,
do not make any measurements using the
‘testing’ section, or you’re likely to ‘cheat’ on the
feature set and settings. Use the ‘validation’
section for that.

I have seen researchers claim 99.7% accuracy on
Indian language POS tagging because they failed
to keep the different sections of their corpus
sufficiently well separated.
