Statistical Tools for Linguists

        Cohan Sujay Carlos
           Aiaioo Labs
           Bangalore
Text Analysis and Statistical Methods

 • Motivation
 • Statistics and Probabilities
 • Application to Corpus Linguistics
Motivation
• Human Development is all about Tools
  – Describe the world
  – Explain the world
  – Solve problems in the world
• Some of these tools
  – Language
  – Algorithms
  – Statistics and Probabilities
Motivation – Algorithms for Education Policy

 • 300 to 400 million people are illiterate
 • If we took 1000 teachers, 100 students per
   class, and 3 years of teaching per student

   –12000 years
 • If we had 100,000 teachers

   –120 years
Motivation – Algorithms for Education Policy

 • 300 to 400 million people are illiterate
 • If we took 1 teacher, 10 students per class,
   and 3 years of teaching per student.
 • Then each student teaches 10 more students.

   – about 30 years
 • We could turn the whole world literate in

   – about 34 years
Motivation – Algorithms for Education Policy


 Difference:

 Policy 1 is O(n) time
 Policy 2 is O(log n) time
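
The arithmetic behind the two policies can be sketched as follows. This is a minimal illustration, not the author's own calculation: the function names are hypothetical, the population is taken as 400 million, and the second policy assumes that everyone who is already literate keeps teaching a new class each round.

```python
def linear_policy_years(population, teachers, class_size, years_per_cohort):
    """Policy 1: a fixed pool of teachers teaches one cohort after another."""
    students_per_cohort = teachers * class_size
    # Integer ceiling division: number of cohorts needed to cover everyone.
    cohorts = -(-population // students_per_cohort)
    return cohorts * years_per_cohort


def cascade_policy_years(population, class_size, years_per_cohort):
    """Policy 2: every literate person goes on to teach a class.

    Assumption: earlier teachers keep teaching, so the literate count is
    multiplied by (class_size + 1) in every round.
    """
    literate, years = 1, 0
    while literate < population:
        literate *= class_size + 1
        years += years_per_cohort
    return years


print(linear_policy_years(400_000_000, 1_000, 100, 3))    # 12000
print(linear_policy_years(400_000_000, 100_000, 100, 3))  # 120
print(cascade_policy_years(400_000_000, 10, 3))           # 27
```

Under these assumptions the cascade needs 9 rounds of 3 years, which is in the same range as the "about 30 years" on the slide. The number of rounds grows with the logarithm of the population, while Policy 1 grows in direct proportion to it.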
Motivation – Statistics for Linguists

 We have shown that:
 Using a tool from computer science, we can
 solve a problem in quite another area.

                   SIMILARLY

 Linguists will find statistics to be a handy tool
 to better understand languages.
Applications of Statistics to Linguistics


     • How can statistics be useful?
     • Can probabilities be useful?
Introduction to Aiaioo Labs
• Focus on Text Analysis, NLP, ML, AI
• Applications to business problems
• Team consists of
  – Researchers
     • Cohan
     • Madhulika
     • Sumukh
  – Linguists
  – Engineers
  – Marketing
Applications to Corpus Linguistics
•   What to annotate
•   How to develop insights
•   How to annotate
•   How much data to annotate
•   How to avoid mistakes in using the corpus
Approach to corpus construction
• The problem: ‘word semantics’
• What is better?
  – Wordnet
  – Google terabyte corpus (with annotations?)
Approach to corpus construction
• The problem: ‘word semantics’
• What is better?
  – Wordnet (set of rules about the real world)
  – Google terabyte corpus (real world)
Approach to corpus construction
• The problem: ‘word semantics’
• What is better?
    – Wordnet (not countable)
    – Google terabyte corpus (countable)



For training machine learning algorithms, the latter might be more valuable,
just because it is possible to tally up evidence on the latter corpus.

Of course I am simplifying things a lot and I don’t mean that the former is not
valuable at all.
Approach to corpus construction
 So if you are constructing a corpus on
 which machine learning methods might
 be applied, construct your corpus so that
 you retain as many examples of surface
 forms as possible.
Applications to Corpus Linguistics
•   What to annotate
•   How to develop insights
•   How to annotate
•   How much data to annotate
•   How to avoid mistakes in using the corpus
Problem : Spelling

1.   Field
2.   Wield
3.   Shield
4.   Deceive
5.   Receive
6.   Ceiling

                       Courtesy of http://norvig.com/chomsky.html
Rule-based Approach


    “I before E except after C”

-- an example of a linguistic insight




                 Courtesy of http://norvig.com/chomsky.html
Probabilistic Statistical Model:
• Count the occurrences of ‘ie’ and ‘ei’ and ‘cie’
  and ‘cei’ in a large corpus

P(IE) = 0.0177
P(EI) = 0.0046
P(CIE) = 0.0014
P(CEI) = 0.0005

                         Courtesy of http://norvig.com/chomsky.html
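
A count of this kind can be sketched in a few lines. The normalisation used here (dividing by the number of letter sequences of the same length inside words) is an assumption for illustration and may differ from how the figures above were computed; the sample sentence is made up.

```python
import re
from collections import Counter


def sequence_probabilities(text, sequences=("ie", "ei", "cie", "cei")):
    """Estimate P(seq) as count(seq) / count(all letter sequences of that length)."""
    words = re.findall(r"[a-z]+", text.lower())
    probabilities = {}
    for seq in sequences:
        n = len(seq)
        # Every run of n consecutive letters inside a word.
        grams = Counter(w[i:i + n] for w in words for i in range(len(w) - n + 1))
        total = sum(grams.values())
        probabilities[seq] = grams[seq] / total if total else 0.0
    return probabilities


sample = "The ancient society received a shield from the science field."
for seq, p in sequence_probabilities(sample).items():
    print(seq, round(p, 4))
```

On a real corpus the text would be read from files rather than a single sentence; the counting logic stays the same.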
Words where ‘ie’ occurs after ‘c’
•   science
•   society
•   ancient
•   species




                   Courtesy of http://norvig.com/chomsky.html
But you can go back to a Rule-based
             Approach


  “I before E except after C only if C is not
               preceded by an S”

    -- an example of a linguistic insight


                      Courtesy of http://norvig.com/chomsky.html
What is a probability?

• A number between 0 and 1
• The probabilities of all possible outcomes sum to 1

Heads                  Tails




• P(heads) = 0.5
• P(tails) = 0.5
Estimation of P(IE)



P(“IE”) = C(“IE”) / C(all two-letter sequences in my corpus)
What is Estimation?



P(“UN”) = C(“UN”) / C(all words in my corpus)
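
The same relative-frequency estimate for words can be sketched as below. The function name and the toy token list are illustrative assumptions; a real corpus would need proper tokenisation.

```python
from collections import Counter


def estimate_word_probability(word, tokens):
    """Relative-frequency estimate: C(word) / C(all words in the corpus)."""
    if not tokens:
        return 0.0
    return Counter(tokens)[word] / len(tokens)


tokens = "the UN Security Council met and the UN voted".split()
print(estimate_word_probability("UN", tokens))  # 2 occurrences out of 9 tokens

# The estimates over the whole vocabulary sum to 1, as a probability should.
total = sum(estimate_word_probability(w, tokens) for w in set(tokens))
print(round(total, 10))
```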
Applications to Corpus Linguistics
•   What to annotate
•   How to develop insights
•   How to annotate
•   How much data to annotate
•   How to avoid mistakes in using the corpus
How do you annotate?
• The problem: ‘named entity classification’
• What is better?
  – Per, Org, Loc, Prod, Time
  – Right, Wrong
How do you annotate?
• The problem: ‘named entity classification’
• What is better?
  – Per, Org, Loc, Prod, Time
  – Right, Wrong



      It depends on whether you care about
      precision or recall or both.
What are Precision and Recall



 Classification metrics used to compare ML
 algorithms.
Classification Metrics

     Politics                   Sports

The UN Security            Warwickshire's Clarke
Council adopts its first   equalled the first-class
clear condemnation of      record of seven

    How do you compare two ML algorithms?
Classification Quality Metrics
               Point of view = Politics

                            Gold - Politics       Gold - Sports

Observed - Politics   TP (True Positive)      FP (False Positive)


Observed - Sports     FN (False Negative)     TN (True Negative)
Classification Quality Metrics
               Point of view = Sports

                           Gold - Politics       Gold - Sports

Observed - Politics   TN (True Negative)     FN (False Negative)


Observed - Sports     FP (False Positive)    TP (True Positive)
Classification Quality Metric - Accuracy
                   Point of view = Sports

                               Gold - Politics      Gold – Sports

    Observed - Politics   TN (True Negative)     FN (False Negative)


    Observed - Sports     FP (False Positive)    TP (True Positive)
Metrics for Measuring Classification Quality
                      Point of View – Class 1

                              Gold Class 1               Gold Class 2

   Observed Class 1     TP                          FP


   Observed Class 2     FN                          TN




                Great metrics for highly unbalanced corpora!
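
The standard definitions in terms of the four cells are: accuracy = (TP + TN) / (TP + FP + FN + TN), precision = TP / (TP + FP), recall = TP / (TP + FN). A minimal sketch, with hypothetical counts, shows why precision and recall are more informative than accuracy on an unbalanced corpus. Returning 0.0 when a denominator is zero is a convention chosen here, not something the slides specify.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from the four confusion-matrix cells."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        # Of everything the classifier labelled Class 1, how much was right?
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of everything that really is Class 1, how much was found?
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical unbalanced corpus: 10 items of Class 1, 990 of Class 2.
# A classifier that always answers "Class 2" still looks accurate.
print(classification_metrics(tp=0, fp=0, fn=10, tn=990))

# A classifier that actually finds most of Class 1.
print(classification_metrics(tp=8, fp=2, fn=2, tn=988))
```

The first classifier reaches 99% accuracy while finding nothing, and its recall of 0 exposes that immediately.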
Metrics for Measuring Classification Quality




  F-Score = the harmonic mean of Precision and Recall
F-Score Generalized


$$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}$$
Precision, Recall, Average, F-Score

                 Precision          Recall         Average           F-Score

 Classifier 1   50%              50%             50%                50%


 Classifier 2   30%              70%             50%                42%


 Classifier 3   10%              90%             50%                18%




                 What sort of classifier fares worst?
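
The table can be reproduced with a short sketch. It uses the weighted harmonic mean F = 1 / (alpha/P + (1 - alpha)/R), where alpha is a weight between 0 and 1 and alpha = 0.5 gives the plain harmonic mean. The function name and the zero-handling are choices made for this illustration.

```python
def f_score(precision, recall, alpha=0.5):
    """Weighted harmonic mean of precision and recall.

    alpha = 0.5 weighs both equally (the usual F-score); a larger alpha
    puts more weight on precision.
    """
    if precision == 0 or recall == 0:
        return 0.0
    return 1.0 / (alpha / precision + (1 - alpha) / recall)


classifiers = {
    "Classifier 1": (0.50, 0.50),
    "Classifier 2": (0.30, 0.70),
    "Classifier 3": (0.10, 0.90),
}
for name, (p, r) in classifiers.items():
    average = (p + r) / 2
    print(f"{name}: average = {average:.0%}, F-score = {f_score(p, r):.0%}")
```

All three classifiers have the same arithmetic average, but the F-score drops as precision and recall move apart, so the most lopsided classifier fares worst.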
How do you annotate?
So if you are constructing a corpus for a
machine learning tool where only
precision matters, all you need is a corpus
of presumed positives that you mark as
right or wrong (or the label and other).

If you need to get good recall as well, you
will need a corpus annotated with all the
relevant labels.
Applications to Corpus Linguistics
•   What to annotate
•   How to develop insights
•   How to annotate
•   How much data to annotate
•   How to avoid mistakes in using the corpus
How much data should you annotate?
  • The problem: ‘named entity classification’
  • What is better?
    – 2000 words per category (each of Per, Org,
      Loc, Prod, Time)
    – 5000 words per category (each of Per, Org,
      Loc, Prod, Time)
Small Corpus – 4 Fold Cross-Validation

          Split          Train Folds         Test Fold

       First Run    • 1, 2, 3          • 4


       Second Run   • 2, 3, 4          • 1


       Third Run    • 3, 4, 1          • 2


       Fourth Run   • 4, 1, 2          • 3
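
The rotation in the table can be sketched as a small generator. This is an illustrative helper, not a library function; the order in which folds take the test role differs from the table, which does not affect the result.

```python
def cross_validation_runs(items, n_folds=4):
    """Yield (train_items, test_items) pairs, one per fold.

    Each fold serves as the test set exactly once while the remaining
    folds are pooled for training.
    """
    # Deal the items into n_folds folds, round-robin.
    folds = [items[i::n_folds] for i in range(n_folds)]
    for test_index in range(n_folds):
        train = [
            item
            for i, fold in enumerate(folds)
            if i != test_index
            for item in fold
        ]
        yield train, folds[test_index]


corpus = [f"sentence-{i}" for i in range(12)]  # placeholder items
for run, (train, test) in enumerate(cross_validation_runs(corpus), start=1):
    print(f"Run {run}: {len(train)} training items, {len(test)} test items")
```

Averaging the metric over the four runs makes use of every annotated item for testing without ever testing on an item that was used for training in the same run.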
Statistical significance in a paper


    [Diagram labels: significance, estimate, variance]




    Remember to take Inter-Annotator Agreement into account
How much do you annotate?
So you increase the corpus size until
the error margins drop to a value that the
experimenter considers sufficient.

The smaller the error margins, the finer
the comparisons the experimenter can
make between algorithms.
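
One rough way to see how the error margin shrinks with corpus size is the normal approximation to a binomial proportion. This is a sketch under stated assumptions: test items are treated as independent, the 95% level (z = 1.96) and the 90% accuracy figure are arbitrary choices, and the function name is hypothetical.

```python
import math


def accuracy_error_margin(accuracy, n_test_items, z=1.96):
    """Approximate half-width of the confidence interval around an accuracy.

    Uses the normal approximation to the binomial distribution.
    """
    return z * math.sqrt(accuracy * (1 - accuracy) / n_test_items)


for n in (2000, 5000):
    margin = accuracy_error_margin(0.90, n)
    print(f"{n} items: 90% accuracy, margin of about {margin:.2%}")
```

Because the margin falls with the square root of the number of items, halving it takes roughly four times as much annotated data. Two algorithms whose scores differ by less than the margin cannot be told apart on that corpus.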
Applications to Corpus Linguistics
•   What to annotate
•   How to develop insights
•   How to annotate
•   How much data to annotate
•   How to avoid mistakes in using the
    corpus
Avoid Mistakes
• The problem: ‘train a classifier’
• What is better?
  – Train with all the data that you have, and
    then test on all the data that you have?
  – Train on half and test on the other half?
Avoid Mistakes
• Training a classifier on the full corpus and
  then running tests using the same corpus
  is a bad idea because it is a bit like
  revealing the questions in the exam
  before the exam.
• A simple algorithm that can game such a
  test is a plain memorization algorithm
  that memorizes all the possible inputs
  and the corresponding outputs.
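
Such a memorization algorithm takes only a few lines. The class, the labels and the example data below are all hypothetical and exist only to illustrate the point.

```python
class MemorizingClassifier:
    """Stores every training input with its label and looks it up later."""

    def __init__(self, default_label="unknown"):
        self.memory = {}
        self.default_label = default_label

    def train(self, examples):
        for text, label in examples:
            self.memory[text] = label

    def predict(self, text):
        # Anything not seen during training gets the default label.
        return self.memory.get(text, self.default_label)


def accuracy(classifier, examples):
    correct = sum(classifier.predict(text) == label for text, label in examples)
    return correct / len(examples)


seen = [("Bangalore", "Loc"), ("Aiaioo Labs", "Org"), ("Cohan", "Per")]
unseen = [("Mysore", "Loc"), ("Scribd", "Org")]

classifier = MemorizingClassifier()
classifier.train(seen)
print(accuracy(classifier, seen))    # 1.0 on the data it was trained on
print(accuracy(classifier, unseen))  # 0.0 on data it has never seen
```

The perfect score on the training data says nothing about how the classifier behaves on new text, which is the only thing the test is supposed to measure.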
Corpus Splits

        Split            Percentage

Training        • 60%


Validation      • 20%


Testing         • 20%


Total           • 100%
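
A 60/20/20 split can be sketched as below. Shuffling with a fixed seed is an assumption made here so that the split is reproducible; for some corpora a split by document or by source would be more appropriate than a random one.

```python
import random


def split_corpus(items, train=0.6, validation=0.2, seed=0):
    """Split items into training, validation and testing sections.

    Whatever is left after the training and validation shares goes to testing.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_validation = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_validation],
        shuffled[n_train + n_validation:],
    )


corpus = [f"sentence-{i}" for i in range(100)]  # placeholder items
training, validation, testing = split_corpus(corpus)
print(len(training), len(validation), len(testing))  # 60 20 20
```

Doing the split once, up front, and storing the three sections separately makes it harder to leak test items into training by accident.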
How do you avoid mistakes?
Do not train a machine learning algorithm on the
‘testing’ section of the corpus.

During the development/tuning of the algorithm,
do not make any measurements using the
‘testing’ section, or you’re likely to ‘cheat’ on the
feature set, and settings. Use the ‘validation’
section for that.

I have seen researchers claim 99.7% accuracy on
Indian language POS tagging because they failed
to keep the different sections of their corpus
sufficiently well separated.

A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 

Statistics for linguistics

  • 1. Statistical Tools for Linguists Cohan Sujay Carlos Aiaioo Labs Bangalore
  • 2. Text Analysis and Statistical Methods • Motivation • Statistics and Probabilities • Application to Corpus Linguistics
  • 3. Motivation • Human Development is all about Tools – Describe the world – Explain the world – Solve problems in the world • Some of these tools – Language – Algorithms – Statistics and Probabilities
  • 4. Motivation – Algorithms for Education Policy • 300 to 400 million people are illiterate • If we took 1000 teachers, 100 students per class, and 3 years of teaching per student –12000 years • If we had 100,000 teachers –120 years
  • 5. Motivation – Algorithms for Education Policy • 300 to 400 million people are illiterate • If we took 1 teacher, 10 students per class, and 3 years of teaching per student. • Then each student teaches 10 more students. – about 30 years • We could turn the whole world literate in – about 34 years
  • 6. Motivation – Algorithms for Education Policy Difference: Policy 1 is O(n) time Policy 2 is O(log n) time
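The arithmetic behind the two policies can be sketched as follows. This is a minimal illustration, not the presenter's own calculation: it assumes 400 million learners and, for Policy 2, that every literate person (including the starting teacher) teaches a fresh class in each 3-year round. The function names are hypothetical.

```python
def linear_policy_years(learners, teachers, class_size, years_per_class):
    """Policy 1: a fixed pool of teachers works through the learners batch by batch."""
    per_batch = teachers * class_size
    batches = -(-learners // per_batch)  # ceiling division
    return batches * years_per_class


def cascading_policy_years(learners, class_size, years_per_class):
    """Policy 2 (assumed model): everyone who is literate teaches a new class each round."""
    literate = 1  # the single starting teacher
    rounds = 0
    while literate - 1 < learners:  # do not count the starting teacher as a learner
        literate += literate * class_size
        rounds += 1
    return rounds * years_per_class


print(linear_policy_years(400_000_000, 1_000, 100, 3))    # 12000
print(linear_policy_years(400_000_000, 100_000, 100, 3))  # 120
print(cascading_policy_years(400_000_000, 10, 3))         # 27
```

Under these assumptions the cascade needs 9 rounds (27 years), in line with the "about 30 years" on the slide. The number of rounds grows with the logarithm of the number of learners, whereas Policy 1 grows linearly with it.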
  • 7. Motivation – Statistics for Linguists We have shown that: Using a tool from computer science, we can solve a problem in quite another area. SIMILARLY Linguists will find statistics to be a handy tool to better understand languages.
  • 8. Applications of Statistics to Linguistics • How can statistics be useful? • Can probabilities be useful?
  • 9. Introduction to Aiaioo Labs • Focus on Text Analysis, NLP, ML, AI • Applications to business problems • Team consists of – Researchers • Cohan • Madhulika • Sumukh – Linguists – Engineers – Marketing
  • 10. Applications to Corpus Linguistics • What to annotate • How to develop insights • How to annotate • How much data to annotate • How to avoid mistakes in using the corpus
  • 11. Approach to corpus construction • The problem: ‘word semantics’ • What is better? – Wordnet – Google terabyte corpus (with annotations?)
  • 12. Approach to corpus construction • The problem: ‘word semantics’ • What is better? – Wordnet (set of rules about the real world) – Google terabyte corpus (real world)
  • 13. Approach to corpus construction • The problem: ‘word semantics’ • What is better? – Wordnet (not countable) – Google terabyte corpus (countable) For training machine learning algorithms, the latter might be more valuable, just because it is possible to tally up evidence on the latter corpus. Of course I am simplifying things a lot and I don’t mean that the former is not valuable at all.
  • 14. Approach to corpus construction So if you are constructing a corpus on which machine learning methods might be applied, construct your corpus so that you retain as many examples of surface forms as possible.
  • 15. Applications to Corpus Linguistics • What to annotate • How to develop insights • How to annotate • How much data to annotate • How to avoid mistakes in using the corpus
  • 16. Problem : Spelling 1. Field 2. Wield 3. Shield 4. Deceive 5. Receive 6. Ceiling Courtesy of http://norvig.com/chomsky.html
  • 17. Rule-based Approach “I before E except after C” -- an example of a linguistic insight Courtesy of http://norvig.com/chomsky.html
  • 18. Probabilistic Statistical Model: • Count the occurrences of ‘ie’ and ‘ei’ and ‘cie’ and ‘cei’ in a large corpus P(IE) = 0.0177 P(EI) = 0.0046 P(CIE) = 0.0014 P(CEI) = 0.0005 Courtesy of http://norvig.com/chomsky.html
  • 19. Words where ‘ie’ occurs after ‘c’ • science • society • ancient • species Courtesy of http://norvig.com/chomsky.html
  • 20. But you can go back to a Rule-based Approach “I before E except after C only if C is not preceded by an S” -- an example of a linguistic insight Courtesy of http://norvig.com/chomsky.html
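A rule like this can be checked mechanically against a word list. The sketch below is one possible reading of the two rules (the function name and the `refined` flag are hypothetical), applied only to the example words from the slides.

```python
def follows_rule(word, refined=False):
    """Return True if every 'ie'/'ei' in `word` obeys "I before E except after C".

    With refined=True, the "after C" exception is switched off when that
    'c' is itself preceded by an 's'.
    """
    word = word.lower()
    for i in range(len(word) - 1):
        pair = word[i:i + 2]
        if pair not in ("ie", "ei"):
            continue
        after_c = i > 0 and word[i - 1] == "c"
        after_sc = i > 1 and word[i - 2:i] == "sc"
        if refined and after_sc:
            after_c = False
        expected = "ei" if after_c else "ie"
        if pair != expected:
            return False
    return True


words = ["field", "wield", "shield", "deceive", "receive", "ceiling",
         "science", "society", "ancient", "species"]
for w in words:
    print(w, follows_rule(w), follows_rule(w, refined=True))
```

On these ten words the refined rule accepts "science" but still rejects "society", "ancient" and "species". Counting how often a rule holds in a corpus is what shows how far an insight of this kind can be trusted.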
  • 21. What is a probability? • A number between 0 and 1 • The sum of the probabilities of all outcomes is 1 • Example: a coin toss with the outcomes Heads and Tails • P(heads) = 0.5 • P(tails) = 0.5
  • 22. Estimation of P(IE) P(“IE”) = C(“IE”) / C(all two letter sequences in my corpus)
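The estimate above is a ratio of two counts, which can be sketched in a few lines. This is an illustrative sketch only: it assumes letter sequences are counted inside words (not across spaces), and the toy corpus and the function name are hypothetical.

```python
import re


def ngram_probability(text, ngram):
    """Estimate P(ngram) as C(ngram) / C(all letter sequences of the same length)."""
    n = len(ngram)
    target = ngram.lower()
    hits = 0
    total = 0
    for word in re.findall(r"[a-z]+", text.lower()):
        for i in range(len(word) - n + 1):
            total += 1
            if word[i:i + n] == target:
                hits += 1
    return hits / total if total else 0.0


# Hypothetical toy corpus; a real estimate needs a large corpus.
toy_corpus = "The chief scientist received a piece of the ancient ceiling"
for pattern in ("ie", "ei", "cie", "cei"):
    print(pattern, round(ngram_probability(toy_corpus, pattern), 4))
```

The same function works for sequences of any length, so ‘ie’, ‘ei’, ‘cie’ and ‘cei’ can all be estimated against their own denominators.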
  • 23. What is Estimation? P(“UN”) = C(“UN”) / C(all words in my corpus)
  • 24. Applications to Corpus Linguistics • What to annotate • How to develop insights • How to annotate • How much data to annotate • How to avoid mistakes in using the corpus
  • 25. How do you annotate? • The problem: ‘named entity classification’ • What is better? – Per, Org, Loc, Prod, Time – Right, Wrong
  • 26. How do you annotate? • The problem: ‘named entity classification’ • What is better? – Per, Org, Loc, Prod, Time – Right, Wrong It depends on whether you care about precision or recall or both.
  • 27. What are Precision and Recall? Classification metrics used to compare ML algorithms.
  • 28. Classification Metrics • Politics: “The UN Security Council adopts its first clear condemnation of …” • Sports: “Warwickshire's Clarke equalled the first-class record of seven …” How do you compare two ML algorithms?
  • 29. Classification Quality Metrics (point of view = Politics)
                              Gold - Politics        Gold - Sports
        Observed - Politics   TP (True Positive)     FP (False Positive)
        Observed - Sports     FN (False Negative)    TN (True Negative)
  • 30. Classification Quality Metrics (point of view = Sports)
                              Gold - Politics        Gold - Sports
        Observed - Politics   TN (True Negative)     FN (False Negative)
        Observed - Sports     FP (False Positive)    TP (True Positive)
  • 31. Classification Quality Metric - Accuracy (point of view = Sports)
                              Gold - Politics        Gold - Sports
        Observed - Politics   TN (True Negative)     FN (False Negative)
        Observed - Sports     FP (False Positive)    TP (True Positive)
  • 32. Metrics for Measuring Classification Quality (point of view = Class 1)
                              Gold Class 1   Gold Class 2
        Observed Class 1      TP             FP
        Observed Class 2      FN             TN
    Great metrics for highly unbalanced corpora!
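The four cells of the table can be counted directly from gold and observed labels. The sketch below uses the standard definitions precision = TP / (TP + FP), recall = TP / (TP + FN) and accuracy = (TP + TN) / (TP + FP + FN + TN). The five labelled documents are hypothetical.

```python
def confusion_counts(gold, observed, positive):
    """Count TP, FP, FN, TN from the point of view of the `positive` class."""
    tp = fp = fn = tn = 0
    for g, o in zip(gold, observed):
        if o == positive and g == positive:
            tp += 1
        elif o == positive:
            fp += 1
        elif g == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0


def accuracy(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


# Hypothetical gold and observed labels for five documents.
gold = ["politics", "politics", "sports", "sports", "sports"]
observed = ["politics", "sports", "sports", "sports", "politics"]
for point_of_view in ("politics", "sports"):
    tp, fp, fn, tn = confusion_counts(gold, observed, point_of_view)
    print(point_of_view, precision(tp, fp), recall(tp, fn), accuracy(tp, fp, fn, tn))
```

Changing the point of view swaps TP with TN and FP with FN, so precision and recall change while accuracy stays the same.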
  • 33. Metrics for Measuring Classification Quality F-Score = the harmonic mean of Precision and Recall
  • 34. F-Score Generalized $F = \dfrac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}$
  • 35. Precision, Recall, Average, F-Score
                       Precision   Recall   Average   F-Score
        Classifier 1   50%         50%      50%       50%
        Classifier 2   30%         70%      50%       42%
        Classifier 3   10%         90%      50%       18%
    What is the sort of classifier that fares worst?
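The figures in this table can be reproduced with a short sketch. It assumes the weighted harmonic mean form F = 1 / (alpha / P + (1 - alpha) / R), with alpha = 0.5 giving the plain harmonic mean. The function name and the `alpha` parameter name are hypothetical.

```python
def f_score(precision, recall, alpha=0.5):
    """Weighted harmonic mean of precision and recall (alpha=0.5: plain harmonic mean)."""
    if precision == 0 or recall == 0:
        return 0.0
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)


classifiers = {
    "Classifier 1": (0.5, 0.5),
    "Classifier 2": (0.3, 0.7),
    "Classifier 3": (0.1, 0.9),
}
for name, (p, r) in classifiers.items():
    average = (p + r) / 2
    print(f"{name}: average={average:.0%}, F-score={f_score(p, r):.0%}")
```

The arithmetic average is 50% for all three classifiers, while the harmonic mean falls as precision and recall drift apart, so a lopsided classifier fares worst.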
  • 36. How do you annotate? So if you are constructing a corpus for a machine learning tool where only precision matters, all you need is a corpus of presumed positives that you mark as right or wrong (or as the label and ‘other’). If you need to get good recall as well, you will need a corpus annotated with all the relevant labels.
  • 37. Applications to Corpus Linguistics • What to annotate • How to develop insights • How to annotate • How much data to annotate • How to avoid mistakes in using the corpus
  • 38. How much data should you annotate? • The problem: ‘named entity classification’ • What is better? – 2000 words per category (each of Per, Org, Loc, Prod, Time) – 5000 words per category (each of Per, Org, Loc, Prod, Time)
  • 39. Small Corpus – 4 Fold Cross-Validation Split
                     Train Folds   Test Fold
        First Run    1, 2, 3       4
        Second Run   2, 3, 4       1
        Third Run    3, 4, 1       2
        Fourth Run   4, 1, 2       3
  • 40. Statistical significance in a paper [Figure: estimate, variance, significance] Remember to take Inter-Annotator Agreement into account
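The rotation of train and test folds can be sketched as below. This is a minimal illustration with a hypothetical function name. The folds are held out in the order 1, 2, 3, 4 rather than the order shown on the slide, which does not affect the result.

```python
def cross_validation_runs(items, k=4):
    """Yield (train, test) pairs so that each fold is the test set exactly once."""
    folds = [items[i::k] for i in range(k)]
    for held_out in range(k):
        train = [x for j, fold in enumerate(folds) if j != held_out for x in fold]
        yield train, folds[held_out]


corpus = list(range(12))  # stand-in for 12 annotated examples
for run, (train, test) in enumerate(cross_validation_runs(corpus), start=1):
    print(run, "train:", train, "test:", test)
```

Every example is used for testing once and for training in the other three runs, which is what makes the scheme attractive for a small corpus.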
  • 41. How much do you annotate? So you increase the corpus size until the error margins drop to a value that the experimenter considers sufficient. The smaller the error margins, the finer the comparisons the experimenter can make between algorithms.
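One common way to put a number on the error margin is the normal approximation to the binomial. This is an assumption made for illustration, not necessarily the method used in the talk. The sketch treats test items as independent, and the function name is hypothetical.

```python
import math


def accuracy_margin(accuracy, n_test_items, z=1.96):
    """Approximate 95% margin of error for an accuracy measured on n_test_items."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_test_items)


for n in (100, 400, 2000, 5000):
    print(n, round(accuracy_margin(0.8, n), 3))
```

Under this approximation the margin shrinks with the square root of the number of test items, so halving it takes roughly four times as much annotated data.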
  • 42. Applications to Corpus Linguistics • What to annotate • How to develop insights • How to annotate • How much data to annotate • How to avoid mistakes in using the corpus
  • 43. Avoid Mistakes • The problem: ‘train a classifier’ • What is better? – Train with all the data that you have, and then test on all the data that you have? – Train on half and test on the other half?
  • 44. Avoid Mistakes • Training a classifier on a full corpus and then running tests using the same corpus is a bad idea because it is a bit like revealing the questions in the exam before the exam. • A simple algorithm that can game such a test is a plain memorization algorithm that memorizes all the possible inputs and the corresponding outputs.
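The memorization algorithm mentioned above can be sketched as a lookup table. The class, the toy documents and the default label are all hypothetical and only serve to show the effect.

```python
class MemorizingClassifier:
    """Stores every training example and looks the answer up at test time."""

    def __init__(self, default_label):
        self.default_label = default_label
        self.memory = {}

    def train(self, examples):
        for text, label in examples:
            self.memory[text] = label

    def predict(self, text):
        # Unseen inputs fall back to a fixed guess.
        return self.memory.get(text, self.default_label)


def accuracy(classifier, examples):
    correct = sum(1 for text, label in examples if classifier.predict(text) == label)
    return correct / len(examples)


# Hypothetical toy data.
seen = [("UN council vote", "politics"), ("cricket record", "sports"),
        ("election results", "politics"), ("football final", "sports")]
unseen = [("senate debate", "politics"), ("tennis match", "sports")]

clf = MemorizingClassifier(default_label="politics")
clf.train(seen)
print("tested on training data:", accuracy(clf, seen))    # 1.0
print("tested on held-out data:", accuracy(clf, unseen))  # 0.5
```

The perfect score on the training data says nothing about how the classifier handles new text, which is why the test data has to be kept apart.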
  • 45. Corpus Splits
        Split        Percentage
        Training     60%
        Validation   20%
        Testing      20%
        Total        100%
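A 60/20/20 split can be sketched as follows. The function name, the fixed shuffling seed and the 100-item stand-in corpus are assumptions made for illustration.

```python
import random


def split_corpus(items, train=0.6, validation=0.2, seed=0):
    """Shuffle the corpus and cut it into training, validation and testing sections."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


training, validation, testing = split_corpus(range(100))
print(len(training), len(validation), len(testing))  # 60 20 20
```

Fixing the seed means the same items land in the testing section on every run, which makes it easier to keep that section untouched during development.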
  • 46. How do you avoid mistakes? Do not train a machine learning algorithm on the ‘testing’ section of the corpus. During the development/tuning of the algorithm, do not make any measurements using the ‘testing’ section, or you’re likely to ‘cheat’ on the feature set and settings. Use the ‘validation’ section for that. I have seen researchers claim 99.7% accuracy on Indian language POS tagging because they failed to keep the different sections of their corpus sufficiently well separated.