Statistical Tools for Linguists

        Cohan Sujay Carlos
           Aiaioo Labs
           Bangalore
Text Analysis and Statistical Methods

 • Motivation
 • Statistics and Probabilities
 • Application to Corpus Linguistics
Motivation
• Human Development is all about Tools
  – Describe the world
  – Explain the world
  – Solve problems in the world
• Some of these tools
  – Language
  – Algorithms
  – Statistics and Probabilities
Motivation – Algorithms for Education Policy

 • 300 to 400 million people are illiterate
 • If we took 1000 teachers, 100 students per
   class, and 3 years of teaching per student

   –12000 years
 • If we had 100,000 teachers

   –120 years
Motivation – Algorithms for Education Policy

 • 300 to 400 million people are illiterate
 • If we took 1 teacher, 10 students per class,
   and 3 years of teaching per student.
 • Then each student teaches 10 more students.

   – about 30 years
 • We could turn the whole world literate in

   – about 34 years
Motivation – Algorithms for Education Policy


 Difference:

 Policy 1 is O(n) time
 Policy 2 is O(log n) time
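
 The arithmetic behind the two policies can be checked with a short sketch. The function names are hypothetical, the 400 million figure is the upper end of the range quoted above, and the second function assumes (this is not stated in the slides) that only the newest graduates teach in each three-year round.

```python
import math


def policy1_years(illiterate, teachers, class_size, years_per_class):
    """Fixed pool of teachers: time grows linearly with the population, O(n)."""
    classes_per_teacher = math.ceil(illiterate / (teachers * class_size))
    return classes_per_teacher * years_per_class


def policy2_years(illiterate, class_size, years_per_class):
    """Each newly literate student becomes a teacher: O(log n) rounds.

    Assumption (not from the slides): only the newest graduates teach
    in the following round.
    """
    teachers, taught, rounds = 1, 0, 0
    while taught < illiterate:
        graduates = teachers * class_size
        taught += graduates
        teachers = graduates
        rounds += 1
    return rounds * years_per_class


print(policy1_years(400_000_000, 1_000, 100, 3))    # 12000
print(policy1_years(400_000_000, 100_000, 100, 3))  # 120
print(policy2_years(400_000_000, 10, 3))            # 27, i.e. "about 30 years"
```

 Multiplying the population by ten adds only one more round to the second policy, which is what the O(log n) label expresses.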
Motivation – Statistics for Linguists

 We have shown that:
 Using a tool from computer science, we can
 solve a problem in quite another area.

                   SIMILARLY

 Linguists will find statistics to be a handy tool
 to better understand languages.
Applications of Statistics to Linguistics


     • How can statistics be useful?
     • Can probabilities be useful?
Introduction to Aiaioo Labs
• Focus on Text Analysis, NLP, ML, AI
• Applications to business problems
• Team consists of
  – Researchers
     • Cohan
     • Madhulika
     • Sumukh
  – Linguists
  – Engineers
  – Marketing
Applications to Corpus Linguistics
•   What to annotate
•   How to develop insights
•   How to annotate
•   How much data to annotate
•   How to avoid mistakes in using the corpus
Approach to corpus construction
• The problem: ‘word semantics’
• What is better?
  – Wordnet
  – Google terabyte corpus (with annotations?)
Approach to corpus construction
• The problem: ‘word semantics’
• What is better?
  – Wordnet (set of rules about the real world)
  – Google terabyte corpus (real world)
Approach to corpus construction
• The problem: ‘word semantics’
• What is better?
    – Wordnet (not countable)
    – Google terabyte corpus (countable)



For training machine learning algorithms, the latter might be more valuable,
just because it is possible to tally up evidence on the latter corpus.

Of course I am simplifying things a lot and I don’t mean that the former is not
valuable at all.
Approach to corpus construction
 So if you are constructing a corpus on
 which machine learning methods might
 be applied, construct your corpus so that
 you retain as many examples of surface
 forms as possible.
Applications to Corpus Linguistics
•   What to annotate
•   How to develop insights
•   How to annotate
•   How much data to annotate
•   How to avoid mistakes in using the corpus
Problem : Spelling

1.   Field
2.   Wield
3.   Shield
4.   Deceive
5.   Receive
6.   Ceiling

                       Courtesy of http://norvig.com/chomsky.html
Rule-based Approach


    “I before E except after C”

-- an example of a linguistic insight




Probabilistic Statistical Model:
• Count the occurrences of ‘ie’ and ‘ei’ and ‘cie’
  and ‘cei’ in a large corpus

P(IE) = 0.0177
P(EI) = 0.0046
P(CIE) = 0.0014
P(CEI) = 0.0005
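
A minimal sketch of the counting step, run on the ten example words from these slides rather than on a large corpus. The word list and the helper name `count_sequences` are illustrative assumptions; the probabilities quoted above come from a much larger corpus and are not reproduced by this toy sample.

```python
def count_sequences(words, sequences):
    """Count how often each letter sequence occurs across a list of words."""
    return {seq: sum(word.lower().count(seq) for word in words)
            for seq in sequences}


words = ["field", "wield", "shield", "deceive", "receive",
         "ceiling", "science", "society", "ancient", "species"]
counts = count_sequences(words, ["ie", "ei", "cie", "cei"])
print(counts)  # {'ie': 7, 'ei': 3, 'cie': 4, 'cei': 3}
```

Even in this tiny sample, 'cie' is not rarer than 'cei', which is the pattern the corpus counts expose.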

Words where ‘ie’ occurs after ‘c’
•   science
•   society
•   ancient
•   species




But you can go back to a Rule-based
             Approach


  “I before E except after C only if C is not
               preceded by an S”

    -- an example of a linguistic insight


What is a probability?

• A number between 0 and 1
• The sum of the probabilities on all outcomes is 1

Heads                  Tails




• P(heads) = 0.5
• P(tails) = 0.5
Estimation of P(IE)



P(“IE”) = C(“IE”) / C(all two letter sequences in my corpus)
What is Estimation?



P(“UN”) = C(“UN”) / C(all words in my corpus)
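
Both estimates are relative frequencies: the count of the item of interest divided by the count of all items of the same kind. A small sketch, assuming a toy corpus and hypothetical helper names (`relative_frequency`, `letter_bigrams`) that are not from the slides:

```python
from collections import Counter


def relative_frequency(items, target):
    """Estimate P(target) as C(target) / C(all items)."""
    counts = Counter(items)
    total = sum(counts.values())
    return counts[target] / total if total else 0.0


def letter_bigrams(words):
    """Yield every two-letter sequence that occurs inside each word."""
    for word in words:
        for i in range(len(word) - 1):
            yield word[i:i + 2]


tokens = "the un security council adopts the resolution".split()
p_un = relative_frequency(tokens, "un")  # 1 of 7 words
p_ie = relative_frequency(letter_bigrams(["field", "ceiling"]), "ie")  # 1 of 10 bigrams
print(p_un, p_ie)
```

The only thing that changes between the two estimates is the unit being counted: words in one case, two-letter sequences in the other.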
Applications to Corpus Linguistics
•   What to annotate
•   How to develop insights
•   How to annotate
•   How much data to annotate
•   How to avoid mistakes in using the corpus
How do you annotate?
• The problem: ‘named entity classification’
• What is better?
  – Per, Org, Loc, Prod, Time
  – Right, Wrong
How do you annotate?
• The problem: ‘named entity classification’
• What is better?
  – Per, Org, Loc, Prod, Time
  – Right, Wrong



      It depends on whether you care about
      precision or recall or both.
What are Precision and Recall



 Classification metrics used to compare ML
 algorithms.
Classification Metrics

     Politics                   Sports

The UN Security            Warwickshire's Clarke
Council adopts its first   equalled the first-class
clear condemnation of      record of seven

    How do you compare two ML algorithms?
Classification Quality Metrics
               Point of view = Politics

                            Gold - Politics       Gold - Sports

Observed - Politics   TP (True Positive)      FP (False Positive)


Observed - Sports     FN (False Negative)     TN (True Negative)
Classification Quality Metrics
               Point of view = Sports

                           Gold - Politics       Gold - Sports

Observed - Politics   TN (True Negative)     FN (False Negative)


Observed - Sports     FP (False Positive)    TP (True Positive)
Classification Quality Metric - Accuracy
                   Point of view = Sports

                               Gold - Politics      Gold – Sports

    Observed - Politics   TN (True Negative)     FN (False Negative)


    Observed - Sports     FP (False Positive)    TP (True Positive)
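
Accuracy is conventionally the share of all decisions that are correct, (TP + TN) / (TP + TN + FP + FN). A minimal sketch with made-up counts (the numbers and the function name are illustrative assumptions):

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all items whose observed label matches the gold label."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


# Hypothetical counts for a politics/sports classifier.
print(accuracy(tp=40, tn=45, fp=5, fn=10))  # 0.85
```

Swapping the point of view exchanges TP with TN and FP with FN, so accuracy is the same from either side.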
Metrics for Measuring Classification Quality
                      Point of View – Class 1

                              Gold Class 1               Gold Class 2

   Observed Class 1     TP                          FP


   Observed Class 2     FN                          TN




                Great metrics for highly unbalanced corpora!
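
Precision and recall are conventionally defined from this table as TP / (TP + FP) and TP / (TP + FN). The sketch below, with invented counts and hypothetical function names, shows why they are informative on an unbalanced corpus where accuracy is not:

```python
def precision(tp, fp):
    """Of the items labelled Class 1, the fraction that really are Class 1.

    Returns 0.0 when nothing was labelled Class 1 (a convention, since the
    ratio is undefined in that case).
    """
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Of the items that really are Class 1, the fraction that were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Invented unbalanced corpus: 10 items of Class 1, 990 of Class 2.
# A classifier that labels everything Class 2 gets tp=0, fp=0, fn=10, tn=990.
tp, fp, fn, tn = 0, 0, 10, 990
print((tp + tn) / (tp + fp + fn + tn))    # accuracy: 0.99
print(precision(tp, fp), recall(tp, fn))  # 0.0 0.0
```

The do-nothing classifier looks excellent by accuracy and useless by recall, which is the more honest reading when Class 1 is rare.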
Metrics for Measuring Classification Quality




  F-Score = the harmonic mean of Precision and Recall
F-Score Generalized


 F_\alpha = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}
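
A sketch of the weighted form, assuming the weight is a number between 0 and 1 (called `alpha` here; the parameter name is an assumption) so that a weight of 0.5 gives back the plain harmonic mean:

```python
def weighted_f_score(p, r, alpha=0.5):
    """Weighted harmonic mean of precision p and recall r.

    alpha close to 1 emphasises precision; alpha close to 0 emphasises recall.
    """
    if p == 0 or r == 0:
        return 0.0
    return 1.0 / (alpha / p + (1.0 - alpha) / r)


print(round(weighted_f_score(0.3, 0.7), 2))             # 0.42
print(round(weighted_f_score(0.3, 0.7, alpha=0.9), 2))  # pulled towards precision
```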
Precision, Recall, Average, F-Score

                 Precision          Recall         Average           F-Score

 Classifier 1   50%              50%             50%                50%


 Classifier 2   30%              70%             50%                42%


 Classifier 3   10%              90%             50%                18%




                 What is the sort of classifier that fares worst?
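
The table can be reproduced directly. In this sketch (function names are illustrative), the arithmetic mean is 50% for all three classifiers, while the harmonic mean drops as precision and recall drift apart, so the most lopsided classifier fares worst:

```python
def arithmetic_mean(p, r):
    """Plain average of precision and recall."""
    return (p + r) / 2


def f_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


rows = {"Classifier 1": (0.5, 0.5),
        "Classifier 2": (0.3, 0.7),
        "Classifier 3": (0.1, 0.9)}
for name, (p, r) in rows.items():
    print(f"{name}: average={arithmetic_mean(p, r):.0%}  f-score={f_score(p, r):.0%}")
```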
How do you annotate?
So if you are constructing a corpus for a
machine learning tool where only
precision matters, all you need is a corpus
of presumed positives that you mark as
right or wrong (or as the label and ‘other’).

If you need to get good recall as well, you
will need a corpus annotated with all the
relevant labels.
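
For the precision-only case, a sketch of how such a corpus is used: every item the tool labelled as positive is marked right or wrong by an annotator, and precision is the share marked right. The judgement list and function name are invented for illustration.

```python
def precision_from_judgements(judgements):
    """Precision over presumed positives, each marked 'right' or 'wrong'."""
    if not judgements:
        return 0.0
    right = sum(1 for mark in judgements if mark == "right")
    return right / len(judgements)


# Invented annotator marks for five strings the tool tagged as Org.
marks = ["right", "right", "wrong", "right", "wrong"]
print(precision_from_judgements(marks))  # 0.6
```

Recall cannot be computed from this corpus because it holds no record of the positives the tool missed.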
Applications to Corpus Linguistics
•   What to annotate
•   How to develop insights
•   How to annotate
•   How much data to annotate
•   How to avoid mistakes in using the corpus
How much data should you annotate?
  • The problem: ‘named entity classification’
  • What is better?
    – 2000 words per category (each of Per, Org,
      Loc, Prod, Time)
    – 5000 words per category (each of Per, Org,
      Loc, Prod, Time)
Small Corpus – 4 Fold Cross-Validation

          Split          Train Folds         Test Fold

       First Run    • 1, 2, 3          • 4


       Second Run   • 2, 3, 4          • 1


       Third Run    • 3, 4, 1          • 2


       Fourth Run   • 4, 1, 2          • 3
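
A minimal sketch of the rotation in the table, assuming the corpus is a plain list and folds are formed by striding through it (a real corpus would usually be shuffled first). The function name is hypothetical, and the runs come out in fold order rather than in the order shown in the table.

```python
def k_fold_splits(items, k=4):
    """Yield (train, test) pairs; each fold serves as the test set exactly once."""
    folds = [items[i::k] for i in range(k)]
    for test_index in range(k):
        train = [item for i, fold in enumerate(folds)
                 if i != test_index for item in fold]
        yield train, folds[test_index]


corpus = list(range(8))  # stand-in for eight annotated examples
for train, test in k_fold_splits(corpus):
    print("train:", train, "test:", test)
```

Averaging the score over the four runs uses every annotated example for testing once, which matters most when the corpus is small.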
Statistical significance in a paper


    [Figure labels: significance, estimate, variance]




    Remember to take Inter-Annotator Agreement into account
How much do you annotate?
So you increase the corpus size until
the error margins drop to a value that the
experimenter considers sufficient.

The smaller the error margins, the finer
the comparisons the experimenter can
make between algorithms.
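
One common way to put a number on the error margin, offered here as an assumption rather than as the method used in the talk, is the normal-approximation confidence interval for a proportion such as accuracy: margin = z * sqrt(p(1 - p) / n). The sketch compares test sets of 2000 and 5000 items at an invented accuracy of 90%:

```python
import math


def error_margin(p, n, z=1.96):
    """Half-width of an approximate 95% confidence interval for a proportion.

    Assumes independent test items and the normal approximation to the binomial.
    """
    return z * math.sqrt(p * (1.0 - p) / n)


for n in (2000, 5000):
    print(n, round(error_margin(0.9, n), 4))
```

Under these assumptions the margin shrinks with the square root of the corpus size, so halving it takes about four times as much annotated data.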
Applications to Corpus Linguistics
•   What to annotate
•   How to develop insights
•   How to annotate
•   How much data to annotate
•   How to avoid mistakes in using the
    corpus
Avoid Mistakes
• The problem: ‘train a classifier’
• What is better?
  – Train with all the data that you have, and
    then test on all the data that you have?
  – Train on half and test on the other half?
Avoid Mistakes
• Training a classifier on a full corpus and
  then running tests using the same corpus
  is a bad idea because it is a bit like
  revealing the questions in the exam
  before the exam.
• A simple algorithm that can game such a
  test is a plain memorization algorithm
  that memorizes all the possible inputs
  and the corresponding outputs.
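
The memorization algorithm is easy to sketch. The class below is a hypothetical illustration, and the labels attached to the example strings are invented; it scores perfectly on the data it was trained on and fails on anything it has not seen.

```python
class Memorizer:
    """Stores every training input with its label; guesses a default otherwise."""

    def __init__(self, default="Other"):
        self.default = default
        self.table = {}

    def fit(self, examples):
        for text, label in examples:
            self.table[text] = label

    def predict(self, text):
        return self.table.get(text, self.default)


def accuracy(model, examples):
    """Fraction of examples whose predicted label equals the gold label."""
    return sum(model.predict(t) == y for t, y in examples) / len(examples)


train = [("Bangalore", "Loc"), ("Aiaioo Labs", "Org"), ("Cohan", "Per")]
unseen = [("Warwickshire", "Loc"), ("Clarke", "Per")]
model = Memorizer()
model.fit(train)
print(accuracy(model, train))   # 1.0
print(accuracy(model, unseen))  # 0.0
```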
Corpus Splits

        Split            Percentage

Training        • 60%


Validation      • 20%


Testing         • 20%


Total           • 100%
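
A sketch of the 60/20/20 split, assuming the corpus fits in a list and that items can be shuffled independently (items from the same document or source may need to be kept together; that is not handled here). The function name and the fixed seed are illustrative choices.

```python
import random


def split_corpus(items, train=0.6, validation=0.2, seed=0):
    """Shuffle once, then cut into training, validation and testing sections."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


training, validation_set, testing = split_corpus(range(100))
print(len(training), len(validation_set), len(testing))  # 60 20 20
```

Fixing the seed keeps the three sections identical from run to run, so the testing section stays untouched throughout development.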
How do you avoid mistakes?
Do not train a machine learning algorithm on the
‘testing’ section of the corpus.

During the development/tuning of the algorithm,
do not make any measurements using the
‘testing’ section, or you’re likely to ‘cheat’ on the
feature set, and settings. Use the ‘validation’
section for that.

I have seen researchers claim 99.7% accuracy on
Indian language POS tagging because they failed
to keep the different sections of their corpus
sufficiently well separated.

Event-Driven Architecture Masterclass: Challenges in Stream Processing
 
ADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptx
 
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
 
WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024
 
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
 
Portal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russePortal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russe
 
Generative AI Use Cases and Applications.pdf
Generative AI Use Cases and Applications.pdfGenerative AI Use Cases and Applications.pdf
Generative AI Use Cases and Applications.pdf
 
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxHarnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
 
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
 
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider  Progress from Awareness to Implementation.pptxTales from a Passkey Provider  Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
 

Statistics for linguistics

  • 1. Statistical Tools for Linguists Cohan Sujay Carlos Aiaioo Labs Bangalore
  • 2. Text Analysis and Statistical Methods • Motivation • Statistics and Probabilities • Application to Corpus Linguistics
  • 3. Motivation • Human Development is all about Tools – Describe the world – Explain the world – Solve problems in the world • Some of these tools – Language – Algorithms – Statistics and Probabilities
  • 4. Motivation – Algorithms for Education Policy • 300 to 400 million people are illiterate • If we took 1000 teachers, 100 students per class, and 3 years of teaching per student – 12,000 years • If we had 100,000 teachers – 120 years
  • 5. Motivation – Algorithms for Education Policy • 300 to 400 million people are illiterate • If we took 1 teacher, 10 students per class, and 3 years of teaching per student. • Then each student teaches 10 more students. – about 30 years • We could turn the whole world literate in – about 34 years
  • 6. Motivation – Algorithms for Education Policy Difference: Policy 1 is O(n) time Policy 2 is O(log n) time
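The arithmetic behind the two policies can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are made up here, the illiterate population is taken as 400 million, and in the second policy every literate person (teachers included) is assumed to teach a new class of 10 in each 3-year round. The slides do not spell out these details, so the second figure lands near, not exactly on, the "about 30 years" quoted above.

```python
def linear_policy_years(illiterate, teachers, class_size, years_per_class=3):
    """Policy 1: a fixed pool of teachers works through the population batch by batch."""
    per_batch = teachers * class_size
    batches = -(-illiterate // per_batch)  # ceiling division on integers
    return batches * years_per_class


def cascading_policy_years(illiterate, class_size, years_per_class=3):
    """Policy 2 (assumed reading): everyone who is literate teaches in every round."""
    literate_teachers = 1  # start with a single teacher
    taught = 0
    rounds = 0
    while taught < illiterate:
        new_students = literate_teachers * class_size
        taught += new_students
        literate_teachers += new_students  # graduates become teachers
        rounds += 1
    return rounds * years_per_class


policy_1_small = linear_policy_years(400_000_000, teachers=1000, class_size=100)
policy_1_large = linear_policy_years(400_000_000, teachers=100_000, class_size=100)
policy_2 = cascading_policy_years(400_000_000, class_size=10)
print(policy_1_small, policy_1_large, policy_2)
```

Policy 1 grows linearly with the population, while Policy 2 only needs as many rounds as it takes for the number of teachers to multiply past the population, which is the O(n) versus O(log n) contrast.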
  • 7. Motivation – Statistics for Linguists We have shown that: Using a tool from computer science, we can solve a problem in quite another area. SIMILARLY Linguists will find statistics to be a handy tool to better understand languages.
  • 8. Applications of Statistics to Linguistics • How can statistics be useful? • Can probabilities be useful?
  • 9. Introduction to Aiaioo Labs • Focus on Text Analysis, NLP, ML, AI • Applications to business problems • Team consists of – Researchers • Cohan • Madhulika • Sumukh – Linguists – Engineers – Marketing
  • 10. Applications to Corpus Linguistics • What to annotate • How to develop insights • How to annotate • How much data to annotate • How to avoid mistakes in using the corpus
  • 11. Approach to corpus construction • The problem: ‘word semantics’ • What is better? – Wordnet – Google terabyte corpus (with annotations?)
  • 12. Approach to corpus construction • The problem: ‘word semantics’ • What is better? – Wordnet (set of rules about the real world) – Google terabyte corpus (real world)
  • 13. Approach to corpus construction • The problem: ‘word semantics’ • What is better? – Wordnet (not countable) – Google terabyte corpus (countable) For training machine learning algorithms, the latter might be more valuable, just because it is possible to tally up evidence on the latter corpus. Of course I am simplifying things a lot and I don’t mean that the former is not valuable at all.
  • 14. Approach to corpus construction So if you are constructing a corpus on which machine learning methods might be applied, construct your corpus so that you retain as many examples of surface forms as possible.
  • 15. Applications to Corpus Linguistics • What to annotate • How to develop insights • How to annotate • How much data to annotate • How to avoid mistakes in using the corpus
  • 16. Problem : Spelling 1. Field 2. Wield 3. Shield 4. Deceive 5. Receive 6. Ceiling Courtesy of http://norvig.com/chomsky.html
  • 17. Rule-based Approach “I before E except after C” -- an example of a linguistic insight Courtesy of http://norvig.com/chomsky.html
  • 18. Probabilistic Statistical Model: • Count the occurrences of ‘ie’ and ‘ei’ and ‘cie’ and ‘cei’ in a large corpus P(IE) = 0.0177 P(EI) = 0.0046 P(CIE) = 0.0014 P(CEI) = 0.0005 Courtesy of http://norvig.com/chomsky.html
  • 19. Words where ‘ie’ occurs after ‘c’ • science • society • ancient • species Courtesy of http://norvig.com/chomsky.html
  • 20. But you can go back to a Rule-based Approach “I before E except after C only if C is not preceded by an S” -- an example of a linguistic insight Courtesy of http://norvig.com/chomsky.html
  • 21. What is a probability? • A number between 0 and 1 • The sum of the probabilities on all outcomes is 1 • Example: a coin toss with outcomes Heads and Tails • P(heads) = 0.5 • P(tails) = 0.5
  • 22. Estimation of P(IE) P(“IE”) = C(“IE”) / C(all two letter sequences in my corpus)
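The estimate above is a relative frequency, which can be sketched as follows. The function name `ngram_probability` and the toy word list are assumptions made for this illustration; a real estimate would be computed over a large corpus, and letter sequences are counted within words only.

```python
from collections import Counter


def ngram_probability(words, ngram):
    """Estimate P(ngram) as C(ngram) / C(all letter sequences of the same length)."""
    n = len(ngram)
    counts = Counter(
        word[i:i + n]
        for word in words
        for i in range(len(word) - n + 1)
    )
    total = sum(counts.values())
    return counts[ngram] / total if total else 0.0


# Toy corpus built from the spelling examples in the slides (lowercased).
toy_corpus = ["field", "wield", "shield", "deceive", "receive", "ceiling", "science"]

p_ie = ngram_probability(toy_corpus, "ie")
p_ei = ngram_probability(toy_corpus, "ei")
p_cie = ngram_probability(toy_corpus, "cie")
p_cei = ngram_probability(toy_corpus, "cei")
print(p_ie, p_ei, p_cie, p_cei)
```

On a corpus this small the numbers mean little; the point is only that each probability is a count divided by the total count of comparable sequences.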
  • 23. What is Estimation? P(“UN”) = C(“UN”) / C(all words in my corpus)
  • 24. Applications to Corpus Linguistics • What to annotate • How to develop insights • How to annotate • How much data to annotate • How to avoid mistakes in using the corpus
  • 25. How do you annotate? • The problem: ‘named entity classification’ • What is better? – Per, Org, Loc, Prod, Time – Right, Wrong
  • 26. How do you annotate? • The problem: ‘named entity classification’ • What is better? – Per, Org, Loc, Prod, Time – Right, Wrong It depends on whether you care about precision or recall or both.
  • 27. What are Precision and Recall? Classification metrics used to compare ML algorithms.
  • 28. Classification Metrics
        Politics: “The UN Security Council adopts its first clear condemnation of”
        Sports: “Warwickshire's Clarke equalled the first-class record of seven”
        How do you compare two ML algorithms?
  • 29. Classification Quality Metrics (Point of view = Politics)
                               Gold - Politics         Gold - Sports
        Observed - Politics    TP (True Positive)      FP (False Positive)
        Observed - Sports      FN (False Negative)     TN (True Negative)
  • 30. Classification Quality Metrics (Point of view = Sports)
                               Gold - Politics         Gold - Sports
        Observed - Politics    TN (True Negative)      FN (False Negative)
        Observed - Sports      FP (False Positive)     TP (True Positive)
  • 31. Classification Quality Metric - Accuracy (Point of view = Sports)
                               Gold - Politics         Gold - Sports
        Observed - Politics    TN (True Negative)      FN (False Negative)
        Observed - Sports      FP (False Positive)     TP (True Positive)
  • 32. Metrics for Measuring Classification Quality (Point of View – Class 1)
                               Gold Class 1    Gold Class 2
        Observed Class 1       TP              FP
        Observed Class 2       FN              TN
        Great metrics for highly unbalanced corpora!
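A small sketch of how the four cells turn into metrics. The formulas used here are the standard textbook definitions (accuracy = (TP + TN) / total, precision = TP / (TP + FP), recall = TP / (TP + FN)); they do not appear in the extracted slide text, and the function names and toy data are assumptions for illustration. The example uses a deliberately unbalanced corpus to show why accuracy alone can mislead.

```python
def confusion_counts(gold, observed, positive):
    """Count TP, FP, FN, TN from the point of view of the `positive` class."""
    tp = fp = fn = tn = 0
    for g, o in zip(gold, observed):
        if o == positive and g == positive:
            tp += 1
        elif o == positive:
            fp += 1
        elif g == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0


# Unbalanced toy corpus: 95 politics documents, 5 sports documents.
gold = ["politics"] * 95 + ["sports"] * 5
# A lazy classifier that labels everything as politics.
observed = ["politics"] * 100

tp, fp, fn, tn = confusion_counts(gold, observed, positive="sports")
acc = accuracy(tp, fp, fn, tn)
prec = precision(tp, fp)
rec = recall(tp, fn)
print(acc, prec, rec)
```

From the Sports point of view the lazy classifier still scores 95% accuracy, while its precision and recall are both zero, which is what makes the latter two useful on unbalanced corpora.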
  • 33. Metrics for Measuring Classification Quality F-Score = the harmonic mean of Precision and Recall
  • 34. F-Score Generalized $F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}$
  • 35. Precision, Recall, Average, F-Score
                        Precision    Recall    Average    F-Score
        Classifier 1    50%          50%       50%        50%
        Classifier 2    30%          70%       50%        42%
        Classifier 3    10%          90%       50%        18%
        What is the sort of classifier that fares worst?
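The table can be reproduced with a short sketch. It assumes the weighted harmonic mean form F = 1 / (α/P + (1 − α)/R), with α = 0.5 giving the plain harmonic mean; the function name `f_score` and the parameter name `alpha` are choices made for this illustration.

```python
def f_score(precision, recall, alpha=0.5):
    """Weighted harmonic mean of precision and recall (alpha=0.5 weights them equally)."""
    if precision == 0 or recall == 0:
        return 0.0
    return 1.0 / (alpha / precision + (1 - alpha) / recall)


classifiers = {
    "Classifier 1": (0.50, 0.50),
    "Classifier 2": (0.30, 0.70),
    "Classifier 3": (0.10, 0.90),
}

for name, (p, r) in classifiers.items():
    average = (p + r) / 2
    print(f"{name}: average={average:.2f} f_score={f_score(p, r):.2f}")
```

The arithmetic average is 0.50 for all three classifiers, but the harmonic mean drops as precision and recall drift apart, so the most lopsided classifier fares worst.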
  • 36. How do you annotate? So if you are constructing a corpus for a machine learning tool where only precision matters, all you need is a corpus of presumed positives that you mark as right or wrong (or the label and other). If you need to get good recall as well, you will need a corpus annotated with all the relevant labels.
  • 37. Applications to Corpus Linguistics • What to annotate • How to develop insights • How to annotate • How much data to annotate • How to avoid mistakes in using the corpus
  • 38. How much data should you annotate? • The problem: ‘named entity classification’ • What is better? – 2000 words per category (each of Per, Org, Loc, Prod, Time) – 5000 words per category (each of Per, Org, Loc, Prod, Time)
  • 39. Small Corpus – 4 Fold Cross-Validation
        Split         Train Folds    Test Fold
        First Run     1, 2, 3        4
        Second Run    2, 3, 4        1
        Third Run     3, 4, 1        2
        Fourth Run    4, 1, 2        3
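The rotation in the table can be generated mechanically. The sketch below is one way to do it; the function name `cross_validation_runs` is hypothetical, and it only produces fold numbers, leaving the actual assignment of corpus items to folds to the reader.

```python
def cross_validation_runs(num_folds):
    """Return (train_folds, test_fold) pairs, rotating the held-out fold each run."""
    folds = list(range(1, num_folds + 1))
    runs = []
    for start in range(num_folds):
        # Take num_folds - 1 consecutive folds (wrapping around) for training.
        train = [folds[(start + j) % num_folds] for j in range(num_folds - 1)]
        # The one remaining fold is held out for testing.
        test = folds[(start + num_folds - 1) % num_folds]
        runs.append((train, test))
    return runs


for run_number, (train, test) in enumerate(cross_validation_runs(4), start=1):
    print(f"Run {run_number}: train on folds {train}, test on fold {test}")
```

Every fold is used for testing exactly once, so a small corpus still yields a test measurement on all of its data.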
  • 40. Statistical significance in a paper (figure labels: significance, estimate, variance) Remember to take Inter-Annotator Agreement into account
  • 41. How much do you annotate? So you increase the corpus size until the error margins drop to a value that the experimenter considers sufficient. The smaller the error margins, the finer the comparisons the experimenter can make between algorithms.
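One common way to put a number on an error margin is the normal approximation to the binomial, sketched below. This is an assumption made for illustration, not necessarily the method the slides have in mind: it treats test items as independent, uses z = 1.96 for a roughly 95% interval, and the function name `error_margin` and the 90% accuracy figure are made up for the example.

```python
import math


def error_margin(accuracy, num_test_items, z=1.96):
    """Approximate half-width of a confidence interval around a measured accuracy."""
    standard_error = math.sqrt(accuracy * (1 - accuracy) / num_test_items)
    return z * standard_error


# The same measured accuracy on two corpus sizes.
margin_2000 = error_margin(0.90, 2000)
margin_5000 = error_margin(0.90, 5000)
print(f"2000 items: +/- {margin_2000:.4f}")
print(f"5000 items: +/- {margin_5000:.4f}")
```

Under these assumptions the margin shrinks with the square root of the corpus size, so two algorithms that differ by less than the margin cannot be told apart until more data is annotated.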
  • 42. Applications to Corpus Linguistics • What to annotate • How to develop insights • How to annotate • How much data to annotate • How to avoid mistakes in using the corpus
  • 43. Avoid Mistakes • The problem: ‘train a classifier’ • What is better? – Train with all the data that you have, and then test on all the data that you have? – Train on half and test on the other half?
  • 44. Avoid Mistakes • Training a classifier on a full corpus and then running tests using the same corpus is a bad idea because it is a bit like revealing the questions in the exam before the exam. • A simple algorithm that can game such a test is a plain memorization algorithm that memorizes all the possible inputs and the corresponding outputs.
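The memorization algorithm is easy to write down, which is what makes the point so clear. The sketch below is a toy: the class name, the default label, and the example entities and labels are all invented for illustration.

```python
class MemorizingClassifier:
    """Stores every training input with its label and learns nothing else."""

    def __init__(self, default_label):
        self.default_label = default_label
        self.memory = {}

    def train(self, examples):
        for text, label in examples:
            self.memory[text] = label

    def predict(self, text):
        # Unseen inputs fall back to a fixed guess.
        return self.memory.get(text, self.default_label)


def accuracy_on(classifier, examples):
    correct = sum(1 for text, label in examples if classifier.predict(text) == label)
    return correct / len(examples)


training_data = [("UN", "org"), ("Clarke", "per"), ("Bangalore", "loc"), ("Warwickshire", "org")]
unseen_data = [("Norvig", "per"), ("Google", "org"), ("India", "loc"), ("Aiaioo", "org")]

classifier = MemorizingClassifier(default_label="per")
classifier.train(training_data)

score_on_training_data = accuracy_on(classifier, training_data)
score_on_unseen_data = accuracy_on(classifier, unseen_data)
print(score_on_training_data, score_on_unseen_data)
```

Tested on its own training data the classifier looks perfect; tested on data it has not seen, it does no better than its fixed guess.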
  • 45. Corpus Splits
        Split         Percentage
        Training      60%
        Validation    20%
        Testing       20%
        Total         100%
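A 60/20/20 split can be sketched as below. The function name `split_corpus`, the fixed random seed, and the decision to shuffle before splitting are assumptions for this illustration; some corpora need to be split by document or by author instead, to keep related items out of different sections.

```python
import random


def split_corpus(items, train_frac=0.6, validation_frac=0.2, seed=0):
    """Shuffle once, then cut into training, validation and testing sections."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_validation = round(len(shuffled) * validation_frac)
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    testing = shuffled[n_train + n_validation:]  # whatever remains
    return training, validation, testing


corpus = [f"sentence {i}" for i in range(100)]
training, validation, testing = split_corpus(corpus)
print(len(training), len(validation), len(testing))
```

Because the three sections are slices of one shuffled list, no item can appear in more than one of them.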
  • 46. How do you avoid mistakes? Do not train a machine learning algorithm on the ‘testing’ section of the corpus. During the development/tuning of the algorithm, do not make any measurements using the ‘testing’ section, or you’re likely to ‘cheat’ on the feature set and settings. Use the ‘validation’ section for that. I have seen researchers claim 99.7% accuracy on Indian language POS tagging because they failed to keep the different sections of their corpus sufficiently well separated.