Collocations
Reading: Chap 5, Manning & Schutze
(note: this chapter is available online from the book’s page
http://nlp.stanford.edu/fsnlp/promo)
Instructor: Rada Mihalcea
Slide 2
What is a Collocation?
• A COLLOCATION is an expression consisting of two or more words
that correspond to some conventional way of saying things.
• The words together can mean more than the sum of their parts (The
Times of India, disk drive)
– Previous examples: hot dog, mother in law
• Examples of collocations
– noun phrases like strong tea and weapons of mass destruction
– phrasal verbs like to make up, and other phrases like the rich and
powerful.
• Valid or invalid?
– a stiff breeze but not a stiff wind (while either a strong breeze or a strong
wind is okay).
– broad daylight (but not bright daylight or narrow darkness).
Slide 3
Criteria for Collocations
• Typical criteria for collocations:
– non-compositionality
– non-substitutability
– non-modifiability.
• Collocations usually cannot be translated into other languages word
by word.
• A phrase can be a collocation even if it is not consecutive (as in the
example knock . . . door).
Slide 4
Non-Compositionality
• A phrase is compositional if its meaning can be predicted from the
meaning of its parts.
– E.g. new companies
• A phrase is non-compositional if the meaning cannot be predicted
from the meaning of the parts
– E.g. hot dog
• Collocations are often not fully compositional, in that there is
usually an element of meaning added to the combination. E.g. strong
tea.
• Idioms are the most extreme examples of non-compositionality. E.g.
to hear it through the grapevine.
Slide 5
Non-Substitutability
• We cannot substitute near-synonyms for the components of a
collocation.
• For example
– We can’t say yellow wine instead of white wine even though yellow is as
good a description of the color of white wine as white is (it is kind of a
yellowish white).
• Many collocations cannot be freely modified with additional lexical
material or through grammatical transformations (Non-
modifiability).
– E.g. white wine, but not whiter wine
– mother in law, but not mother in laws
Slide 6
Linguistic Subclasses of Collocations
• Light verbs:
– Verbs with little semantic content, like make, take and do.
– E.g. make lunch, take it easy
• Verb particle constructions
– E.g. to go down
• Proper nouns
– E.g. Bill Clinton
• Terminological expressions refer to concepts and objects in technical
domains.
– E.g. Hydraulic oil filter
Slide 7
Principal Approaches to Finding
Collocations
How to automatically identify collocations in text?
• Simplest method: Selection of collocations by frequency
• Selection based on mean and variance of the distance
between focal word and collocating word
• Hypothesis testing
• Mutual information
Slide 8
Frequency
• Find collocations by counting the number of
occurrences.
• Need also to define a maximum size window
• Usually results in a lot of function word pairs that need
to be filtered out.
• Fix: pass the candidate phrases through a part-of-speech
filter which only lets through those patterns that are
likely to be “phrases”. (Justeson and Katz, 1995)
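The frequency-plus-filter idea can be sketched as follows. This is a minimal illustration: the tiny hand-tagged word list and the short tag names (A = adjective, N = noun) are invented, and only the two bigram patterns A N and N N are applied; in practice the tags would come from a real part-of-speech tagger and the full Justeson and Katz pattern set would be used.

```python
from collections import Counter

# Toy hand-tagged corpus: (word, tag) pairs.
tagged = [("weapons", "N"), ("of", "P"), ("mass", "N"), ("destruction", "N"),
          ("and", "C"), ("strong", "A"), ("tea", "N"), ("with", "P"),
          ("strong", "A"), ("tea", "N")]

# Bigram tag patterns that are likely to be phrases: adjective-noun, noun-noun.
PATTERNS = {("A", "N"), ("N", "N")}

counts = Counter()
for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
    if (t1, t2) in PATTERNS:          # drop function-word pairs
        counts[(w1, w2)] += 1

print(counts.most_common(2))
```

Without the pattern filter, pairs like ("of", "the") would dominate the counts; with it, only noun-phrase-like bigrams survive.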
Slide 9
Most frequent bigrams in an
Example Corpus
Except for New York, all the
bigrams are pairs of
function words.
Slide 10
Part-of-speech tag patterns for collocation filtering
(Justeson and Katz).
Slide 11
The most highly ranked
phrases after applying
the filter on the same
corpus as before.
Slide 12
Collocational Window
Many collocations occur at variable distances. A
collocational window needs to be defined to locate these;
counting only adjacent bigrams cannot capture them.
she knocked on his door
they knocked at the door
100 women knocked on Donaldson’s door
a man knocked on the metal front door
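Extracting non-adjacent pairs can be sketched as below. The function name and the window size of 5 are illustrative choices, not anything prescribed by the chapter; the idea is simply to record the offset of every occurrence of the second word within the window after the first.

```python
WINDOW = 5  # maximum distance considered (an arbitrary illustrative choice)

def pair_offsets(tokens, w1, w2):
    """Offsets (position of w2 minus position of w1) within the window."""
    offsets = []
    for i, t in enumerate(tokens):
        if t != w1:
            continue
        for j in range(i + 1, min(i + WINDOW + 1, len(tokens))):
            if tokens[j] == w2:
                offsets.append(j - i)
    return offsets

sents = ["she knocked on his door",
         "they knocked at the door",
         "a man knocked on the metal front door"]
offsets = []
for s in sents:
    offsets += pair_offsets(s.split(), "knocked", "door")
print(offsets)   # one offset per sentence: door follows knocked at varying distances
```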
Slide 13
Mean and Variance
• The mean µ is the average offset between two words in the corpus.
• The variance s
– where n is the number of times the two words co-occur, di is the offset
for co-occurrence i, and  is the mean.
• Mean and variance characterize the distribution of distances
between two words in a corpus.
– High variance means that co-occurrence is mostly by chance
– Low variance means that the two words usually occur at about the same
distance.
s
Slide 14
Mean and Variance: An Example
For the knock, door example sentences the mean is:
And the sample deviation:
s
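The mean and sample deviation for the knock, door offsets can be checked with a few lines of Python:

```python
from math import sqrt

# Offsets between "knock" and "door" in the example sentences.
offsets = [3, 3, 5, 5]

n = len(offsets)
mean = sum(offsets) / n
# Sample variance divides by n - 1, matching the formula above.
var = sum((d - mean) ** 2 for d in offsets) / (n - 1)
s = sqrt(var)
print(mean, s)   # 4.0 and about 1.15
```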
Slide 15
Finding collocations based on mean
and variance
Slide 16
Ruling out Chance
• Two words can co-occur by chance.
– High frequency and low variance can be accidental
• Hypothesis Testing measures the confidence that this co-
occurrence was really due to association, and not just due to chance.
• Formulate a null hypothesis H0 that there is no association between
the words beyond chance occurrences.
• The null hypothesis states what should be true if two words do not
form a collocation.
• If the null hypothesis can be rejected, then the two words do not co-
occur by chance, and they form a collocation
• Compute the probability p that the event would occur if H0 were true,
and then reject H0 if p is too low (typically below a significance
level of 0.05, 0.01, 0.005, or 0.001); otherwise retain H0 as
possible.
Slide 17
The t-Test
• t-test looks at the mean and variance of a sample of measurements,
where the null hypothesis is that the sample is drawn from a
distribution with mean µ.
• The test looks at the difference between the observed and expected
means, scaled by the variance of the data, and tells us how likely one
is to get a sample of that mean and variance, assuming that the
sample is drawn from a normal distribution with mean µ.
t = (x̄ − µ) / √(s² / N)
where x̄ is the sample mean, s² is the sample
variance, N is the sample size, and µ is the mean of
the distribution under the null hypothesis.
Slide 18
t-Test for finding collocations
• Think of the text corpus as a long sequence of N bigrams, and the
samples are then indicator random variables with:
– value 1 when the bigram of interest occurs,
– 0 otherwise.
• The t-test and other statistical tests are useful as methods for
ranking collocations.
• Step 1: Determine the expected mean: under H0, µ = P(w1)P(w2)
• Step 2: Measure the observed mean: x̄ = C(w1 w2) / N
• Step 3: Run the t-test
Slide 19
t-Test: Example
• In our corpus, new occurs 15,828 times, companies 4,675
times, and there are 14,307,668 tokens overall.
• new companies occurs 8 times among the 14,307,668
bigrams
H0 : P(new companies) = P(new)P(companies)
= (15828/14307668) × (4675/14307668) ≈ 3.615 × 10⁻⁷
Slide 20
t-Test example
• For this distribution µ = 3.615 × 10⁻⁷ and s² = p(1 − p) ≈ p,
since p is small
• The t value of 0.999932 is not larger than 2.576, the critical
value for α = 0.005. So we cannot reject the null hypothesis
that new and companies occur independently and do not
form a collocation.
Slide 21
Hypothesis testing of differences (Church
and Hanks, 1989)
• To find words whose co-occurrence patterns best distinguish
between two words.
• For example, in computational lexicography we may want to find the
words that best differentiate the meanings of strong and powerful.
• The t-test is extended to the comparison of the means of two normal
populations:
t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2)
• Here the null hypothesis is that the average difference is 0 (µ1 − µ2 = 0).
• In the denominator we add the variances of the two populations,
since the variance of the difference of two independent random
variables is the sum of their individual variances.
Slide 22
Hypothesis testing of differences
Words that co-occur significantly more frequently with powerful (top
rows) and with strong (bottom rows)
t C(w) C(strong w) C(powerful w) Word
3.16 933 0 10 computers
2.82 2337 0 8 computer
2.44 289 0 6 symbol
2.44 588 0 5 Germany
2.23 3745 0 5 nation
7.07 3685 50 0 support
6.32 3616 58 7 enough
4.69 986 22 0 safety
4.58 3741 21 0 sales
4.02 1093 19 1 opposition
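Because the counts are tiny relative to the corpus size, the two sample means are c1/N and c2/N and each variance is approximately the corresponding mean, so N cancels and the statistic reduces to (c1 − c2)/√(c1 + c2). A minimal sketch under that approximation reproduces the t values in the table:

```python
from math import sqrt

def t_diff(c1, c2):
    """Approximate t for comparing co-occurrence counts c1 and c2.

    Means are c1/N and c2/N; for small probabilities each variance is
    approximately the mean, so the corpus size N cancels out.
    """
    return (c1 - c2) / sqrt(c1 + c2)

# Counts taken from the table rows (C of the word with powerful vs. strong):
print(round(t_diff(10, 0), 2))   # computers
print(round(t_diff(50, 0), 2))   # support
print(round(t_diff(22, 0), 2))   # safety
```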
Slide 23
Pearson’s chi-square test
• t-test assumes that probabilities are approximately normally
distributed, which is not true in general. The χ2
test doesn’t make
this assumption.
• the essence of the χ2
test is to compare the observed frequencies
with the frequencies expected for independence
– if the difference between observed and expected frequencies is large,
then we can reject the null hypothesis of independence.
• Relies on co-occurrence table, and computes
Slide 24
χ² Test: Example
The χ² statistic sums the differences between observed and expected
values in all cells of the table, scaled by the magnitude of the
expected values, as follows:
χ² = Σi Σj (Oij − Eij)² / Eij
where i ranges over rows of the table, j ranges over columns, Oij is
the observed value and Eij the expected value for cell (i, j).
Slide 25
χ² Test: Example
• Observed values O are given in the table
– E.g. O(1,1) = 8
• Expected values E are determined from marginal probabilities:
– E.g. the E value for cell (1,1) = new companies is the expected
frequency for this bigram, determined by multiplying:
• probability of new in the first position of a bigram
• probability of companies in the second position of a bigram
• total number of bigrams
– E(1,1) = (8+15820)/N × (8+4667)/N × N ≈ 5.2
• χ² is then determined as 1.55
• Look up the significance table:
χ² = 3.84 for probability level α = 0.05
– 1.55 < 3.84
– we cannot reject the null hypothesis → new companies is not a collocation
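The full 2×2 computation can be written out directly. The three missing cells of the table are derived from the marginal counts given earlier (15828 occurrences of new, 4675 of companies, 14307668 bigrams):

```python
# 2x2 contingency table for the bigram "new companies"
N = 14_307_668
O11 = 8                          # new companies
O12 = 15_828 - 8                 # new, followed by something other than companies
O21 = 4_675 - 8                  # companies preceded by something other than new
O22 = N - O11 - O12 - O21        # neither

obs = [[O11, O12], [O21, O22]]
row = [O11 + O12, O21 + O22]     # row totals
col = [O11 + O21, O12 + O22]     # column totals

chi2 = 0.0
for i in range(2):
    for j in range(2):
        E = row[i] * col[j] / N  # expected count under independence
        chi2 += (obs[i][j] - E) ** 2 / E

print(round(chi2, 2))            # about 1.55, below the 3.84 cutoff at alpha = 0.05
```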
Slide 26
Pointwise Mutual Information
• An information-theoretically motivated measure for discovering
interesting collocations is pointwise mutual information (Church et
al. 1989, 1991; Hindle 1990):
I(x, y) = log2 [ P(x, y) / (P(x) P(y)) ]
• It is roughly a measure of how much one word tells us about the
other.
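Applied to the same new companies counts used in the t-test and χ² examples, pointwise mutual information comes out to a fraction of a bit, again suggesting only a weak association:

```python
from math import log2

N = 14_307_668
c_new, c_companies, c_bigram = 15_828, 4_675, 8

# I(x, y) = log2( P(x, y) / (P(x) P(y)) )
p_xy = c_bigram / N
p_x, p_y = c_new / N, c_companies / N
pmi = log2(p_xy / (p_x * p_y))
print(round(pmi, 2))   # about 0.63 bits above what independence predicts
```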
Slide 27
Problems with using Mutual
Information
• Decrease in uncertainty is not always a good measure of an
interesting correspondence between two events.
• It is a bad measure of dependence.
• Particularly bad with sparse data.
More Related Content

What's hot

Semantic and Pragmatic
Semantic and PragmaticSemantic and Pragmatic
Semantic and Pragmatic
almirasumangca
 
Meaning
MeaningMeaning
Meaning
Ika Hentihu
 
Optimality theory.pptx
Optimality theory.pptxOptimality theory.pptx
Optimality theory.pptx
amjadnaasir
 
Lexical semantics
Lexical semanticsLexical semantics
Lexical semantics
Nila27
 
Sentence meaning is different from speaker’s meaning
Sentence meaning is different from speaker’s meaningSentence meaning is different from speaker’s meaning
Sentence meaning is different from speaker’s meaning
Hifza Kiyani
 
Syntax feature theory.. good
Syntax feature theory.. goodSyntax feature theory.. good
Syntax feature theory.. goodHina Honey
 
Systemic functional linguistics and metafunctions of language
Systemic functional linguistics and metafunctions of languageSystemic functional linguistics and metafunctions of language
Systemic functional linguistics and metafunctions of language
LearningandTeaching
 
Deep structure and surface structure
Deep structure and surface structureDeep structure and surface structure
Deep structure and surface structure
Asif Ali Raza
 
Phrase Structure Grammar
Phrase Structure GrammarPhrase Structure Grammar
Phrase Structure Grammar
Anusha Das
 
Pragmatics
Pragmatics Pragmatics
Pragmatics detailed lecture
Pragmatics detailed lecturePragmatics detailed lecture
Pragmatics detailed lecture
Arooba Dev
 
Heads in Syntax
Heads in SyntaxHeads in Syntax
Heads in Syntax
Komal644
 
Discourse analysis (Schmitt's book chapter 4)
Discourse analysis (Schmitt's book chapter 4)Discourse analysis (Schmitt's book chapter 4)
Discourse analysis (Schmitt's book chapter 4)Samira Rahmdel
 
Coherence, cohesion and deixis
Coherence, cohesion and deixisCoherence, cohesion and deixis
Coherence, cohesion and deixis
Marizka Parapat
 
Sociolinguistics.pptx
Sociolinguistics.pptxSociolinguistics.pptx
Sociolinguistics.pptx
YolandaFellecia
 
Pragmatics implicature
Pragmatics implicaturePragmatics implicature
Pragmatics implicaturephannguyen161
 
Accommodation theory
Accommodation theoryAccommodation theory
Accommodation theoryjkmurton
 
Two Views of Discourse Structure: As a Product and As a Process
Two Views of Discourse Structure: As a Product and As a ProcessTwo Views of Discourse Structure: As a Product and As a Process
Two Views of Discourse Structure: As a Product and As a Process
CRISALDO CORDURA
 
2. Halliday's+Theory.ppt
2. Halliday's+Theory.ppt2. Halliday's+Theory.ppt
2. Halliday's+Theory.ppt
AmaliaRahmaFirdaus
 

What's hot (20)

Semantic and Pragmatic
Semantic and PragmaticSemantic and Pragmatic
Semantic and Pragmatic
 
Meaning
MeaningMeaning
Meaning
 
Optimality theory.pptx
Optimality theory.pptxOptimality theory.pptx
Optimality theory.pptx
 
Lexical semantics
Lexical semanticsLexical semantics
Lexical semantics
 
Sentence meaning is different from speaker’s meaning
Sentence meaning is different from speaker’s meaningSentence meaning is different from speaker’s meaning
Sentence meaning is different from speaker’s meaning
 
Syntax feature theory.. good
Syntax feature theory.. goodSyntax feature theory.. good
Syntax feature theory.. good
 
Systemic functional linguistics and metafunctions of language
Systemic functional linguistics and metafunctions of languageSystemic functional linguistics and metafunctions of language
Systemic functional linguistics and metafunctions of language
 
Deep structure and surface structure
Deep structure and surface structureDeep structure and surface structure
Deep structure and surface structure
 
Phrase Structure Grammar
Phrase Structure GrammarPhrase Structure Grammar
Phrase Structure Grammar
 
Pragmatics
Pragmatics Pragmatics
Pragmatics
 
Reference and sense
Reference and senseReference and sense
Reference and sense
 
Pragmatics detailed lecture
Pragmatics detailed lecturePragmatics detailed lecture
Pragmatics detailed lecture
 
Heads in Syntax
Heads in SyntaxHeads in Syntax
Heads in Syntax
 
Discourse analysis (Schmitt's book chapter 4)
Discourse analysis (Schmitt's book chapter 4)Discourse analysis (Schmitt's book chapter 4)
Discourse analysis (Schmitt's book chapter 4)
 
Coherence, cohesion and deixis
Coherence, cohesion and deixisCoherence, cohesion and deixis
Coherence, cohesion and deixis
 
Sociolinguistics.pptx
Sociolinguistics.pptxSociolinguistics.pptx
Sociolinguistics.pptx
 
Pragmatics implicature
Pragmatics implicaturePragmatics implicature
Pragmatics implicature
 
Accommodation theory
Accommodation theoryAccommodation theory
Accommodation theory
 
Two Views of Discourse Structure: As a Product and As a Process
Two Views of Discourse Structure: As a Product and As a ProcessTwo Views of Discourse Structure: As a Product and As a Process
Two Views of Discourse Structure: As a Product and As a Process
 
2. Halliday's+Theory.ppt
2. Halliday's+Theory.ppt2. Halliday's+Theory.ppt
2. Halliday's+Theory.ppt
 

Viewers also liked

Why and how to teach collocations
Why and how to teach collocationsWhy and how to teach collocations
Why and how to teach collocationsYicel Cermeño
 
Semantics, ku
Semantics, kuSemantics, ku
Semantics, ku
violita99
 
U7 collocations related to computers - activity-4 to-1d
U7  collocations related to computers - activity-4 to-1dU7  collocations related to computers - activity-4 to-1d
U7 collocations related to computers - activity-4 to-1d
Anabel Milagros Montes Miranda
 
Weather Collocations
Weather CollocationsWeather Collocations
Weather CollocationsMatt Drew
 
Counts, comparisons, collocations, contestations: Towards a dictionary of the...
Counts, comparisons, collocations, contestations: Towards a dictionary of the...Counts, comparisons, collocations, contestations: Towards a dictionary of the...
Counts, comparisons, collocations, contestations: Towards a dictionary of the...
Idibon1
 
Collocations
CollocationsCollocations
Collocations
Mercè Ballabriga
 
Verb collocations
Verb collocationsVerb collocations
Verb collocations
AshleyBest
 
Verb noun collocations related
Verb noun collocations relatedVerb noun collocations related
Verb noun collocations related
Anabel Milagros Montes Miranda
 
Collocation by mahmoud abu qarmoul
Collocation by mahmoud abu qarmoulCollocation by mahmoud abu qarmoul
Collocation by mahmoud abu qarmoul
Mahmoud Qarmoul
 
Collocations 01
Collocations 01Collocations 01
Collocations 01dafarnum
 
Collocation
CollocationCollocation
Collocation
Buhsra
 
verb noun collocations
verb noun collocationsverb noun collocations
verb noun collocations
Tara Lockhart
 
Make / Do Collocations
Make / Do CollocationsMake / Do Collocations
Make / Do Collocations
antdela
 
Collocations
CollocationsCollocations
Collocations
Leticia Portugal
 
Collocation in use
Collocation in useCollocation in use
Collocation in use
Ahmad Zakki Mualana
 
Verb noun collocations
Verb noun collocationsVerb noun collocations
Verb noun collocations
Sandy Millin
 

Viewers also liked (20)

Why and how to teach collocations
Why and how to teach collocationsWhy and how to teach collocations
Why and how to teach collocations
 
Semantics, ku
Semantics, kuSemantics, ku
Semantics, ku
 
U7 collocations related to computers - activity-4 to-1d
U7  collocations related to computers - activity-4 to-1dU7  collocations related to computers - activity-4 to-1d
U7 collocations related to computers - activity-4 to-1d
 
Weather Collocations
Weather CollocationsWeather Collocations
Weather Collocations
 
Conversational collocations 5
Conversational collocations 5Conversational collocations 5
Conversational collocations 5
 
Counts, comparisons, collocations, contestations: Towards a dictionary of the...
Counts, comparisons, collocations, contestations: Towards a dictionary of the...Counts, comparisons, collocations, contestations: Towards a dictionary of the...
Counts, comparisons, collocations, contestations: Towards a dictionary of the...
 
Collocations
CollocationsCollocations
Collocations
 
Basic collocations in English.
Basic collocations in English.Basic collocations in English.
Basic collocations in English.
 
Collocations and idioms
Collocations and  idiomsCollocations and  idioms
Collocations and idioms
 
Verb collocations
Verb collocationsVerb collocations
Verb collocations
 
Verb noun collocations related
Verb noun collocations relatedVerb noun collocations related
Verb noun collocations related
 
Collocation by mahmoud abu qarmoul
Collocation by mahmoud abu qarmoulCollocation by mahmoud abu qarmoul
Collocation by mahmoud abu qarmoul
 
Collocations life events
Collocations life eventsCollocations life events
Collocations life events
 
Collocations 01
Collocations 01Collocations 01
Collocations 01
 
Collocation
CollocationCollocation
Collocation
 
verb noun collocations
verb noun collocationsverb noun collocations
verb noun collocations
 
Make / Do Collocations
Make / Do CollocationsMake / Do Collocations
Make / Do Collocations
 
Collocations
CollocationsCollocations
Collocations
 
Collocation in use
Collocation in useCollocation in use
Collocation in use
 
Verb noun collocations
Verb noun collocationsVerb noun collocations
Verb noun collocations
 

Similar to Collocations 5

35878 Topic Discussion5Number of Pages 1 (Double Spaced).docx
35878 Topic Discussion5Number of Pages 1 (Double Spaced).docx35878 Topic Discussion5Number of Pages 1 (Double Spaced).docx
35878 Topic Discussion5Number of Pages 1 (Double Spaced).docx
rhetttrevannion
 
Ilja state2014expressivity
Ilja state2014expressivityIlja state2014expressivity
Ilja state2014expressivitymaartenmarx
 
Module-2_Notes-with-Example for data science
Module-2_Notes-with-Example for data scienceModule-2_Notes-with-Example for data science
Module-2_Notes-with-Example for data science
pujashri1975
 
Random Variable & Probability Distribution 1.pptx
Random Variable & Probability Distribution 1.pptxRandom Variable & Probability Distribution 1.pptx
Random Variable & Probability Distribution 1.pptx
JAYARSOCIAS3
 
Fuzzy mathematics:An application oriented introduction
Fuzzy mathematics:An application oriented introductionFuzzy mathematics:An application oriented introduction
Fuzzy mathematics:An application oriented introduction
Nagasuri Bala Venkateswarlu
 
Probability distribution Function & Decision Trees in machine learning
Probability distribution Function  & Decision Trees in machine learningProbability distribution Function  & Decision Trees in machine learning
Probability distribution Function & Decision Trees in machine learning
Sadia Zafar
 
Hypothesis
HypothesisHypothesis
Hypothesis
LIVING KIMARIO
 
2 UNIT-DSP.pptx
2 UNIT-DSP.pptx2 UNIT-DSP.pptx
2 UNIT-DSP.pptx
PothyeswariPothyes
 
Logic
LogicLogic
Day 3 SPSS
Day 3 SPSSDay 3 SPSS
Day 3 SPSS
abir hossain
 
35880 Topic Discussion7Number of Pages 1 (Double Spaced).docx
35880 Topic Discussion7Number of Pages 1 (Double Spaced).docx35880 Topic Discussion7Number of Pages 1 (Double Spaced).docx
35880 Topic Discussion7Number of Pages 1 (Double Spaced).docx
domenicacullison
 
AI NOTES ppt 4.pdf
AI NOTES ppt 4.pdfAI NOTES ppt 4.pdf
AI NOTES ppt 4.pdf
ARMANVERMA7
 
35881 DiscussionNumber of Pages 1 (Double Spaced)Number o.docx
35881 DiscussionNumber of Pages 1 (Double Spaced)Number o.docx35881 DiscussionNumber of Pages 1 (Double Spaced)Number o.docx
35881 DiscussionNumber of Pages 1 (Double Spaced)Number o.docx
rhetttrevannion
 
Introduction to probabilities and radom variables
Introduction to probabilities and radom variablesIntroduction to probabilities and radom variables
Introduction to probabilities and radom variables
mohammedderriche2
 
Probability concepts and the normal distribution
Probability concepts and the normal distributionProbability concepts and the normal distribution
Probability concepts and the normal distributionHarve Abella
 
AI_Probability.pptx
AI_Probability.pptxAI_Probability.pptx
AI_Probability.pptx
ssuserc8e745
 
template.pptx
template.pptxtemplate.pptx
template.pptx
uzmasulthana3
 
Chapter 2 Simple Linear Regression Model.pptx
Chapter 2 Simple Linear Regression Model.pptxChapter 2 Simple Linear Regression Model.pptx
Chapter 2 Simple Linear Regression Model.pptx
aschalew shiferaw
 

Similar to Collocations 5 (20)

35878 Topic Discussion5Number of Pages 1 (Double Spaced).docx
35878 Topic Discussion5Number of Pages 1 (Double Spaced).docx35878 Topic Discussion5Number of Pages 1 (Double Spaced).docx
35878 Topic Discussion5Number of Pages 1 (Double Spaced).docx
 
Ilja state2014expressivity
Ilja state2014expressivityIlja state2014expressivity
Ilja state2014expressivity
 
Module-2_Notes-with-Example for data science
Module-2_Notes-with-Example for data scienceModule-2_Notes-with-Example for data science
Module-2_Notes-with-Example for data science
 
Random Variable & Probability Distribution 1.pptx
Random Variable & Probability Distribution 1.pptxRandom Variable & Probability Distribution 1.pptx
Random Variable & Probability Distribution 1.pptx
 
Fuzzy mathematics:An application oriented introduction
Fuzzy mathematics:An application oriented introductionFuzzy mathematics:An application oriented introduction
Fuzzy mathematics:An application oriented introduction
 
Probability distribution Function & Decision Trees in machine learning
Probability distribution Function  & Decision Trees in machine learningProbability distribution Function  & Decision Trees in machine learning
Probability distribution Function & Decision Trees in machine learning
 
Hypothesis
HypothesisHypothesis
Hypothesis
 
2 UNIT-DSP.pptx
2 UNIT-DSP.pptx2 UNIT-DSP.pptx
2 UNIT-DSP.pptx
 
Logic
LogicLogic
Logic
 
Day 3 SPSS
Day 3 SPSSDay 3 SPSS
Day 3 SPSS
 
35880 Topic Discussion7Number of Pages 1 (Double Spaced).docx
35880 Topic Discussion7Number of Pages 1 (Double Spaced).docx35880 Topic Discussion7Number of Pages 1 (Double Spaced).docx
35880 Topic Discussion7Number of Pages 1 (Double Spaced).docx
 
AI NOTES ppt 4.pdf
AI NOTES ppt 4.pdfAI NOTES ppt 4.pdf
AI NOTES ppt 4.pdf
 
35881 DiscussionNumber of Pages 1 (Double Spaced)Number o.docx
35881 DiscussionNumber of Pages 1 (Double Spaced)Number o.docx35881 DiscussionNumber of Pages 1 (Double Spaced)Number o.docx
35881 DiscussionNumber of Pages 1 (Double Spaced)Number o.docx
 
Statistics
StatisticsStatistics
Statistics
 
Introduction to probabilities and radom variables
Introduction to probabilities and radom variablesIntroduction to probabilities and radom variables
Introduction to probabilities and radom variables
 
Probability concepts and the normal distribution
Probability concepts and the normal distributionProbability concepts and the normal distribution
Probability concepts and the normal distribution
 
AI_Probability.pptx
AI_Probability.pptxAI_Probability.pptx
AI_Probability.pptx
 
template.pptx
template.pptxtemplate.pptx
template.pptx
 
Statistics
StatisticsStatistics
Statistics
 
Chapter 2 Simple Linear Regression Model.pptx
Chapter 2 Simple Linear Regression Model.pptxChapter 2 Simple Linear Regression Model.pptx
Chapter 2 Simple Linear Regression Model.pptx
 

Recently uploaded

GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
Nguyen Thanh Tu Collection
 
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
Tamralipta Mahavidyalaya
 
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup   New Member Orientation and Q&A (May 2024).pdfWelcome to TechSoup   New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
TechSoup
 
Additional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdfAdditional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdf
joachimlavalley1
 
Ethnobotany and Ethnopharmacology ......
Ethnobotany and Ethnopharmacology ......Ethnobotany and Ethnopharmacology ......
Ethnobotany and Ethnopharmacology ......
Ashokrao Mane college of Pharmacy Peth-Vadgaon
 
Chapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptxChapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptx
Mohd Adib Abd Muin, Senior Lecturer at Universiti Utara Malaysia
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
siemaillard
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
EugeneSaldivar
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
siemaillard
 
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptxMARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
bennyroshan06
 
Fish and Chips - have they had their chips
Fish and Chips - have they had their chipsFish and Chips - have they had their chips
Fish and Chips - have they had their chips
GeoBlogs
 
Model Attribute Check Company Auto Property
Model Attribute  Check Company Auto PropertyModel Attribute  Check Company Auto Property
Model Attribute Check Company Auto Property
Celine George
 
PART A. Introduction to Costumer Service
PART A. Introduction to Costumer ServicePART A. Introduction to Costumer Service
PART A. Introduction to Costumer Service
PedroFerreira53928
 
special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
Special education needs
 
Introduction to Quality Improvement Essentials
Introduction to Quality Improvement EssentialsIntroduction to Quality Improvement Essentials
Introduction to Quality Improvement Essentials
Excellence Foundation for South Sudan
 
ESC Beyond Borders _From EU to You_ InfoPack general.pdf
ESC Beyond Borders _From EU to You_ InfoPack general.pdfESC Beyond Borders _From EU to You_ InfoPack general.pdf
ESC Beyond Borders _From EU to You_ InfoPack general.pdf
Fundacja Rozwoju Społeczeństwa Przedsiębiorczego
 
2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...
Sandy Millin
 
Polish students' mobility in the Czech Republic
Polish students' mobility in the Czech RepublicPolish students' mobility in the Czech Republic
Polish students' mobility in the Czech Republic
Anna Sz.
 
How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17
Celine George
 
How to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERPHow to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERP
Celine George
 

Recently uploaded (20)

GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
 
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
 
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup   New Member Orientation and Q&A (May 2024).pdfWelcome to TechSoup   New Member Orientation and Q&A (May 2024).pdf

Collocations

  • 1. Collocations Reading: Chap 5, Manning & Schütze (note: this chapter is available online from the book’s page http://nlp.stanford.edu/fsnlp/promo) Instructor: Rada Mihalcea
  • 2. Slide 2 What is a Collocation? • A COLLOCATION is an expression consisting of two or more words that correspond to some conventional way of saying things. • Together, the words can mean more than the sum of their parts (The Times of India, disk drive) – Previous examples: hot dog, mother in law • Examples of collocations – noun phrases like strong tea and weapons of mass destruction – phrasal verbs like to make up, and other phrases like the rich and powerful. • Valid or invalid? – a stiff breeze but not a stiff wind (while either a strong breeze or a strong wind is okay). – broad daylight (but not bright daylight or narrow darkness).
  • 3. Slide 3 Criteria for Collocations • Typical criteria for collocations: – non-compositionality – non-substitutability – non-modifiability. • Collocations usually cannot be translated into other languages word by word. • A phrase can be a collocation even if it is not consecutive (as in the example knock . . . door).
  • 4. Slide 4 Non-Compositionality • A phrase is compositional if its meaning can be predicted from the meaning of its parts. – E.g. new companies • A phrase is non-compositional if its meaning cannot be predicted from the meaning of its parts – E.g. hot dog • Collocations are not necessarily fully compositional in that there is usually an element of meaning added to the combination. E.g. strong tea. • Idioms are the most extreme examples of non-compositionality. E.g. to hear it through the grapevine.
  • 5. Slide 5 Non-Substitutability • We cannot substitute near-synonyms for the components of a collocation. • For example – We can’t say yellow wine instead of white wine even though yellow is as good a description of the color of white wine as white is (it is kind of a yellowish white). • Many collocations cannot be freely modified with additional lexical material or through grammatical transformations (non-modifiability). – E.g. white wine, but not whiter wine – mother in law, but not mother in laws
  • 6. Slide 6 Linguistic Subclasses of Collocations • Light verbs: – Verbs with little semantic content like make, take and do. – E.g. make lunch, take it easy • Verb particle constructions – E.g. to go down • Proper nouns – E.g. Bill Clinton • Terminological expressions refer to concepts and objects in technical domains. – E.g. hydraulic oil filter
  • 7. Slide 7 Principal Approaches to Finding Collocations How to automatically identify collocations in text? • Simplest method: Selection of collocations by frequency • Selection based on mean and variance of the distance between focal word and collocating word • Hypothesis testing • Mutual information
  • 8. Slide 8 Frequency • Find collocations by counting the number of occurrences. • Need also to define a maximum size window • Usually results in a lot of function word pairs that need to be filtered out. • Fix: pass the candidate phrases through a part-of-speech filter which only lets through those patterns that are likely to be “phrases”. (Justeson and Katz, 1995)
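The counting-plus-filtering idea can be sketched in a few lines of Python. This is a toy illustration (a made-up mini corpus with simplified tags, A = adjective, N = noun, standing in for the Justeson and Katz two-word patterns), not the original experiment:

```python
from collections import Counter

# Toy tagged corpus: (word, simplified POS tag) pairs.
# Tags: A = adjective, N = noun, V = verb, P = preposition, DET = determiner.
tagged = [("strong", "A"), ("tea", "N"), ("is", "V"), ("a", "DET"),
          ("cup", "N"), ("of", "P"), ("the", "DET"), ("strong", "A"),
          ("tea", "N"), ("with", "P"), ("mass", "N"), ("destruction", "N")]

# Two-word tag patterns (after Justeson and Katz): adjective-noun, noun-noun.
PATTERNS = {("A", "N"), ("N", "N")}

counts = Counter()
for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
    if (t1, t2) in PATTERNS:          # the part-of-speech filter
        counts[(w1, w2)] += 1

# Function-word bigrams like ("of", "the") never make it into the counts.
print(counts.most_common())  # [(('strong', 'tea'), 2), (('mass', 'destruction'), 1)]
```

The same idea scales to a real corpus by swapping in an actual tagger and a full tag set.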
  • 9. Slide 9 Most frequent bigrams in an Example Corpus Except for New York, all the bigrams are pairs of function words.
  • 10. Slide 10 Part of speech tag patterns for collocation filtering (Justeson and Katz).
  • 11. Slide 11 The most highly ranked phrases after applying the filter on the same corpus as before.
  • 12. Slide 12 Collocational Window Many collocations occur at variable distances, so a collocational window needs to be defined to locate them; a purely consecutive-bigram frequency approach can’t be used. Examples: she knocked on his door; they knocked at the door; 100 women knocked on Donaldson’s door; a man knocked on the metal front door
  • 13. Slide 13 Mean and Variance • The mean µ is the average offset between two words in the corpus: µ = (1/n) Σi di • The variance: s² = Σi (di − µ)² / (n − 1) – where n is the number of times the two words co-occur, di is the offset for co-occurrence i, and µ is the mean. • Mean and variance characterize the distribution of distances between two words in a corpus. – High variance means that co-occurrence is mostly by chance – Low variance means that the two words usually occur at about the same distance.
  • 14. Slide 14 Mean and Variance: An Example For the knock, door example sentences the offsets are 3, 3, 5, 5, so the mean is: µ = (3 + 3 + 5 + 5) / 4 = 4.0 And the sample deviation: s = sqrt(((3 − 4)² + (3 − 4)² + (5 − 4)² + (5 − 4)²) / 3) ≈ 1.15
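These numbers can be checked directly with Python’s statistics module (sample statistics, i.e. n − 1 in the denominator):

```python
import statistics

# Offsets of "door" relative to "knock" in the four example sentences
offsets = [3, 3, 5, 5]

mu = statistics.mean(offsets)   # average offset
s = statistics.stdev(offsets)   # sample standard deviation (n - 1 denominator)

print(mu, round(s, 2))  # 4 1.15
```

The low deviation relative to the mean is what flags knock … door as a candidate collocation.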
  • 15. Slide 15 Finding collocations based on mean and variance
  • 16. Slide 16 Ruling out Chance • Two words can co-occur by chance. – High frequency and low variance can be accidental • Hypothesis testing measures the confidence that this co-occurrence was really due to association, and not just due to chance. • Formulate a null hypothesis H0 that there is no association between the words beyond chance occurrences. • The null hypothesis states what should be true if two words do not form a collocation. • If the null hypothesis can be rejected, then the two words do not co-occur merely by chance, and they form a collocation • Compute the probability p that the event would occur if H0 were true, and then reject H0 if p is too low (typically if beneath a significance level of p < 0.05, 0.01, 0.005, or 0.001) and retain H0 as possible otherwise.
  • 17. Slide 17 The t-Test • The t-test looks at the mean and variance of a sample of measurements, where the null hypothesis is that the sample is drawn from a distribution with mean µ. • The test looks at the difference between the observed and expected means, scaled by the variance of the data, and tells us how likely one is to get a sample of that mean and variance, assuming that the sample is drawn from a normal distribution with mean µ: t = (x̄ − µ) / sqrt(s² / N) where x̄ is the sample mean, s² is the sample variance, N is the sample size, and µ is the mean of the distribution.
  • 18. Slide 18 t-Test for finding collocations • Think of the text corpus as a long sequence of N bigrams, and the samples are then indicator random variables with: – value 1 when the bigram of interest occurs, – 0 otherwise. • The t-test and other statistical tests are useful as methods for ranking collocations. • Step 1: Determine the expected mean • Step 2: Measure the observed mean • Step 3: Run the t-test
  • 19. Slide 19 t-Test: Example • In our corpus, new occurs 15,828 times, companies 4,675 times, and there are 14,307,668 tokens overall. • new companies occurs 8 times among the 14,307,668 bigrams H0: P(new companies) = P(new)P(companies)
  • 20. Slide 20 t-Test example • For this distribution µ = 3.615 × 10-7, the observed mean is x̄ = 8/14,307,668 ≈ 5.591 × 10-7, and s² = p(1 − p) ≈ p since p is small • The t value of 0.999932 is not larger than 2.576, the critical value for α = 0.005. So we cannot reject the null hypothesis that new and companies occur independently and do not form a collocation.
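The slide’s figure can be reproduced from the counts on slide 19. A sketch of the calculation, using s² = p(1 − p) as above:

```python
import math

N = 14_307_668                         # total bigrams in the corpus
c_new, c_companies, c_bigram = 15_828, 4_675, 8

mu = (c_new / N) * (c_companies / N)   # expected P(new companies) under H0
x_bar = c_bigram / N                   # observed P(new companies)
s2 = x_bar * (1 - x_bar)               # Bernoulli variance, ~ x_bar for small p

t = (x_bar - mu) / math.sqrt(s2 / N)
print(round(t, 6))                     # 0.999932 < 2.576: H0 is not rejected
```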
  • 21. Slide 21 Hypothesis testing of differences (Church and Hanks, 1989) • To find words whose co-occurrence patterns best distinguish between two words. • For example, in computational lexicography we may want to find the words that best differentiate the meanings of strong and powerful. • The t-test is extended to the comparison of the means of two normal populations. • Here the null hypothesis is that the average difference is 0 (µ = 0). • In the denominator we add the variances of the two populations, since the variance of the difference of two random variables is the sum of their individual variances.
  • 22. Slide 22 Hypothesis testing of differences Words that co-occur significantly more frequently with powerful (top half) and with strong (bottom half):

        t     C(w)   C(strong w)   C(powerful w)   word
      3.16     933        0             10         computers
      2.82    2337        0              8         computer
      2.44     289        0              6         symbol
      2.44     588        0              5         Germany
      2.23    3745        0              5         nation
      7.07    3685       50              0         support
      6.32    3616       58              7         enough
      4.69     986       22              0         safety
      4.58    3741       21              0         sales
      4.02    1093       19              1         opposition
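The t column can be reproduced with the simplification Manning & Schütze use for rare events: the difference-of-means statistic reduces to approximately |c1 − c2| / sqrt(c1 + c2), where c1 and c2 are the two co-occurrence counts (e.g. 10/√10 ≈ 3.16 for computers). A sketch over a few rows of the table:

```python
import math

# A few rows from the table: (word, C(strong w), C(powerful w))
rows = [("computers", 0, 10), ("support", 50, 0), ("enough", 58, 7)]

# For rare events the difference-of-means t statistic reduces to
# |c1 - c2| / sqrt(c1 + c2)  -- the approximation behind the t column
t_vals = {w: abs(c1 - c2) / math.sqrt(c1 + c2) for w, c1, c2 in rows}

for w, t in t_vals.items():
    print(w, round(t, 2))   # computers 3.16, support 7.07, enough 6.33
```

(The computed value for enough is 6.3258; the table shows it as 6.32.)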
  • 23. Slide 23 Pearson’s chi-square test • The t-test assumes that probabilities are approximately normally distributed, which is not true in general. The χ2 test doesn’t make this assumption. • The essence of the χ2 test is to compare the observed frequencies with the frequencies expected for independence – if the difference between observed and expected frequencies is large, then we can reject the null hypothesis of independence. • Relies on a table of co-occurrence counts, and computes the χ2 statistic from it.
  • 24. Slide 24 χ2 Test: Example The χ2 statistic sums the differences between observed and expected values in all squares of the table, scaled by the magnitude of the expected values, as follows: χ2 = Σij (Oij − Eij)² / Eij where i ranges over rows of the table, j ranges over columns, Oij is the observed value for cell (i, j), and Eij is the expected value.
  • 25. Slide 25 χ2 Test: Example • Observed values O are given in the table – E.g. O(1,1) = 8 • Expected values E are determined from marginal probabilities: – E.g. the E value for cell (1,1) = new companies is the expected frequency for this bigram, determined by multiplying: • probability of new on first position of a bigram • probability of companies on second position of a bigram • total number of bigrams – E(1,1) = (8+15820)/N * (8+4667)/N * N ≈ 5.2 • χ2 is then determined as 1.55 • Look up significance table: χ2 = 3.84 for probability level of α = 0.05 – 1.55 < 3.84 – we cannot reject the null hypothesis → new companies is not a collocation
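The χ2 ≈ 1.55 value can be reproduced from the 2 × 2 table of counts (cell counts as on this slide; the fourth cell is derived from the corpus total):

```python
N = 14_307_668
O11, O12, O21 = 8, 15_820, 4_667   # new+companies, new only, companies only
O22 = N - O11 - O12 - O21          # bigrams containing neither word

O = [[O11, O12], [O21, O22]]
row = [O11 + O12, O21 + O22]       # marginal counts: new / not-new
col = [O11 + O21, O12 + O22]       # marginal counts: companies / not-companies

chi2 = 0.0
for i in range(2):
    for j in range(2):
        E = row[i] * col[j] / N    # expected count under independence
        chi2 += (O[i][j] - E) ** 2 / E

print(round(chi2, 2))              # 1.55 < 3.84: independence is not rejected
```

Almost all of the statistic comes from the (1,1) cell, where (8 − 5.2)² / 5.2 ≈ 1.5; the other cells contribute very little because their expected counts are large.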
  • 26. Slide 26 Pointwise Mutual Information • An information-theoretically motivated measure for discovering interesting collocations is pointwise mutual information (Church et al. 1989, 1991; Hindle 1990): I(x, y) = log2 [ P(x, y) / (P(x) P(y)) ] • It is roughly a measure of how much one word tells us about the other.
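Applied to the running new companies example (counts from slide 19, with maximum-likelihood probability estimates), PMI comes out to roughly 0.63 bits; this figure is computed here, not taken from the slides:

```python
import math

N = 14_307_668
c_x, c_y, c_xy = 15_828, 4_675, 8   # new, companies, new companies

p_x, p_y, p_xy = c_x / N, c_y / N, c_xy / N

# Pointwise mutual information in bits
pmi = math.log2(p_xy / (p_x * p_y))
print(round(pmi, 2))                # 0.63
```

A value near 0 means observing one word tells us almost nothing about the other, consistent with the earlier conclusion that new companies is not a collocation.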
  • 27. Slide 27 Problems with using Mutual Information • Decrease in uncertainty is not always a good measure of an interesting correspondence between two events. • It is a bad measure of dependence. • Particularly bad with sparse data.