Corpus Linguistics
Introduction to Corpus Linguistics
What is a corpus?
• A collection of words?
• Is it a theory or methodology of language?
Why use a corpus?
• Large amounts of data tell us about tendencies
and what’s normal or typical in real-life language
use
• Corpora also reveal instances of very rare or
exceptional cases, that we wouldn’t get from
looking at single texts or introspection.
• Human researchers are slow and make mistakes when
counting or searching by hand. Computers do this much
more quickly and consistently.
Criteria in building a corpus
1. It must be a large body of text.
2. It needs to be representative of language (or
a genre of language).
3. It must be in machine-readable form (e.g. .txt
files on a computer).
4. It acts as a standard reference about what’s
typical in language.
5. It is often annotated with additional linguistic
information – e.g. grammatical codes.
Annotation and mark-up
Corpus texts may be enriched with additional information to ease
analysis.
Note that this type of additional information may be called ‘mark-up’,
‘annotation’, or ‘tagging’. All three terms are near synonyms.
Annotation usually refers to linguistic information encoded in a corpus;
the encoding itself is achieved using a mark-up language.
Similarly, annotation is usually carried out by inserting so-called
tags (short codes that indicate some linguistic feature) into a
text. Hence, while the terms can be distinguished, they are also often
used interchangeably.
One final note: an XML closing tag uses a forward slash (as in </p>),
not a backslash.
Some untagged text
“Arrest warrant out for Clowes’ partner years
before collapse.”
By Daniel John
A WARRANT for the arrest of the former partner
of Mr Peter Clowes was issued seven years
before his Barlow Clowes investment empire
collapsed, according to evidence submitted to
the Parliamentary Ombudsman.
Add tags for headers and paragraphs
<head type=MAIN>
“Arrest warrant out for Clowes’ partner years before collapse.”
</head>
<head type=BYLINE>
By Daniel John
</head>
<p>
A WARRANT for the arrest of the former partner of Mr Peter Clowes was
issued seven years before his Barlow Clowes investment empire
collapsed, according to evidence submitted to the Parliamentary
Ombudsman.
</p>
Add sentence tags
<head type=MAIN>
<s n=001>“Arrest warrant out for Clowes’ partner years before
collapse.”
</head>
<head type=BYLINE>
<s n=002>By Daniel John
</head>
<p>
<s n=003>A WARRANT for the arrest of the former partner of Mr Peter
Clowes was issued seven years before his Barlow Clowes investment
empire collapsed, according to evidence submitted to the
Parliamentary Ombudsman.
</p>
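The sentence-numbering step above can be sketched in a few lines of Python. This is only an illustration: the function name add_sentence_tags is hypothetical, and the text is assumed to have been split into sentence units already, since real sentence splitting is harder than it looks.

```python
def add_sentence_tags(units, start=1):
    """Prefix each text unit with a numbered <s n=NNN> tag, as on the slide.

    Hypothetical helper: `units` is assumed to be a list of strings that
    have already been split into sentences.
    """
    return [f"<s n={i:03d}>{text}" for i, text in enumerate(units, start)]


units = [
    "Arrest warrant out for Clowes' partner years before collapse.",
    "By Daniel John",
]
for line in add_sentence_tags(units):
    print(line)
```

The zero-padded counter mirrors the n=001, n=002 numbering used in the example; the surrounding <head> and <p> tags would be added in a separate step.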
Change quotes to SGML entities
<head type=MAIN>
<s n=001>&bquo;Arrest warrant out for Clowes’ partner years before
collapse&equo;
</head>
<head type=BYLINE>
<s n=002>By Daniel John
</head>
<p>
<s n=003>A WARRANT for the arrest of the former partner of Mr Peter
Clowes was issued seven years before his Barlow Clowes investment
empire collapsed, according to evidence submitted to the
Parliamentary Ombudsman.
</p>
Add tags for punctuation
<head type=MAIN>
<s n=001><c PUQ>&bquo;Arrest warrant out for Clowes<c PUN>’ partner
years before collapse <c PUQ>&equo;
</head>
<head type=BYLINE>
<s n=002>By Daniel John
</head>
<p>
<s n=003>A WARRANT for the arrest of the former partner of Mr Peter
Clowes was issued seven years before his Barlow Clowes investment
empire collapsed, according to evidence submitted to the
Parliamentary Ombudsman<c PUN>.
</p>
Add grammatical codes to word units
<head type=MAIN>
<s n=001><c PUQ>&bquo;<w NN1>Arrest <w NN1>warrant <w AVP>out <w PRP>for
<w NP0>Clowes<c PUN>’ <w NN1>partner <w NN2>years <w PRP>before
<w NN1>collapse <c PUQ>&equo;<c PUN>.
</head>
<head type=BYLINE>
<s n=002><w PRP>By <w NP0>Daniel <w NP0>John
</head>
<p>
<s n=003><w AT1>A <w NN1>WARRANT <w PRP>for <w AT0>the <w NN1>arrest <w PRF>of
<w AT0>the <w DT0>former <w NN1>partner <w PRF>of <w NP0>Mr <w NP0>Peter
<w NP0>Clowes <w VBD>was <w VVN>issued <w CRD>seven <w NN2>years <w CJS>before
<w DPS>his <w NN1-NP0>Barlow <w NP0>Clowes <w NN1>investment <w NN1>empire
<w VVD>collapsed<c PUN>, <w PRP>according to <w NN1>evidence <w VVN>submitted
<w PRP>to <w AT0>the <w AJ0>Parliamentary <w NN1>Ombudsman<c PUN>.
</p>
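Once the grammatical codes are in place, the text can be read back by a program. The following is a minimal sketch, assuming the simplified <w TAG>word and <c TAG>punctuation layout shown on the slide; the regular expression and the name parse_tagged are illustrative and are not part of any BNC tool.

```python
import re
from collections import Counter

# Matches <w TAG>word or <c TAG>punctuation in the layout shown on the slide.
TOKEN = re.compile(r"<([wc]) ([A-Z0-9-]+)>([^<]*)")


def parse_tagged(text):
    """Return (element, tag, token) triples from the tagged text."""
    return [(el, tag, tok.strip()) for el, tag, tok in TOKEN.findall(text)]


sample = "<w PRP>By <w NP0>Daniel <w NP0>John"
tokens = parse_tagged(sample)
tag_counts = Counter(tag for _, tag, _ in tokens)

print(tokens)      # one triple per word or punctuation unit
print(tag_counts)  # how often each grammatical code occurs
```

Counting the codes in this way is what makes searches such as "all proper nouns" or "all past participles" possible.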
Types of Corpora
1. Specialised corpus – e.g.
• genre: the language of newspapers
• time: 2005 to the present day
• place: just texts published in China
2. General corpus – needs to be much larger. E.g. the British
National Corpus (BNC) has about 100 million words of
spoken and written British English.
Types of Corpora
3. Multilingual corpus – e.g. English and Spanish. Or American
English and Indian English. http://ice-corpora.net/ICE/INDEX.HTM
4. Parallel corpus – e.g. English and Spanish – exactly the
same texts translated. E.g. the CRATER corpus
http://catalog.elra.info/product_info.php?products_id=84
5. Learner corpus – language use created by people learning a
particular language. E.g. the International Corpus of
Learner English.
6. Historical or Diachronic corpus – e.g. Helsinki corpus – 1.5 million
words of texts from 700AD to 1700AD.
7. Monitor corpus – continually being added to. e.g. the Bank
of English
http://www.collins.co.uk/page/Wordbanks+Online
Frequency data, concordances and collocation
• Frequencies
Your query "wash" returned 2415 matches in
952 different texts (in 97,626,093 words; freq:
24.74 instances per million words)
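The normalised figure in the query result above can be checked with a short calculation. This is a sketch of the arithmetic only; the function name per_million is made up for illustration.

```python
def per_million(hits, corpus_size):
    """Normalise a raw frequency to instances per million words."""
    return hits / corpus_size * 1_000_000


# Figures taken from the "wash" query above.
freq = per_million(2415, 97_626_093)
print(f"{freq:.2f} instances per million words")
```

Normalising in this way allows frequencies to be compared across corpora of different sizes.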
Concordances aka Key Word In Context
Concordance (sorted at 1L)
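A concordance sorted at 1L is ordered by the first word to the left of the search word. The sketch below shows the idea on an invented sentence; the names kwic and sort_1l and the whitespace tokenisation are assumptions made for illustration, not how any particular concordancer works.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, node, right context) for each hit of keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    return lines


def sort_1l(lines):
    """Sort concordance lines by the word immediately left of the node."""
    return sorted(lines, key=lambda l: l[0][-1].lower() if l[0] else "")


# Invented example text, split on whitespace.
text = "you wash the car and I wash the dishes then we wash up"
tokens = text.split()
for left, node, right in sort_1l(kwic(tokens, "wash")):
    print(f"{' '.join(left):>20}  {node}  {' '.join(right)}")
```

Sorting on the left or right context groups similar patterns together, which makes recurring phrases easier to spot by eye.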
Collocations
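Collocation can be approximated, very roughly, by counting which words occur within a few words of the search word. The sketch below uses raw co-occurrence counts on an invented sentence; real tools normally apply a statistical association measure as well, which is not shown here. The function name collocates and the window size are assumptions.

```python
from collections import Counter


def collocates(tokens, node, window=2):
    """Count words occurring within `window` tokens either side of node."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts


# Invented example text, split on whitespace.
tokens = "wash your hands then wash your face and wash up".split()
counts = collocates(tokens, "wash")
for word, n in counts.most_common():
    print(word, n)
```

On a real corpus the most frequent neighbours are usually grammatical words such as "the" and "and", which is one reason association measures are preferred over raw counts.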
Corpora and Language Teaching
• Textbooks
• Dictionaries
• Classroom Exercises
• Tests
• Learner Corpora
Limitations of Corpus linguistics
• A corpus won’t tell us if something is possible in a language, or
well-formed. E.g. is “he expired of heart disease” acceptable
English?
• Any generalisations we make from corpus data can only be
deductions – not facts.
• Corpora give us evidence, but not information or
explanations. Why do women say “wash” more than men?
• Corpora give us language out of context – so no visual
information e.g. pictures, fonts etc. And with spoken data –
no information on what the speakers look like, behaviour or
body language.
Further Reading
• McEnery, Tony & Wilson, Andrew (2001) Corpus Linguistics.
Edinburgh: Edinburgh University Press. Chapter 1.
• Hunston, S. (2002) Corpora in Applied Linguistics.
Cambridge: Cambridge University Press. Chapter 1.
Question 1
What is a corpus?
• A theory of language.
• A collection of texts stored on a computer.
• An electronic database similar to a dictionary.
• Any large collection of words such as a
collection of books, newspapers or magazines.
Question 2
What is the main reason for using corpora?
• Other methods of language analysis are not reliable.
• Computers can confirm our intuitions about language.
• Computers can help us discover interesting patterns in
language which would be difficult to spot otherwise.
• With corpora we can answer all research questions
about language.
Question 3
What is corpus annotation?
• Adding an extra layer of information to the
text to allow for more sophisticated searches.
• Separating text into sentences.
• Manual coding of text for parts of speech.
• Adding critical comments to a text.
Question 4
What is a specialised corpus?
• A corpus that is used for historical language investigations.
• A corpus that is composed of a large variety of genres.
• A corpus that is used by language specialists.
• A corpus that focuses on e.g. one type of genre, one period,
one place etc.
Question 5
Which of these is NOT a type of corpus?
• Multilingual corpus
• Learner corpus
• Diachronic corpus
• Observer corpus
Question 6
What is the BNC?
• A large general corpus of British English.
• A corpus of different genres of English writing.
• A large spoken corpus of British English.
• A specialised corpus representing the language of
newspapers.
Question 7
Which of these statements is NOT true about a
monitor corpus?
• It is frequently updated.
• The Bank of English is an example of a monitor
corpus.
• The BNC is an example of a monitor corpus.
• It is used to monitor rapid change in language.
Question 8
What is a concordance?
• Information about word frequencies normalised
per million words.
• Listing of examples of a word searched in a
corpus with some context on the right and some
context on the left.
• An alphabetical list of words that appear in a text.
• A list of words and their frequencies that can be
used for identifying important words in a text.
Question 9
What is collocation?
• The tendency of speakers to talk over each other.
• The tendency of words to co-occur with one
another.
• The tendency of words to appear in unique,
different contexts each time.
• The tendency of sentences to create meaning.
Question 10
What is a frequency distribution in a corpus?
• Information about how frequent a word is in a corpus.
• Information about the frequency of use of a term
across a number of different texts, corpus sections,
speakers etc.
• Information about how frequent a word is per million
words.
• Sociolinguistic information about the gender of the
speakers that are represented in a corpus.
Brown and LOB
These corpora are sometimes referred to as ‘snapshot’ corpora - their design is such that they try to represent
a broad range of genres of published, professionally authored, English. Their goal is to capture the language at
one moment in time, hence the term ‘snapshot’.
Of course, as with any snapshot there are things you see and things you do not see. So, in this case, we are
looking at professionally authored written English - not speech and not writing of a more informal variety. We
are also only looking at certain genres. As with any snapshot, it was taken at a certain point of time in a certain
place - Brown is America in the early 1960s, LOB is the UK in the early 1960s. Such corpora are often used to
compare and contrast varieties of a language - in this case two varieties of English. They can also be looked at
on their own to explore either variety of English in its own right.
The Brown corpus is so named because it was developed at Brown University in the US. LOB is an acronym,
standing for Lancaster-Oslo-Bergen, the three Universities that collaborated to build that corpus.
Back to the snapshot metaphor! The two corpora can be compared because they are composed in the same
way - the subject is the same, if you like. They look at broadly the same genres. Those genres are represented
by similar numbers of similarly sized chunks of data. Also, of course, the data was gathered in roughly the same
time period.
The genres covered in the two corpora are outlined below. Note the letter code for each genre - that is
important, as it shows you which genre is associated with which file in the corpus. Following the letter code is a
description of the type of data in the category, followed by two numbers in parentheses - the first is the
number of chunks of data in that category in Brown, the second is the number of chunks of data in that
category in LOB. There are five hundred chunks of data in each corpus. Each chunk is approximately 2,000
words in size, giving a rough overall corpus size of 1,000,000 words each.
корпусная лингвистика

More Related Content

Viewers also liked

최신작게임『SX797』『СOM』바카라싸이트
최신작게임『SX797』『СOM』바카라싸이트최신작게임『SX797』『СOM』바카라싸이트
최신작게임『SX797』『СOM』바카라싸이트gaoisjdoaj
 
Himanshu singh visual cv
Himanshu singh visual cvHimanshu singh visual cv
Himanshu singh visual cvHimanshu Singh
 
Jeff Guidie - Sales Executive
Jeff Guidie - Sales ExecutiveJeff Guidie - Sales Executive
Jeff Guidie - Sales ExecutiveJamay Hazley
 
Payler. Общая презентация
Payler. Общая презентацияPayler. Общая презентация
Payler. Общая презентацияPayler_
 
resume (2)
resume (2)resume (2)
resume (2)muath me
 
Viramgam mhv report 2014
Viramgam mhv report 2014Viramgam mhv report 2014
Viramgam mhv report 2014DHARASANSTHAN
 
Proiectul pilot „Îmbunătățirea serviciilor de aprovizionare cu apă și canaliz...
Proiectul pilot „Îmbunătățirea serviciilor de aprovizionare cu apă și canaliz...Proiectul pilot „Îmbunătățirea serviciilor de aprovizionare cu apă și canaliz...
Proiectul pilot „Îmbunătățirea serviciilor de aprovizionare cu apă și canaliz...ADR Nord
 
Hacking Go Compiler Internals / GoCon 2014 Autumn
Hacking Go Compiler Internals / GoCon 2014 AutumnHacking Go Compiler Internals / GoCon 2014 Autumn
Hacking Go Compiler Internals / GoCon 2014 AutumnMoriyoshi Koizumi
 
사행성식보 사이트 『OX600』。『COM』홀덤배우기 사이트
사행성식보 사이트 『OX600』。『COM』홀덤배우기 사이트사행성식보 사이트 『OX600』。『COM』홀덤배우기 사이트
사행성식보 사이트 『OX600』。『COM』홀덤배우기 사이트gaoisjdoaj
 
사행성섯다 ''SX797.COM'' 마작게임
사행성섯다 ''SX797.COM'' 마작게임사행성섯다 ''SX797.COM'' 마작게임
사행성섯다 ''SX797.COM'' 마작게임gaoisjdoaj
 
Presentación sin título
Presentación sin títuloPresentación sin título
Presentación sin títuloElio MasyRubi
 

Viewers also liked (15)

For real
For realFor real
For real
 
최신작게임『SX797』『СOM』바카라싸이트
최신작게임『SX797』『СOM』바카라싸이트최신작게임『SX797』『СOM』바카라싸이트
최신작게임『SX797』『СOM』바카라싸이트
 
Himanshu singh visual cv
Himanshu singh visual cvHimanshu singh visual cv
Himanshu singh visual cv
 
Jeff Guidie - Sales Executive
Jeff Guidie - Sales ExecutiveJeff Guidie - Sales Executive
Jeff Guidie - Sales Executive
 
Payler. Общая презентация
Payler. Общая презентацияPayler. Общая презентация
Payler. Общая презентация
 
resume (2)
resume (2)resume (2)
resume (2)
 
Viramgam mhv report 2014
Viramgam mhv report 2014Viramgam mhv report 2014
Viramgam mhv report 2014
 
Proiectul pilot „Îmbunătățirea serviciilor de aprovizionare cu apă și canaliz...
Proiectul pilot „Îmbunătățirea serviciilor de aprovizionare cu apă și canaliz...Proiectul pilot „Îmbunătățirea serviciilor de aprovizionare cu apă și canaliz...
Proiectul pilot „Îmbunătățirea serviciilor de aprovizionare cu apă și canaliz...
 
Hacking Go Compiler Internals / GoCon 2014 Autumn
Hacking Go Compiler Internals / GoCon 2014 AutumnHacking Go Compiler Internals / GoCon 2014 Autumn
Hacking Go Compiler Internals / GoCon 2014 Autumn
 
Guia metodos
Guia metodosGuia metodos
Guia metodos
 
Piscina quirinopolis
Piscina quirinopolisPiscina quirinopolis
Piscina quirinopolis
 
사행성식보 사이트 『OX600』。『COM』홀덤배우기 사이트
사행성식보 사이트 『OX600』。『COM』홀덤배우기 사이트사행성식보 사이트 『OX600』。『COM』홀덤배우기 사이트
사행성식보 사이트 『OX600』。『COM』홀덤배우기 사이트
 
사행성섯다 ''SX797.COM'' 마작게임
사행성섯다 ''SX797.COM'' 마작게임사행성섯다 ''SX797.COM'' 마작게임
사행성섯다 ''SX797.COM'' 마작게임
 
Coursera B5AZ6GGSFS9S
Coursera B5AZ6GGSFS9SCoursera B5AZ6GGSFS9S
Coursera B5AZ6GGSFS9S
 
Presentación sin título
Presentación sin títuloPresentación sin título
Presentación sin título
 

Similar to корпусная лингвистика

A2 english language word formation processes
A2 english language word formation processesA2 english language word formation processes
A2 english language word formation processesRobertagillum
 
Comparing the differences between standard english and singlish.finihed one!!!!
Comparing the differences between standard english and singlish.finihed one!!!!Comparing the differences between standard english and singlish.finihed one!!!!
Comparing the differences between standard english and singlish.finihed one!!!!VLADV423
 
Powerpoint elrs 2
Powerpoint elrs 2Powerpoint elrs 2
Powerpoint elrs 2Josh Roy
 
The Corpus In The Classroom
The Corpus In The ClassroomThe Corpus In The Classroom
The Corpus In The ClassroomColin Graham
 
NLP_guest_lecture.pdf
NLP_guest_lecture.pdfNLP_guest_lecture.pdf
NLP_guest_lecture.pdfSoha82
 
How to Paraphrase a Sentence & Effective Word Choice
How to Paraphrase a Sentence & Effective Word ChoiceHow to Paraphrase a Sentence & Effective Word Choice
How to Paraphrase a Sentence & Effective Word Choicesejin cheon
 
June2010 feedback How to tackle the yr 13 Language Exam
June2010 feedback How to tackle the yr 13 Language ExamJune2010 feedback How to tackle the yr 13 Language Exam
June2010 feedback How to tackle the yr 13 Language Examsteddyss
 
English Essay. How to Write an English Essay with Sample Essays - wikiHow - H...
English Essay. How to Write an English Essay with Sample Essays - wikiHow - H...English Essay. How to Write an English Essay with Sample Essays - wikiHow - H...
English Essay. How to Write an English Essay with Sample Essays - wikiHow - H...Bridget Zhao
 
Kirike Dictionary[1]
Kirike Dictionary[1]Kirike Dictionary[1]
Kirike Dictionary[1]Clara Rufai
 
Patrick Hanks - Why lexicographers should take more notice of phraseology, co...
Patrick Hanks - Why lexicographers should take more notice of phraseology, co...Patrick Hanks - Why lexicographers should take more notice of phraseology, co...
Patrick Hanks - Why lexicographers should take more notice of phraseology, co...Scottish Language Dictionaries
 
FAQs about the English Language: Vocabulary
FAQs about the English Language: VocabularyFAQs about the English Language: Vocabulary
FAQs about the English Language: VocabularyESL Reading
 
Writing research articles in English, by Adrian Wallwork
Writing research articles in English, by Adrian WallworkWriting research articles in English, by Adrian Wallwork
Writing research articles in English, by Adrian Wallworkcampusmarenostrum
 
Academic writing
Academic writingAcademic writing
Academic writingBSBEtalk
 
what_is_language.ppt.pdf
what_is_language.ppt.pdfwhat_is_language.ppt.pdf
what_is_language.ppt.pdfking969492
 
Scientific and Technical Translation in English: Week 2
Scientific and Technical Translation in English: Week 2Scientific and Technical Translation in English: Week 2
Scientific and Technical Translation in English: Week 2Ron Martinez
 

Similar to корпусная лингвистика (20)

A2 english language word formation processes
A2 english language word formation processesA2 english language word formation processes
A2 english language word formation processes
 
Comparing the differences between standard english and singlish.finihed one!!!!
Comparing the differences between standard english and singlish.finihed one!!!!Comparing the differences between standard english and singlish.finihed one!!!!
Comparing the differences between standard english and singlish.finihed one!!!!
 
NCIHC WEBINAR: Translation as a Tool in the Interpreter Toolbox
NCIHC WEBINAR: Translation as a Tool in the Interpreter ToolboxNCIHC WEBINAR: Translation as a Tool in the Interpreter Toolbox
NCIHC WEBINAR: Translation as a Tool in the Interpreter Toolbox
 
Powerpoint elrs 2
Powerpoint elrs 2Powerpoint elrs 2
Powerpoint elrs 2
 
The Corpus In The Classroom
The Corpus In The ClassroomThe Corpus In The Classroom
The Corpus In The Classroom
 
NLP_guest_lecture.pdf
NLP_guest_lecture.pdfNLP_guest_lecture.pdf
NLP_guest_lecture.pdf
 
lexicographic evidence
lexicographic evidencelexicographic evidence
lexicographic evidence
 
How to Paraphrase a Sentence & Effective Word Choice
How to Paraphrase a Sentence & Effective Word ChoiceHow to Paraphrase a Sentence & Effective Word Choice
How to Paraphrase a Sentence & Effective Word Choice
 
June2010 feedback How to tackle the yr 13 Language Exam
June2010 feedback How to tackle the yr 13 Language ExamJune2010 feedback How to tackle the yr 13 Language Exam
June2010 feedback How to tackle the yr 13 Language Exam
 
Discourse
Discourse Discourse
Discourse
 
English Essay. How to Write an English Essay with Sample Essays - wikiHow - H...
English Essay. How to Write an English Essay with Sample Essays - wikiHow - H...English Essay. How to Write an English Essay with Sample Essays - wikiHow - H...
English Essay. How to Write an English Essay with Sample Essays - wikiHow - H...
 
Kirike Dictionary[1]
Kirike Dictionary[1]Kirike Dictionary[1]
Kirike Dictionary[1]
 
Patrick Hanks - Why lexicographers should take more notice of phraseology, co...
Patrick Hanks - Why lexicographers should take more notice of phraseology, co...Patrick Hanks - Why lexicographers should take more notice of phraseology, co...
Patrick Hanks - Why lexicographers should take more notice of phraseology, co...
 
FAQs about the English Language: Vocabulary
FAQs about the English Language: VocabularyFAQs about the English Language: Vocabulary
FAQs about the English Language: Vocabulary
 
Syntax
SyntaxSyntax
Syntax
 
Writing research articles in English, by Adrian Wallwork
Writing research articles in English, by Adrian WallworkWriting research articles in English, by Adrian Wallwork
Writing research articles in English, by Adrian Wallwork
 
Academic writing
Academic writingAcademic writing
Academic writing
 
what_is_language.ppt.pdf
what_is_language.ppt.pdfwhat_is_language.ppt.pdf
what_is_language.ppt.pdf
 
Scientific and Technical Translation in English: Week 2
Scientific and Technical Translation in English: Week 2Scientific and Technical Translation in English: Week 2
Scientific and Technical Translation in English: Week 2
 
Writing Process
Writing ProcessWriting Process
Writing Process
 

Recently uploaded

Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfUmakantAnnand
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppCeline George
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
PSYCHIATRIC History collection FORMAT.pptx
PSYCHIATRIC   History collection FORMAT.pptxPSYCHIATRIC   History collection FORMAT.pptx
PSYCHIATRIC History collection FORMAT.pptxPoojaSen20
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Celine George
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsKarinaGenton
 

Recently uploaded (20)

Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.Compdf
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website App
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
PSYCHIATRIC History collection FORMAT.pptx
PSYCHIATRIC   History collection FORMAT.pptxPSYCHIATRIC   History collection FORMAT.pptx
PSYCHIATRIC History collection FORMAT.pptx
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its Characteristics
 

корпусная лингвистика

  • 1. Корпусная лингвистика Введение в корпусную лингвистику
  • 2. What is a corpus? • a collection of words? • Is it a theory or methodology of language?
  • 3. Why use a corpus? • Large amounts of data tell us about tendencies and what’s normal or typical in real-life language use • Corpora also reveal instances of very rare or exceptional cases, that we wouldn’t get from looking at single texts or introspection. • Human researchers make mistakes and are slow. Computers are much quicker and more accurate.
  • 4. Criteria in building a corpus 1. It must be a large body of text. 2. It needs to be representative of language (or a genre of language). 3. Must be in machine-readable form (e.g. txt files on a computer). 4. Acts as a standard reference about what’s typical in language. 5. Often annotated with additional linguistic information – e.g. grammatical codes.
  • 5. annotation and mark-up corpus texts may be enriched with additional information to ease analysis. Note that this type of additional information may be called ‘mark up’, ‘annotation’, or ‘tagging’. All three terms are near synonyms. Annotation usually refers to linguistic information encoded in a corpus - however, the encoding is achieved using a mark-up language. Similarly, the annotation itself is usually undertaken by putting so called tags - short codes to indicate some linguistics feature - into a text. Hence, while the terms can be separated, they can also be used inter-changeably! One final note - an xml tag finishes with a forward slash rather than a back slash.
  • 6. Some untagged text “Arrest warrant out for Clowes’ partner years before collapse.” By Daniel John A WARRANT for the arrest of the former partner of Mr Peter Clowes was issued seven years before his Barlow Clowes investment empire collapsed, according to evidence submitted to the Parliamentary Ombudsman.
  • 7. Add tags for headers and paragraphs <head type=MAIN> “Arrest warrant out for Clowes’ partner years before collapse.” </head> <head type=BYLINE> By Daniel John </head> <p> A WARRANT for the arrest of the former partner of Mr Peter Clowes was issued seven years before his Barlow Clowes investment empire collapsed, according to evidence submitted to the Parliamentary Ombudsman. </p>
  • 8. • Add sentence tags <head type=MAIN> <s n=001>“Arrest warrant out for Clowes’ partner years before collapse.” </head> <head type=BYLINE> <s n=002>By Daniel John </head> <p> <s n=003>A WARRANT for the arrest of the former partner of Mr Peter Clowes was issued seven years before his Barlow Clowes investment empire collapsed, according to evidence submitted to the Parliamentary Ombudsman. </p>
  • 9. Change quotes to SGML <head type=MAIN> <s n=001>&bquo;Arrest warrant out for Clowes’ partner years before collapse&equo; </head> <head type=BYLINE> <s n=002>By Daniel John </head> <p> <s n=003>A WARRANT for the arrest of the former partner of Mr Peter Clowes was issued seven years before his Barlow Clowes investment empire collapsed, according to evidence submitted to the Parliamentary Ombudsman. </p>
  • 10. Add tags for punctuation <head type=MAIN> <s n=001><c PUQ>&bquo;Arrest warrant out for Clowes<c PUN>’ partner years before collapse <c PUQ>&equo; </head> <head type=BYLINE> <s n=002>By Daniel John </head> <p> <s n=003>A WARRANT for the arrest of the former partner of Mr Peter Clowes was issued seven years before his Barlow Clowes investment empire collapsed, according to evidence submitted to the Parliamentary Ombudsman <c PUN>. </p>
  • 11. Add grammatical codes to word units <head type=MAIN> <s n=001><c PUQ>&bquo;<w NN1>Arrest <w NN1>warrant <w AVP>out <w PRP>for <w NP0>Clowes<c PUN>’ <w NN1>partner <w NN2>years <w PRP>before <w NN1>collapse <c PUQ>&equo; <c PUN>. </head> <head type=BYLINE> <s n=002><w PRP>By <w NP0>Daniel <w NP0>John </head> <p> <s n=003><w AT1>A <w NN1>WARRANT <w PRP>for <w AT0>the <w NN1>arrest <w PRF>of <w AT0>the <w DT0>former <w NN1>partner <w PRF>of <w NP0>Mr <w NP0>Peter <w NP0>Clowes <w VBD>was <w VVN>issued <w CRD>seven <w NN2>years <w CJS>before <w DPS>his <w NN1-NP0>Barlow <w NP0>Clowes <w NN1>investment <w NN1>empire <w VVD>collapsed<c PUN>, <w PRP>according to <w NN1>evidence <w VVN>submitted <w PRP>to <w AT0>the <w AJ0>Parliamentary <w NN1>Ombudsman<c PUN>. </p>
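Once a text carries word-level tags like those on the slide above, a program can recover each word together with its grammatical code. The following is a minimal sketch, assuming the tags always take the form `<w CODE>` or `<c CODE>` followed directly by the text they label; the function name `tagged_tokens` and the regular expression are illustrative and are not part of any corpus toolkit.

```python
import re

# Assumed tag shape: <w CODE> or <c CODE>, followed by the labelled text,
# which runs up to the next "<" or the end of the line.
TOKEN = re.compile(r"<(w|c) ([A-Z0-9-]+)>([^<]*)")


def tagged_tokens(line):
    """Return (kind, code, text) triples for each tagged unit in a line."""
    return [(kind, code, text.strip()) for kind, code, text in TOKEN.findall(line)]


# The byline from the slide, with its sentence and word tags.
sample = "<s n=002><w PRP>By <w NP0>Daniel <w NP0>John </head>"
tokens = tagged_tokens(sample)

for kind, code, text in tokens:
    print(kind, code, text)
```

Structural tags such as `<s n=002>` and `</head>` are skipped because the pattern only matches `w` (word) and `c` (punctuation) units.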
  • 12. Types of Corpora 1. Specialised corpus – e.g. • genre: the language of newspapers • time: 2005 to the present day • place: just texts published in China 2. General corpus – needs to be much larger. E.g. the British National Corpus (BNC) has about 100 million words of spoken and written British English.
  • 14. Types of Corpora 3. Multilingual corpus – e.g. English and Spanish. Or American English and Indian English. http://ice-corpora.net/ICE/INDEX.HTM 4. Parallel corpus – e.g. English and Spanish – exactly the same texts translated. E.g. the CRATER corpus http://catalog.elra.info/product_info.php?products_id=84 5. Learner corpus – language use created by people learning a particular language. E.g. the International Corpus of Learner English. 6. Historical or Diachronic corpus – e.g. Helsinki corpus – 1.5 million words of texts from 700AD to 1700AD. 7. Monitor corpus – continually being added to. e.g. the Bank of English http://www.collins.co.uk/page/Wordbanks+Online
  • 18. frequency data, concordances and collocation • Frequencies Your query "wash" returned 2415 matches in 952 different texts (in 97,626,093 words; freq: 24.74 instances per million words)
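The per-million figure in the query result above is a normalised frequency: the raw number of matches divided by the corpus size, scaled to one million words. A minimal sketch of the arithmetic, using the numbers from the slide; the function name `per_million` is purely illustrative.

```python
def per_million(matches, corpus_size):
    """Normalise a raw frequency to instances per million words."""
    return matches / corpus_size * 1_000_000


# Figures taken from the query result for "wash" on the slide.
freq = per_million(2415, 97_626_093)
print(round(freq, 2))
```

Normalising in this way allows frequencies from corpora of different sizes to be compared directly.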
  • 19. Concordances aka Key Word In Context
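A concordance lists every occurrence of a search word with a few words of context on either side. The sketch below is a simplified illustration, assuming whitespace-separated tokens and a case-insensitive match; the function `kwic`, its `width` parameter, and the shortened sample sentence (adapted from the newspaper text used earlier in the slides) are assumptions made for the example, not the behaviour of any specific concordancer.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, node word, right context) for each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


text = ("A warrant for the arrest of the former partner "
        "was issued seven years before the collapse")
hits = kwic(text.split(), "the", width=2)

# Right-align the left context so the node words line up in a column.
for left, node, right in hits:
    print(f"{left:>15}  {node}  {right}")
```

Aligning the node word in a central column is what makes recurring patterns to its left and right easy to spot.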
  • 23. Corpora and Language Teaching • Textbooks • Dictionaries • Classroom Exercises • Tests • Learner Corpora
  • 24. Limitations of Corpus Linguistics • It won’t tell us whether something is possible or well-formed in a language. E.g. is “he expired of heart disease” acceptable English? • Any generalisations we make from corpus data can only be deductions, not facts. • Corpora give us evidence, but not information or explanations. Why do women say “wash” more than men? • Corpora give us language out of context, so there is no visual information, e.g. pictures, fonts etc. And with spoken data there is no information on what the speakers look like, how they behave, or their body language.
  • 25. Further Reading • McEnery, Tony & Wilson, Andrew (2001) Corpus Linguistics. Edinburgh: Edinburgh University Press. Chapter 1. • Hunston, S. (2002) Corpora in Applied Linguistics. Cambridge: Cambridge University Press. Chapter 1.
  • 26. Question 1 What is a corpus? • A theory of language. • A collection of texts stored on a computer. • An electronic database similar to a dictionary. • Any large collection of words such as a collection of books, newspapers or magazines.
  • 27. Question 2 What is the main reason for using corpora? • Other methods of language analysis are not reliable. • Computers can confirm our intuitions about language. • Computers can help us discover interesting patterns in language which would be difficult to spot otherwise. • With corpora we can answer all research questions about language.
  • 28. Question 3 What is corpus annotation? • Adding an extra layer of information to the text to allow for more sophisticated searches. • Separating text into sentences. • Manual coding of text for parts of speech. • Adding critical comments to a text.
  • 29. Question 4 What is a specialised corpus? • A corpus that is used for historical language investigations. • A corpus that is composed of a large variety of genres. • A corpus that is used by language specialists. • A corpus that focuses on e.g. one type of genre, one period, one place etc.
  • 30. Question 5 Which of these is NOT a type of corpus? • Multilingual corpus • Learner corpus • Diachronic corpus • Observer corpus
  • 31. Question 6 What is the BNC? • A large general corpus of British English. • A corpus of different genres of English writing. • A large spoken corpus of British English. • A specialised corpus representing the language of newspapers.
  • 32. Question 7 Which of these statements is NOT true about a monitor corpus? • It is frequently updated. • The Bank of English is an example of a monitor corpus. • The BNC is an example of a monitor corpus. • It is used to monitor rapid change in language.
  • 33. Question 8 What is a concordance? • Information about word frequencies normalised per million words. • Listing of examples of a word searched in a corpus with some context on the right and some context on the left. • An alphabetical list of words that appear in a text. • A list of words and their frequencies that can be used for identifying important words in a text.
  • 34. Question 9 What is collocation? • The tendency of speakers to talk over each other. • The tendency of words to co-occur with one another. • The tendency of words to appear in unique, different contexts each time. • The tendency of sentences to create meaning.
  • 35. Question 10 What is a frequency distribution in a corpus? • Information about how frequent a word is in a corpus. • Information about the frequency of use of a term across a number of different texts, corpus sections, speakers etc. • Information about how frequent a word is per million words. • Sociolinguistic information about the gender of the speakers that are represented in a corpus.
  • 36. Brown and LOB These corpora are sometimes referred to as ‘snapshot’ corpora: their design tries to represent a broad range of genres of published, professionally authored English. Their goal is to capture the language at one moment in time, hence the term ‘snapshot’. Of course, as with any snapshot, there are things you see and things you do not see. In this case, we are looking at professionally authored written English, not speech and not writing of a more informal variety. We are also only looking at certain genres. As with any snapshot, it was taken at a certain point in time in a certain place: Brown is the US in the early 1960s, LOB is the UK in the early 1960s. Such corpora are often used to compare and contrast varieties of a language, in this case two varieties of English. They can also be examined on their own to explore either variety of English in its own right. The Brown corpus is so named because it was developed at Brown University in the US. LOB is an acronym standing for Lancaster-Oslo-Bergen, the three universities that collaborated to build that corpus. Back to the snapshot metaphor! The two corpora can be compared because they are composed in the same way; the subject is the same, if you like. They cover broadly the same genres, and those genres are represented by similar numbers of similarly sized chunks of data. Also, of course, the data was gathered in roughly the same time period. The genres covered in the two corpora are outlined below. Note the letter code for each genre: it is important, as it shows you which genre is associated with which file in the corpus. Following the letter code is a description of the type of data in the category, followed by two numbers in parentheses: the first is the number of chunks of data in that category in Brown, the second is the number of chunks in that category in LOB. There are five hundred chunks of data in each corpus. 
Each chunk is approximately 2,000 words in size, giving a rough overall corpus size of 1,000,000 words each.