Corpus linguistics is the study of language based on large collections of real-world language samples stored electronically. A corpus is such a collection of written or spoken language samples, which can be analyzed using specialized software. Computer-assisted analysis of large text collections makes the analysis reliable, accurate, and replicable, and it yields insights into language usage that were previously difficult to obtain at a large scale.
Corpus Linguistics
6. Computers and corpus linguistics
• Historically, manual analysis of large bodies of text (esp. in literary and biblical studies)
– Error-prone, time-consuming, not verifiable
• Computers have introduced
– Reliability, accuracy and replicability
– Increased speed and capacity, which means you can do more on a grander scale
– New tools, which mean you can do things you might not have thought of doing
7. What is a corpus?
• Corpus (pl. corpora) = ‘body’
• Collection of written text or transcribed speech
• Usually but not necessarily purposefully collected
• Usually but not necessarily structured
• Usually but not necessarily annotated
• (Usually stored on and accessible via computer)
• Corpus ~ text archive
10. DEFINITION
• Corpus linguistics is the study of language based on large collections of "real life" language use stored in corpora (or corpuses): computerized databases created for linguistic research. Also known as corpus-based studies.
• Corpus linguistics is a method of carrying out
linguistic analyses.
11. • Corpus linguistics thus is the analysis of
naturally occurring language on the basis of
computerized corpora.
• Usually, the analysis is performed with the
help of the computer, i.e. with specialised
software, and takes into account the
frequency of the phenomena investigated.
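A frequency count of this kind can be sketched in a few lines of Python. This is only an illustrative sketch, not a tool from the slides: the function names `word_frequencies` and `per_million` and the sample sentence are invented here, and the tokenizer is a deliberately crude regular expression. Normalising to occurrences per million tokens is a common way to compare counts across corpora of different sizes.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return (Counter of lowercase word tokens, total token count)."""
    # Crude tokenizer: runs of letters and apostrophes only (an assumption).
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens), len(tokens)


def per_million(count, corpus_size):
    """Normalise a raw count to occurrences per million tokens."""
    return count * 1_000_000 / corpus_size


# Invented sample text, standing in for a real corpus.
sample = "The cat sat on the mat. The dog sat on the log."
freqs, size = word_frequencies(sample)

for word, count in freqs.most_common(3):
    print(word, count, round(per_million(count, size)))
```

With a real corpus, `sample` would be replaced by the text read from the corpus files.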
12. • The availability of computers and machine-
readable text has made it possible to get data
quickly and easily and also to have this data
presented in a format suitable for analysis.
• The main task of the corpus linguist is not to
find the data but to analyse it.
13. Daniel Nkemleke, Humboldt
Kolleg Kamerun, 30/07/2008
Introduction: what is Corpus Linguistics?
• The study of language based on examples of "real life" language use, collected, stored and processed via computer
• Facilitated by the advent of computer technology (1960s)
• Latin: corpus (body): a body of text, i.e. any collection of more than one text, written or spoken
The word "corpus", derived from the Latin word meaning "body", may be used to
refer to any text in written or spoken form.
14. What is corpus linguistics?
• Not a branch of linguistics, like socio~,
psycho~, …
• Not a theory of linguistics
• A set of tools and methods (and a philosophy)
to support linguistic investigation across all
branches of the subject
15. INTRODUCTION TO CORPUS LINGUISTICS
• Corpus linguistics can be described as the
study of language based on text corpora.
• A corpus is a collection of machine-readable,
authentic texts, chosen to characterize or
represent a state or variety of a language.
• Corpus v. Text archive
• Representativeness
16. WHY USE CORPORA?
• Authenticity
• Objectivity
• Verifiability
• Exposure to large amounts of data
• New insights into language
• Enhancement of learner motivation
18. Best known corpora
• The Birmingham Collection of English Texts
(COBUILD)
• The Bank of English
• The British National Corpus (BNC)
• The Brown Corpus
• The Lancaster-Oslo/Bergen Corpus (LOB)
• The Helsinki Corpus of English Texts: Diachronic
and Dialectal
• The International Corpus of English (ICE)
19. Some (main) existing corpora
L1 Corpora
• Brown Corpus of American English
• Lancaster-Oslo/Bergen Corpus (LOB)
• London-Lund Corpus
• British National Corpus (BNC)
• Birmingham Corpus of British English
L2 Corpora
• ICE-East Africa (Kenya & Tanzania)
• Corpus of Cameroon English
• Corpus of Nigerian English ??
• Kolhapur Corpus of Indian English
Multinational Corpus Project
• International Corpus of English (ICE)
21. BNC (1995)
• http://www.natcorp.ox.ac.uk/
• 100m word collection of written and spoken
text from 1975-93 (already dated in some
respects!)
• Carefully designed and balanced
• Corpus is closed (finite, synchronic)
• All text tagged to high quality
• Lots of tools available for exploration
23. Corpus utility
• possible ways in which a corpus may be useful
1. Corpora as a source of empirical data
2. Corpora in language teaching and learning
3. Corpora in Lexical studies
4. Corpora in grammar studies
5. Corpora in speech research
6. Corpora and semantic studies
7. Corpora in pragmatic and discourse studies
8. Corpora in sociolinguistic studies
9. Corpora and stylistic studies
10. Corpora in historical linguistics
11. Corpora in dialectology and variational studies
12. Corpora in Psycholinguistics
13. Corpora in cultural studies
24. BTANT 129 w5
International Corpus of English
• 20 corpora of 1 m words devoted to varieties
of English around the world
• 500 texts (300 spoken, 200 written) of 2000 words each
• time span: 1990-1996
• ICE-GB available in demo version
• syntactic annotation, graphical tool ICECUP
25. British National Corpus
• 100 m words, careful selection
• 10 % spoken material
• time span: from 1960 (fiction), from 1975 (non-fiction)
• texts of 40,000 to 50,000 words
• TEI compliant SGML coding
• http://www.comp.lancs.ac.uk/ucrel/bncindex/
26. Short history
Brief mention of just a select few!
• Brown Corpus (Brown university)
– 1 m words
– 15 genres
– 500 samples 2000 words each
– Area: US
– Time: 1961
• LOB Corpus (Lancaster-Oslo/Bergen)
– GB replica of Brown
27. Historical background of Corpus Linguistics
• R. Quirk’s Survey of English Usage (SEU)
• Advent of computers
• First corpora
• The Brown Corpus
28. Corpus creation
• The design of a corpus is dependent upon
the type of a corpus and purpose for which
the corpus is to be used.
• Types of corpora (sample, monitor, general,
spoken, written, learner, translation, parallel,
comparable, etc).
29. New insights into language
• Sinclair noted (1991:1) that “traditionally linguistics has been
limited to what a single individual could experience and
remember… Starved of adequate data, linguistics languished –
indeed it became totally introverted. It became fashionable to
look inwards to the mind rather than outwards to society.
Intuition was the key, and similarity of language structure to
various formal models was emphasized. The communicative
role of language was hardly referred to…. Students of
linguistics over many years have been urged to rely heavily on
their intuition and to prefer their intuitions to actual text
where there is some discrepancy. Their study has, therefore,
been more about intuition than about language”.
30. New insights into language
• Many subtle observations.
• Corpora can help learners discover new
meanings of the words they already know.
• New understanding of meaning in Corpus
Linguistics.
44. Example -
• Some differences between strong and powerful (source: British National Corpus):
– strong: wind, feeling, accent, flavour
– powerful: tool, weapon, punch, engine
• The differences are subtle, but examining their collocates helps.
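The idea of comparing collocates can be sketched as a simple window count around a node word. This is a minimal illustration under stated assumptions: the function name `collocates`, the window size, and the toy sentences are all invented here and are not BNC data. Real collocation studies usually add an association measure on top of raw counts.

```python
import re
from collections import Counter


def collocates(text, node, window=1):
    """Count words occurring within `window` tokens of each hit for `node`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # skip the node word itself
                    counts[tokens[j]] += 1
    return counts


# Invented toy text (an assumption), not taken from the BNC.
toy = ("A strong wind blew. She has a strong accent. "
       "It is a powerful tool. He built a powerful engine. "
       "A strong wind again.")

strong = collocates(toy, "strong")
powerful = collocates(toy, "powerful")
print(strong.most_common())
print(powerful.most_common())
```

On this toy text, "wind" and "accent" appear only next to "strong", and "tool" and "engine" only next to "powerful", which mirrors the contrast on the slide.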
45. Example (British National Corpus)
• British National Corpus (BNC):
– 100 million words of English
• 90% written, 10% spoken
– Designed to be representative and balanced.
– Texts from different genres (literature, news,
academic writing…)
– Annotated: Every single word is accompanied by
part-of-speech information.
46. Example -
• A sentence in the BNC:
– Explosives found on Chalisgaon station.
• <s>
• <w NN2>Explosives
• <w VVD>found
• <w PRP>on
• <w NP0>Chalisgaon
• <w NP0>station
• <PUN>.
47. Example (continued)
Explosives found on Chalisgaon station.
• <s> (new sentence)
• <w NN2>Explosives (plural noun)
• <w VVD>found (past tense verb)
• <w PRP>on (preposition)
• <w NP0>Chalisgaon (proper noun)
• <w NP0>station (noun)
• <PUN>. (punctuation)
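The tagged sentence above can be turned into (word, tag) pairs with a short script. This sketch parses only the simplified notation shown on the slide; the actual BNC distribution files use a richer markup, so the regular expression and the name `parse_tagged` are assumptions made for illustration.

```python
import re

# The tagged sentence exactly as printed on the slide.
TAGGED = """<s>
<w NN2>Explosives
<w VVD>found
<w PRP>on
<w NP0>Chalisgaon
<w NP0>station
<PUN>."""

# Matches "<w TAG>word" and "<PUN>." ; "<s>" carries no word and is skipped.
TOKEN_RE = re.compile(r"<(?:w\s+)?([A-Z0-9]+)>(.+)")


def parse_tagged(block):
    """Return a list of (word, tag) pairs from slide-style tagged lines."""
    pairs = []
    for line in block.splitlines():
        match = TOKEN_RE.match(line.strip())
        if match:
            pairs.append((match.group(2), match.group(1)))
    return pairs


pairs = parse_tagged(TAGGED)
for word, tag in pairs:
    print(f"{word}\t{tag}")
```

Keeping words and tags as pairs makes it straightforward to search by word form, by tag, or by both.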
48. One quick example
• Representativity or representativeness
• Throw the two words at Google and have a
look at the figures
• Think about the conclusions
• There are special front-end sites
53. Important to note
• This is not “raw” text.
– Annotation means we can search for particular patterns.
– E.g. for the quiver/quake study: “find all occurrences of quiver
which are verbs, followed by a determiner and a noun”
• The collection is very large
– Only in very large collections are we likely to find rare
occurrences.
• Corpus search is done by computer. You can’t trawl
through 100 million words manually!
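A pattern search of the kind described ("quiver as a verb, followed by a determiner and a noun") can be sketched over a list of (word, tag) pairs. Everything specific here is an assumption for illustration: the function name `find_verb_det_noun`, the toy sentences, the choice of AT0, DT0 and DPS as determiner-like tags, and the use of a word prefix in place of proper lemmatisation.

```python
# Tags treated as determiners in this sketch (an assumption about the tagset).
DETERMINER_TAGS = {"AT0", "DT0", "DPS"}


def find_verb_det_noun(tokens, stem):
    """Find verb forms starting with `stem`, followed by determiner + noun."""
    hits = []
    for i in range(len(tokens) - 2):
        (w1, t1), (w2, t2), (w3, t3) = tokens[i:i + 3]
        if (w1.lower().startswith(stem)   # crude stand-in for lemma matching
                and t1.startswith("V")    # verb tags begin with V
                and t2 in DETERMINER_TAGS
                and t3.startswith("N")):  # noun tags begin with N
            hits.append((w1, w2, w3))
    return hits


# Invented toy data: one verbal use and one nominal use of "quiver".
toy = [
    ("The", "AT0"), ("bird", "NN1"), ("quivered", "VVD"),
    ("its", "DPS"), ("wings", "NN2"), (".", "PUN"),
    ("A", "AT0"), ("quiver", "NN1"), ("of", "PRF"),
    ("arrows", "NN2"), (".", "PUN"),
]

hits = find_verb_det_noun(toy, "quiver")
print(hits)
```

Only the verbal use is returned; the noun "quiver" in the second toy sentence is skipped because of its tag, which is exactly what annotation makes possible and raw text does not.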