The document discusses corpus linguistics and different types of corpora. It defines corpus linguistics as the study of language based on large collections of electronic texts, known as corpora. It describes general corpora, specialized corpora, historical/diachronic corpora, regional corpora, learner corpora, multilingual corpora, comparable corpora, and parallel corpora. It also discusses corpus annotation, concordancing, frequency and keyword lists, collocation, and software used for corpus analysis.
Presentation developed for the class of Tópicos de Semântica em Inglês, under the responsibility of Professor Elizabeth at the University of São Paulo, in the first semester of 2014.
6. A Latin word meaning “body” or “mass”
A collection of written texts, especially the entire
works of a particular author or a body of writing
on a particular subject: “the Darwinian corpus”
Plural: corpora
7. History of Corpus Linguistics
Language study is not a new idea.
1921: 30,000 words. A Treasure, but of no use.
1960s, with the advent of the computer:
The use of collections of COMPUTER-READABLE text for
language study.
Brown Corpus of Standard American English.
One million words of American English texts printed in 1961; the corpus was first released in 1964.
The first electronic corpus.
8.
9. Corpus Linguistics
Linguistics being the scientific study of language
and its structure, ‘corpus linguistics’ is the study
of language “on the basis of text corpora.”
The analysis does not stop at describing those
texts; it also examines the contexts in which they
occur.
10. Place for Corpus Linguistics in
Applied Linguistics
A means to explore actual patterns of language use.
A tool for developing materials for classroom language
instruction.
To explore different questions about language use.
To provide powerful tools for analysis of natural
languages.
To give an insight about how language use varies in
different situations.
11. Corpora
Corpora are large, structured sets of texts
(nowadays usually electronically stored and
processed).
They are used for statistical analysis and
hypothesis testing, checking occurrences or
validating linguistic rules within a specific
language territory.
12.
13. General Corpora
Texts that do not belong to a single text type,
subject field, or register.
May include written or spoken language, or both.
May include texts produced in one country or
many.
They aim to represent language in its broadest
sense and to serve as a widely available resource
for baseline or comparative studies of general
linguistic features.
14. May be used to produce reference materials for
language learning or translation.
Often used as a baseline in comparison with more
specialized corpora.
Also sometimes known as ‘reference corpora’.
15. Examples
Brown Corpus – 1 million words.
LOB Corpus – 1 million words.
BNC (British National Corpus) – 100 million
words.
16. Specialized Corpora
Texts selected with more specific research goals
in mind: register-specific descriptions and
investigations of language.
They aim to be representative of a given type of text.
Used to investigate a particular type of language.
The kinds of text included are limited by:
A time frame – such as a particular century.
A social setting – such as conversations taking place in
a bookshop.
A given topic – such as newspaper articles dealing
with a particular thing.
17. Examples
Cambridge and Nottingham Corpus
of Discourse in English
(CANCODE) (informal registers of
British English) – 5 million words.
Michigan Corpus of Academic
Spoken English (MICASE) (spoken
registers in a US academic setting) –
approximately 1.8 million words.
18. Historical or Diachronic Corpora
Texts from different
periods of time.
Aim to represent one or more
earlier stages of a
language.
They help to trace the
development of a language
over time.
21. Learner Corpora
Aim to represent the language as produced by
learners of a language; they include spoken or
written language samples produced by non-native
speakers.
They are used to identify differences among learners in
the frequency of words and the types of mistakes,
and to show in what respects learners differ from each other and
from native speakers.
22. Examples
Louvain Corpus of Native
English Essays (LOCNESS)
International Corpus of Learner
English (ICLE)
20,000 words
23. Multilingual Corpora
Any systematic collection of empirical language data
enabling linguists to carry out analyses of multilingual
individuals, multilingual societies or multilingual
communication.
24. Comparable Corpora
Two (or more) corpora in different languages (e.g. English
and Spanish) or in different varieties of a language (e.g.
Indian English and Canadian English).
They are designed along the same lines – will contain the
same proportions of newspaper texts, novels, casual
conversation, etc.
Comparable corpora of varieties of the same language can be
used to compare those varieties.
Comparable corpora of different languages can be used by
translators to identify differences and equivalences in each
language.
25. Example
The International Corpus of English (ICE) comprises
comparable corpora of 1 million words each,
covering different varieties of English.
26. Parallel Corpora
Two (or more) corpora in different languages, each
containing texts that have been translated from one
language into the other, or texts that have been produced
simultaneously in two or more languages.
Can be used by translators and by learners to find
potential equivalent expressions in each language and to
investigate differences between languages.
27.
28. Size
Representativeness
Registers / modes / topics
Demographics
Production / reception
Research goals
Funding
Time
Staff/students
31. Spoken Corpora
Deciding on a transcription system:
I. prosodic/non-prosodic
II. representing interactional characteristics of
speech (overlapping speech, backchannels,
pauses, non-verbal contextual events)
III. permission to use data
IV. ensuring anonymity
V. avoiding impracticality of data
32. Markup
1. Structural markup:
- written corpus: titles, authors, paragraphs, subheadings,
chapters, etc.
- spoken corpus: contextual events, paralinguistic features
2. Header:
- written corpus:
classification into categories (register, genre, topic domain, discourse
mode, formality)
- spoken corpus:
demographic information about speakers (gender, social
class, occupation, age, native language/dialect)
relationship among the participants
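To make the distinction between header information and structural markup concrete, here is a minimal sketch in Python. The element and attribute names (`header`, `speaker`, `u`, `pause`, `event`) are invented for illustration and do not follow any particular encoding standard; the sample transcript is also made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for a tiny spoken text: the header holds demographic
# information, the body holds utterances with structural markup for
# pauses and contextual events.
SAMPLE = """
<text mode="spoken">
  <header>
    <setting>bookshop</setting>
    <speaker id="S1" gender="female" age="34" occupation="bookseller"/>
    <speaker id="S2" gender="male" age="51" occupation="customer"/>
  </header>
  <body>
    <u who="S1">can I help you <pause/> at all</u>
    <u who="S2">yes <event desc="phone rings"/> I am looking for a dictionary</u>
  </body>
</text>
""".strip()

def utterances(xml_text):
    """Return (speaker gender, plain words) for every utterance."""
    root = ET.fromstring(xml_text)
    # Header: map each speaker id to its demographic attributes.
    speakers = {s.get("id"): s.attrib for s in root.iter("speaker")}
    result = []
    for u in root.iter("u"):
        # itertext() skips the markup elements but keeps the words around them.
        words = " ".join("".join(u.itertext()).split())
        result.append((speakers[u.get("who")]["gender"], words))
    return result

for gender, words in utterances(SAMPLE):
    print(gender, "|", words)
```

Because the demographic information sits in the header rather than in the transcript itself, the same utterances can later be filtered by gender, age or occupation without touching the text.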
33. Linguistic Annotation
Parts of Speech Tagging:
Grammatical category, case assigning
Prosodic Annotation
Phonetic Annotation
Syntactic Parsing
34. Advantages of Tagging
Vast exploration
Frequency
Co-occurrence
Multiple meaning studies
Automatically retrievable
35.
36. Concordance Lines
Concordance lines are a useful tool for
investigating corpora, but their use is limited by
the ability of the human observer to process
information.
Other methods build on and go beyond concordance
lines, such as statistical calculations of
collocation and corpus annotation.
37. Frequency and Key-word Lists
A frequency list is a list of all the types in a corpus
together with the number of occurrences of each type.
Comparing the frequency lists for two corpora can give
interesting information
about the differences between the two sets of texts.
E.g. Kennedy (1998):
a comparison between a corpus of Economics texts
and one of general academic English → the words price,
cost, demand, curve, firm… are frequently found in the
Economics corpus.
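A frequency list of this kind can be sketched in a few lines of Python. The two short texts are invented stand-ins for an Economics corpus and a general academic corpus, and the tokenizer is a deliberately simple assumption (lowercase, letters and apostrophes only).

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(text):
    """Return (type, count) pairs, most frequent first, ties in alphabetical order."""
    counts = Counter(tokenize(text))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Toy texts standing in for two corpora (invented for illustration).
economics = "The price rose as demand rose. The firm cut the price."
general = "The study shows that the method works. The results are clear."

econ_freq = frequency_list(economics)
gen_freq = frequency_list(general)

print(econ_freq[:3])
print(gen_freq[:3])
```

Grammatical words such as "the" head both lists, which is why comparing the two lists is more informative than reading either one alone.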
38. Keywords
A useful starting point in investigating a
specialized corpus.
They can be lexical items which reflect the topic
of a particular text but also grammatical words
which convey more subtle information.
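The slides do not say how keywords are calculated. One commonly used approach compares a word's frequency in the corpus under study with its frequency in a reference corpus and scores the difference with a log-likelihood statistic. The sketch below assumes that approach; the function names and the toy token lists are illustrative only.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Log-likelihood score for a word occurring a times in a study corpus
    of c tokens and b times in a reference corpus of d tokens."""
    e1 = c * (a + b) / (c + d)  # expected count in the study corpus
    e2 = d * (a + b) / (c + d)  # expected count in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(study_tokens, reference_tokens):
    """Words relatively more frequent in the study corpus, highest score first."""
    study, ref = Counter(study_tokens), Counter(reference_tokens)
    c, d = len(study_tokens), len(reference_tokens)
    scored = []
    for word, a in study.items():
        b = ref[word]
        if a / c > b / d:  # keep only positively key words
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

# Invented toy data: a "specialized" and a "reference" token list.
study = "price demand price cost the the".split()
reference = "the the study method the results".split()
print(keywords(study, reference))
```

A word with the same relative frequency in both corpora scores zero, so very common grammatical words only surface as keywords when the specialized corpus uses them unusually often.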
39. Collocation
The tendency of words to be biased in the way
they co-occur.
Statistical measurements of collocation are more
reliable than intuition, and for this reason a
corpus is essential.
40. Measurements of Collocation
Computer programs that calculate collocation
take a node word and count the instances of all words
occurring within a particular span.
Note that the count:
ignores punctuation marks;
counts ’s as a separate word;
ignores sentence boundaries.
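The counting procedure described on this slide can be sketched as follows. This is a minimal illustration, not the algorithm of any particular program: the span of 2 words on each side is an arbitrary choice, the example text is invented, and the tokenizer only handles the ASCII apostrophe.

```python
import re
from collections import Counter

def tokenize(text):
    # Punctuation is dropped; the clitic 's is kept as a token of its own.
    return re.findall(r"'s\b|[a-z]+", text.lower())

def collocates(text, node, span=4):
    """Count every word within `span` tokens to the left or right of the node word.
    Sentence boundaries are ignored because the text is one flat token list."""
    tokens = tokenize(text)
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            counts.update(left + right)
    return counts

text = "The price fell. Demand rose, and the firm's price rose."
counts = collocates(text, "price", span=2)
print(counts.most_common())
```

Here "demand" is counted as a collocate of "price" even though the two words are in different sentences, which is exactly the behaviour the slide warns about.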
41. Tagging and Parsing
Tagging is allocating a part-of-speech (POS) label
to each word in a corpus.
E.g. the word light is tagged as a verb, a noun
or an adjective each time it occurs in the corpus.
Parsing is analyzing the sentences in a corpus
into their constituent parts, that is, doing a grammatical analysis.
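Once a corpus is tagged, occurrences of a word can be retrieved by part of speech. The sketch below assumes a simple word/TAG notation and a small invented tag set (VERB, NOUN, ADJ, and so on); real tagged corpora use their own tag sets and formats, and the tags here were assigned by hand rather than by a tagger.

```python
from collections import Counter

# A hand-tagged toy text in word/TAG notation (tags are illustrative).
TAGGED = (
    "Please/VERB light/VERB the/DET lamp/NOUN ./PUNCT "
    "The/DET light/NOUN is/VERB dim/ADJ ./PUNCT "
    "She/PRON carried/VERB a/DET light/ADJ bag/NOUN ./PUNCT"
)

def parse_tagged(text):
    """Split 'word/TAG' items into (word, tag) pairs."""
    return [tuple(item.rsplit("/", 1)) for item in text.split()]

def tags_for(word, pairs):
    """Count how often a word form occurs with each tag."""
    return Counter(tag for w, tag in pairs if w.lower() == word)

pairs = parse_tagged(TAGGED)
print(tags_for("light", pairs))
```

In an untagged corpus a search for light returns all three uses together; with tags, the noun, verb and adjective uses can be counted or retrieved separately.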
42. Annotation
General term for tagging and parsing, and also
used to describe other kinds of categorisation that
may be performed on a corpus.
E.g.:
the annotation of a spoken corpus for
prosodic features;
the annotation of a corpus of learner English for
types of error;
annotation of anaphora and semantic annotation.
43. Software
Special software is used to analyze a corpus and
to search it for certain words or phrases.
For example:
• SARA for the BNC
• ICECUP for ICE Great Britain (ICE-GB)
• Concordancers can be used for the analysis of almost
any corpus.
44. Concordancer
One of the most frequently used concordancers is
‘WordSmith Tools’.
Its two most important tools are
Concord and WordList.
As an alternative to WordSmith, you can also use a
concordancer called ‘AntConc’, which can be
downloaded for free.
45. WordSmith Concord
Click on the WordSmith icon on the desktop to
open the program. Select Concord in order to
search a corpus for a certain word or phrase. You
can now choose a corpus and select the files of
the corpus you want to analyse.
46.
47. Some further options for entering a search word or
phrase:
By using the asterisk *, you can widen the scope of your
search. For example, entering going as a search word will
return only instances of going; entering going
to will return all instances of going to. If you type in go*, on the
other hand, you will get all words beginning with go-, e.g.
going, goes, gold. Searching for *ing, you will get all words
ending in -ing, e.g. swimming, dancing, sing.
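The effect of the asterisk can be imitated with a regular expression. The sketch below is an assumption about how such a wildcard behaves (the asterisk stands for any run of letters inside one word), not a description of how WordSmith implements it; the example sentence is invented.

```python
import re

def wildcard_to_regex(pattern):
    """Turn a search term with * wildcards into a whole-word regular expression."""
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(r"\b" + r"\w*".join(parts) + r"\b", re.IGNORECASE)

def search(pattern, text):
    """Return every match of the wildcard pattern, in text order."""
    return wildcard_to_regex(pattern).findall(text)

text = "She is going to sing; he goes swimming and dancing for gold."

print(search("going", text))
print(search("going to", text))
print(search("go*", text))
print(search("*ing", text))
```

As on the slide, go* also retrieves gold and *ing also retrieves sing, so wildcard results usually need manual checking.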
48. WordSmith WordList
The tool WordList generates word lists of the
selected text files and enables you to compare the
length of text files or corpora.
Moreover, you can use WordList to compare the
frequency of a word in different text files or
across genres and to identify common clusters.
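The kinds of information a word-list tool reports can be approximated in Python. This is a rough sketch, not a reproduction of WordList: the function names are hypothetical, the sample text is invented, and frequencies are normalized per 1,000 tokens so that texts of different lengths can be compared.

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def summary(text):
    """Text length in tokens (running words) and types (distinct words)."""
    tokens = tokenize(text)
    return {"tokens": len(tokens), "types": len(set(tokens))}

def per_thousand(word, text):
    """Relative frequency of a word per 1,000 tokens."""
    tokens = tokenize(text)
    return 1000 * tokens.count(word) / len(tokens) if tokens else 0.0

def clusters(text, n=2, min_count=2):
    """Word sequences of length n that occur at least min_count times."""
    tokens = tokenize(text)
    grams = Counter(zip(*(tokens[i:] for i in range(n))))
    return {" ".join(gram): count for gram, count in grams.items() if count >= min_count}

sample = "I am going to go. I am going to stay."
print(summary(sample))
print(per_thousand("going", sample))
print(clusters(sample))
```

Normalizing matters when comparing files: two occurrences in a 10-word text and two in a 10,000-word text are very different rates.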
49. AntConc Concordance tool
This tool shows the words or word strings you want to
analyse in their textual context.
Select the files you want to analyse: File > Open file(s)
Choose the tab "Concordance"
Type in a search word (“Search Term”, bottom left-hand
corner)
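What a concordance tool displays can be sketched as a key-word-in-context (KWIC) listing. The code below is an illustration only and does not reflect how AntConc works internally; the context width of 10 characters and the sample text are arbitrary choices.

```python
import re

def kwic(text, term, width=30):
    """Return one line per occurrence of `term`, with `width` characters
    of context on each side and the search term aligned in a column."""
    text = " ".join(text.split())  # collapse line breaks and repeated spaces
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append(f"{left:>{width}} [{match.group()}] {right:<{width}}")
    return lines

sample = "A corpus is a collection of texts. Each corpus has a purpose."
lines = kwic(sample, "corpus", width=10)
for line in lines:
    print(line)
```

Aligning the search term in one column is what lets the eye pick out recurring words immediately to its left and right.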
50.
51. More reliable than intuition.
Language patterns are easily identified.
Deconstruct texts to discover patterns.
Track the development of specific features in the
history of English.
Test hypotheses on specific language features
empirically.
Follow language acquisition properly.
Draw conclusions from large amounts of linguistic data.
Shows frequency rather than possibility.
Does not always give a complete picture.
52.
53. More communicative modes:
spoken corpora, interactional corpora (classroom
interactions, authentic interactions, etc.), multimodal
corpora, corpora of textbook materials, etc.
More text types and genres:
to cover text types which are
less represented in corpora (letters, emails, leaflets, TV
programs, book synopses, recipes, short notes, chat room
logs, etc.).
54. More longitudinal language data:
from beginners to advanced levels, from children to adults,
from L1 to L2.
More variables:
more language learning variables should be collected and
encoded at the time of corpus collection (proficiency, language
aptitude, motivation, more precise description of the task and of
temporal, social or situational settings, etc.).
More languages:
to counterbalance the predominance of Anglo-Saxon native and
learner corpora and to foster the computer-aided analysis of
different languages and language families.
55.
56. Prior to Corpus Linguistics it was difficult to note patterns of
use in language, since observing and tracking usage patterns
was a monumental task.
Scholars have used various types of corpora to gain insights
into changes related to language development, both in first
and second language situations.
Corpus Linguistics can tell us about language use and
how it varies in different situations.