Corpus linguistics

Mr Jitendra B. Patil
Assistant Professor of English
Pratap College Amalner
Dist – Jalgaon (Maharashtra)
Pin-425401 Mob.- 919421655091
Email- jitendrapca@gmail.com

Corpus (Latin) means ‘body’
Any body of text
A new approach to language study
Collects samples of text from various fields of
language use in a scientific and systematic way
Corpus: a statistically sampled language
database
Purposes: investigation, description, application, and
analyses relevant to all branches of linguistics

Indispensability of Corpus in Linguistics:
Due to its large structure, varied composition, wealth of
information, confirmed referential authenticity, wide
representation, easy usability and simple verifiability
Uses:
To verify earlier propositions and examples
To verify the logic of previously proposed definitions and
explanations

Corpus in Corpus Linguistics:
Holds special connotations
A large collection of linguistic data used as a starting point
of linguistic description
A body of language text in written and spoken form
Represents the varieties of language used in each and every field
of human interaction
Is preserved in machine-readable form
Enables all kinds of linguistic description and analysis

A corpus is a large collection of texts assumed to be
representative of a given language, dialect or other subset of
language, to be used for linguistic analyses.
A corpus is a large collection of pieces of language that are
selected and ordered according to some explicit linguistic
criteria in order to be used as samples of the language.
A corpus is a large collection of naturally occurring language
texts presented in machine-readable form, accumulated in a
scientific manner to characterize a particular variety or use of
language.

A corpus, which contains constituent pieces of language
that are documented as to their origin and provenance, is
encoded in a standard and homogeneous way for open-ended
retrieval tasks.
Linguists have always used the word ‘corpus’ to
describe a collection of naturally occurring examples of
language, consisting of anything from a few sentences to a set
of written texts or tape recordings, which have been collected
for linguistic study.

A corpus refers to:
Any body of text
A body of machine-readable text
A finite collection of machine-readable texts which are
sampled to be maximally representative of a language or
language variety.

Important Issues in Corpus Design:
Composition of a corpus
Usage potential of a corpus

A corpus should:
Faithfully represent both common and special linguistic features
of the language from which it is designed and developed
Be large enough to encompass samples of text from various
disciplines
Be a true replica of physical texts
Preserve the various forms of words, punctuation marks, spellings,
variations and other orthographic symbols used in the source text
Represent all varieties of linguistic usage in a proportional manner
Use authentic, referential and verifiable text samples
Enable users to use the language data in multiple tasks
Preserve texts in annotated and non-annotated form

Quantity:
No fixed parameter
The bigger the corpus, the better its authenticity and
reliability
Data from a variety of sources in large quantity
Refers to the sum total of the linguistic components
included
Electronically generated corpora contain millions of words
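
As an illustration of measuring quantity, the sketch below counts the running words (tokens) and distinct word forms (types) in a handful of text samples. The function name corpus_size and the very crude tokenizer (lowercased runs of letters and apostrophes) are assumptions made for this example only; real corpus tools use more careful tokenization.

```python
import re
from collections import Counter


def corpus_size(texts):
    """Return (token_count, type_count) for an iterable of plain-text samples."""
    counts = Counter()
    for text in texts:
        # Crude tokenizer (an assumption): lowercased runs of letters/apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return sum(counts.values()), len(counts)


# Two tiny, made-up samples standing in for real corpus texts.
samples = ["The cat sat on the mat.", "The dog sat too."]
tokens, types = corpus_size(samples)
print(tokens, types)  # 10 tokens, 7 types
```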

Quality:
Relates to authenticity
Collection from genuine communications
Depends on ideal restriction of the corpus collector's role
Databases should be drawn from actual reality
Interactional properties of casual and informal talk

Representativeness:
Proper representation of a broad range of material
Representative of maximum linguistic features
Authentic in representation of text variety
Maximally representative of demographical variables
Overall size of corpus to be set against the diversity of
sources
Random selection of text samples
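
One possible way to combine random selection with proportional representation of sources is sketched below. The function name proportional_sample, the source labels and the text identifiers are all hypothetical; because each quota is rounded, the number of texts picked may differ slightly from the requested total.

```python
import random


def proportional_sample(texts_by_source, total, seed=0):
    """Randomly pick about `total` texts, each source in proportion to its size."""
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    population = sum(len(texts) for texts in texts_by_source.values())
    chosen = {}
    for source, texts in texts_by_source.items():
        # Share of the sample given to this source, rounded to a whole number.
        quota = round(total * len(texts) / population)
        chosen[source] = rng.sample(texts, min(quota, len(texts)))
    return chosen


# Hypothetical collection: identifiers stand in for full texts.
texts_by_source = {
    "news": [f"news_{i}" for i in range(60)],
    "fiction": [f"fiction_{i}" for i in range(30)],
    "speech": [f"speech_{i}" for i in range(10)],
}
picked = proportional_sample(texts_by_source, total=10)
print({source: len(texts) for source, texts in picked.items()})
```

With these figures the sources contribute 6, 3 and 1 texts respectively, mirroring their 60/30/10 share of the collection.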

Simplicity:
Simple and plain text samples
Unbroken string of characters without any added
information
Separate Preservation of additional features
Separate storage of Extralinguistic information
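
The separation of plain text from extralinguistic information can be sketched as follows: the text goes into one file as an unbroken string, and the information about it goes into a companion file. The function name store_sample, the file naming scheme and the metadata fields are assumptions chosen for illustration, not an established corpus standard.

```python
import json
import tempfile
from pathlib import Path


def store_sample(folder, sample_id, text, metadata):
    """Save the plain text and its extralinguistic information in separate files."""
    folder = Path(folder)
    text_path = folder / f"{sample_id}.txt"        # text only, nothing added
    meta_path = folder / f"{sample_id}.meta.json"  # source, date, mode, etc.
    text_path.write_text(text, encoding="utf-8")
    meta_path.write_text(
        json.dumps(metadata, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return text_path, meta_path


# A temporary folder keeps the example self-contained.
with tempfile.TemporaryDirectory() as tmp:
    text_path, meta_path = store_sample(
        tmp,
        "sample001",
        "Corpus means body.",
        {"source": "hypothetical newspaper", "year": 2001, "mode": "written"},
    )
    stored_text = text_path.read_text(encoding="utf-8")
    stored_meta = json.loads(meta_path.read_text(encoding="utf-8"))

print(stored_text)
print(stored_meta["mode"])
```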

Equality:
Text samples with an equal number of words
Balance between spoken text samples and written text
samples
Collection of an equal amount of text from all sources
Balance in the quality of samples
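
A minimal sketch of taking samples with an equal number of words is given below, assuming whitespace-separated words and a hypothetical helper named equal_samples. Texts shorter than the chosen sample size are left out instead of being padded.

```python
def equal_samples(texts_by_source, words_per_sample):
    """Keep the first `words_per_sample` words of each text; skip shorter texts."""
    samples = {}
    for source, text in texts_by_source.items():
        words = text.split()  # whitespace tokenization (an assumption)
        if len(words) >= words_per_sample:
            samples[source] = " ".join(words[:words_per_sample])
    return samples


# Made-up texts of unequal length.
raw_texts = {
    "spoken": "well you know I was going to say that anyway",
    "written": "The committee approved the proposal after a long debate.",
    "note": "Too short",
}
balanced = equal_samples(raw_texts, words_per_sample=5)
for source, sample in balanced.items():
    print(source, "->", sample)
```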

Retrievability:
Easy retrievability of data by the end user
Techniques and tools for preserving data in electronic format
Accessibility for all
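
One common retrieval task is a keyword-in-context (concordance) search. The sketch below is a simplified illustration; the function name concordance, the bracket notation and the width parameter are assumptions for this example.

```python
def concordance(text, keyword, width=3):
    """List each occurrence of `keyword` with up to `width` words on either side."""
    words = text.split()
    lines = []
    for i, word in enumerate(words):
        # Compare case-insensitively, ignoring punctuation attached to the word.
        if word.strip(".,;:!?\"'").lower() == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append(f"{left} [{word}] {right}".strip())
    return lines


text = "A corpus is a body of text. Every corpus needs careful design."
for line in concordance(text, "corpus", width=2):
    print(line)
# A [corpus] is a
# text. Every [corpus] needs careful
```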

Verifiability:
Must be open to empirical verification
Reflective of actual patterns of language use
Authentic and valid in synchronic and diachronic studies

Augmentation:
Changeable with time
Can be synchronic
Can be diachronic

Documentation:
Separation of documentary information from the components
Meticulous documentation of extralinguistic information
Easy retrieval of extralinguistic information (annotated info)

Management:
Necessary scheme for maintenance, standardization,
augmentation and upgrading
Protection of data from virus infection
Displacement of corpus data
Conversion of corpus data across different formats
Adaptation to new hardware and software technology
