Mr Jitendra B. Patil
Assistant Professor of English
Pratap College Amalner
Dist – Jalgaon (Maharashtra)
Pin-425401 Mob.- 919421655091
Email- jitendrapca@gmail.com
• Corpus (Latin) means ‘body’
• Any body of text
• A new approach to language study
• Collects samples of text from various fields of language use in a scientific and systematic way
• Corpus: a statistically sampled language database
• Purposes: investigation, description, application, and analyses relevant to all branches of linguistics
Indispensability of Corpus in Linguistics:
Due to large structure, varied composition, huge
information, confirmed referential authenticity, wide
representation, easy usability and simple verifiability
Usages:
To verify earlier propositions and examples
To verify the logic of pre-proposed definitions and
explanations
Corpus in Corpus Linguistics:
Holds special connotations
A large collection of linguistic data used as a starting point
of linguistic description
A body of language text in written and spoken form
Represents varieties of language used in each and every field
of human interaction
Preserved in machine-readable form
Enables all kinds of linguistic description and analysis
Corpus means a large collection of texts assumed to be
representative of a given language, dialect or other subset of
language, to be used for linguistic analyses.
Corpus is a large collection of pieces of language that are
selected and ordered according to some explicit linguistic
criteria in order to be used as samples of the language.
Corpus is a large collection of naturally occurring language
texts presented in machine-readable form accumulated in
scientific manner to characterize a particular variety or use of
language.
A corpus, which contains constituent pieces of language
that are documented as to their origin and provenance, is
encoded in a standard and homogeneous way for open-ended
retrieval tasks.
Linguists have always used the word ‘corpus’ to
describe a collection of naturally occurring examples of
language, consisting of anything from a set of written texts
to tape recordings, which have been collected for linguistic
study.
A corpus refers to:
Any body of text
A body of machine-readable text
A finite collection of machine-readable texts which are
sampled to be maximally representative of a language or
language variety.
Important Issues in Corpus Designing:
Composition of a corpus
Usage potential of a corpus
A corpus should:
Faithfully represent both common and special linguistic features
of the language from which it is designed and developed
Be large enough to encompass samples of text from various
disciplines
Be a true replica of physical texts
Preserve various forms of words, punctuation marks, spellings,
variations and other orthographic symbols used in the source text.
Represent all linguistic usage varieties in a proportional manner
Use authentic, referential and verifiable text samples
Enable user to use language data in multiple tasks
Preserve texts in annotated and non-annotated form
Quantity:
No fixed parameter
The bigger the corpus, the better its authenticity and
reliability
Data from a variety of sources in large quantity
Refers to the sum of the total linguistic component
included
Electronic corpus generation contains millions of words
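The idea of quantity as the sum of the linguistic components included can be illustrated with a small count of tokens (running words) and types (distinct words). This is only a minimal sketch: the function name `corpus_size`, the sample texts and the rough regular-expression tokenizer are assumptions made for illustration, not part of the original slides.

```python
import re
from collections import Counter

def corpus_size(texts):
    """Return (token count, type count) for an iterable of text strings."""
    counts = Counter()
    for text in texts:
        # Rough tokenizer (an assumption): lowercase, then take runs of
        # word characters and apostrophes as word tokens.
        counts.update(re.findall(r"[\w']+", text.lower()))
    return sum(counts.values()), len(counts)

# Hypothetical two-sample corpus.
samples = ["The cat sat on the mat.", "A dog sat too."]
tokens, types = corpus_size(samples)
print(tokens, types)
```

A real corpus would need a tokenizer suited to the language concerned; the count itself works the same way.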
Quality:
Relates to authenticity
Collection from genuine communications
Depends on ideal restriction of the corpus collector's role
Databases should be drawn from actual reality
Interactional properties of casual and informal talks
Representativeness:
Proper representation of a broad range of material
Representative of maximum linguistic features
Authentic in representation of text variety
Maximally representative of demographical variables
Overall size of corpus to be set against the diversity of
sources
Random selection of text samples
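Random selection set against the diversity of sources could be sketched as proportional random sampling: each category of text contributes to the sample according to its share of the available material. The function name `proportional_sample`, the category names and the file labels below are hypothetical, and rounding each quota is one possible design choice among several.

```python
import random

def proportional_sample(texts_by_category, total, seed=0):
    """Randomly pick about `total` texts, each category in proportion to its size."""
    rng = random.Random(seed)  # fixed seed so the selection can be reproduced
    population = sum(len(texts) for texts in texts_by_category.values())
    sample = {}
    for category, texts in texts_by_category.items():
        # Quota proportional to the category's share, capped at what is available.
        quota = min(round(total * len(texts) / population), len(texts))
        sample[category] = rng.sample(texts, quota)
    return sample

# Hypothetical pool of source texts, identified by label only.
sources = {
    "news": [f"news_{i}" for i in range(60)],
    "fiction": [f"fiction_{i}" for i in range(30)],
    "speech": [f"speech_{i}" for i in range(10)],
}
picked = proportional_sample(sources, total=20)
print({category: len(texts) for category, texts in picked.items()})
```

Because quotas are rounded independently, the overall sample size may differ slightly from `total` for less even proportions.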
Simplicity:
Simple and plain text samples
Unbroken string of characters without any added
information
Separate preservation of additional features
Separate storage of extralinguistic information
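One way to picture the simplicity principle is to keep the plain text and the extralinguistic information in two separate stores linked only by a sample identifier. The sketch below assumes each record is a dictionary with hypothetical `id` and `text` keys plus arbitrary metadata fields; none of these names come from the original slides.

```python
def split_text_and_metadata(records):
    """Separate plain text from extralinguistic information, keyed by sample id."""
    texts, metadata = {}, {}
    for record in records:
        sample_id = record["id"]
        # The text store holds only the unbroken string of characters.
        texts[sample_id] = record["text"]
        # Everything else (speaker, date, genre, ...) goes to the metadata store.
        metadata[sample_id] = {
            key: value for key, value in record.items() if key not in ("id", "text")
        }
    return texts, metadata

# Hypothetical records.
records = [
    {"id": "s001", "text": "Good morning everyone.", "mode": "spoken", "year": 2010},
    {"id": "w001", "text": "The report was published.", "mode": "written", "year": 2011},
]
texts, metadata = split_text_and_metadata(records)
print(texts["s001"])
print(metadata["s001"])
```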
Equality:
Text samples with an equal number of words
Balance between spoken text samples and written text
samples
Collection of equal amount of text from all sources
Balance in case of quality of samples
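Equal-sized samples could be drawn by cutting the same number of words from every source and skipping sources that are too short. This is a minimal sketch under stated assumptions: the function name `equal_samples`, whitespace-based word splitting and the source labels are all illustrative.

```python
def equal_samples(texts_by_source, n_words):
    """Take the first `n_words` words from each source; skip sources that are too short."""
    samples = {}
    for source, text in texts_by_source.items():
        words = text.split()  # whitespace splitting is a simplifying assumption
        if len(words) >= n_words:
            samples[source] = " ".join(words[:n_words])
    return samples

# Hypothetical sources, one spoken and two written.
sources = {
    "spoken_interview": "well I think the village has changed a lot since then",
    "written_essay": "The village has changed considerably over the last decade",
    "written_note": "Too short",
}
samples = equal_samples(sources, n_words=5)
for source, sample in samples.items():
    print(source, "->", sample)
```

Taking the opening words is the simplest choice; a random starting position within each text would serve equally well.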
Retrievability:
Easy retrievability of data by the end user
Techniques and tools for preserving data in electronic format
Accessibility for all
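A common retrieval task is a keyword-in-context (KWIC) concordance, which lists every occurrence of a word with a few words on either side. The sketch below is only illustrative: the function name `concordance`, the `width` parameter and the whitespace tokenization are assumptions, not a description of any particular corpus tool.

```python
def concordance(text, keyword, width=3):
    """Return (left context, match, right context) for each occurrence of `keyword`."""
    tokens = text.split()
    hits = []
    for i, token in enumerate(tokens):
        # Compare case-insensitively after stripping surrounding punctuation.
        if token.strip(".,;:!?\"'").lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits

text = "The corpus is large. A corpus must be representative."
hits = concordance(text, "corpus", width=2)
for left, match, right in hits:
    print(f"{left:>10} [{match}] {right}")
```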
Verifiability:
Must be open to empirical verification
Reflective of actual patterns of language use
Authentic and valid in synchronic and diachronic studies
Augmentation:
Changeable with time
Can be synchronic
Can be diachronic
Documentation:
Separation of documentary information from the components
Meticulous documentation of extralinguistic information
Easy retrieval of extralinguistic information (annotated info)
Management:
Necessary scheme for maintenance, standardization,
augmentation and upgrading
Preservation of data from virus infection
Displacement of corpus data
Conversion of Corpus data across different formats
Adaptation of new hardware and software technology
Corpus linguistics
