Corpus Linguistics: Analytical Tools


Prepared By
Mr. Jitendra B. Patil
Assistant Professor of English
Pratap College Amalner
Dist – Jalgaon (Maharashtra)
Pin-425401 Mob.- 919421655091
Email- jitendrapca@gmail.com

WordCruncher (WC)

 Used widely since 1980
 Produced originally by Brigham Young University, Utah
 Can provide fast retrieval from large corpora
 Has two separate programs
 WC Index, a batch process to index a text file or corpus
 Produces a series of annotated files
 Runs on plain ASCII files
 Early versions took about 20 minutes to index 100k files

 WC View runs as a menu to locate pre-indexed data
 Provides fast retrieval of all tokens of morphemes
 WC offers many options for the amount of context shown with each hit
 From a single line to about fifty lines
 Good for rapid exploration of text
 Not flexible for sorting and formatting output of analyses
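The index-then-view split described above can be sketched in a few lines. This is only an illustration of the general idea (index once, then look up hits with a chosen number of context lines); the function names `build_index` and `view` are hypothetical and do not reflect WordCruncher's actual file formats or interface.

```python
from collections import defaultdict
import re

def build_index(lines):
    """Batch step: map each lowercased word to the line numbers where it occurs."""
    index = defaultdict(list)
    for line_no, line in enumerate(lines):
        # set() so a word repeated within one line is recorded only once for that line
        for word in set(re.findall(r"[a-z']+", line.lower())):
            index[word].append(line_no)
    return index

def view(lines, index, word, context=1):
    """Lookup step: return each hit with `context` lines before and after it."""
    hits = []
    for line_no in index.get(word.lower(), []):
        start = max(0, line_no - context)
        end = min(len(lines), line_no + context + 1)
        hits.append(lines[start:end])
    return hits

text = [
    "The cat sat on the mat.",
    "A dog barked at the door.",
    "The cat ignored the dog.",
    "Rain fell all night.",
]
index = build_index(text)
for passage in view(text, index, "dog", context=0):
    print(passage)
```

Raising `context` widens each hit, which corresponds to the single-line to fifty-line range mentioned above.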

TACT
(TEXT ANALYSIS COMPUTING TOOLS)

 Research oriented software for corpus analyses
 Developed at the University of Toronto
 First released in 1989
 A system of 15 programs for MS-DOS
 Supports the extended ASCII character set of the IBM PC
 The TACT system is multilingual
 Designed for text retrieval and analysis of literary works

 Used to retrieve occurrences of a word, word pattern, or word combination
 Output takes the form of a concordance, a list, or a table
 Can do simple kinds of analysis, such as sorted frequencies of letters, words or phrases, and type-token statistics
 Intended for individual literary texts, or small to mid-size groups of such texts
 Processing a text with TACT normally begins with tagging or marking up an
ASCII copy of the text
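The simple analyses listed above (sorted word frequencies and type-token statistics) can be illustrated with a minimal sketch. It does not reproduce TACT itself; the tokenisation rule and the name `word_stats` are assumptions made for the example.

```python
from collections import Counter
import re

def word_stats(text):
    """Return a word-frequency Counter and the type-token ratio of `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    freq = Counter(tokens)
    # types = distinct word forms, tokens = running words
    ttr = len(freq) / len(tokens) if tokens else 0.0
    return freq, ttr

sample = "the cat saw the dog and the dog saw the cat"
freq, ttr = word_stats(sample)

# Sort by descending frequency, then alphabetically
for word, count in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{count:3d}  {word}")
print(f"type-token ratio: {ttr:.3f}")
```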

 A text editor is used to insert these tags, usually within diamond-bracket (angle-bracket) delimiters
 Mark-up helps one to refine word selections
 It can mark proper names (of people and places), episodes, date, location, audience, narrative mode, theme, etc.
 Four programs can be used to add tags to each word of the ASCII text: Preproc, Makedct, Tagtext, and Satdct
 With other font-editing tools, its capabilities can be extended to other modern European languages, such as French, German, and Greek.
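How mark-up refines word selection can be shown with a small sketch. The tag form used here, `<name value>`, is a hypothetical stand-in chosen for illustration and should not be read as TACT's exact tag syntax; `annotate` is likewise an invented helper.

```python
import re

# Hypothetical tag form: <name value>, e.g. <speaker Hamlet>
TAG = re.compile(r"<(\w+)\s+([^>]+)>")

def annotate(marked_up_text):
    """Return (word, tags) pairs, where tags is the mark-up in force at that word."""
    current = {}
    records = []
    # The capture group keeps the tags themselves in the split result
    for piece in re.split(r"(<[^>]+>)", marked_up_text):
        if piece.startswith("<"):
            m = TAG.fullmatch(piece)
            if m:
                current[m.group(1)] = m.group(2).strip()
            continue
        for word in re.findall(r"[A-Za-z']+", piece):
            records.append((word.lower(), dict(current)))
    return records

sample = "<speaker Hamlet> To be or not to be <speaker Ophelia> Good my lord"
records = annotate(sample)
hamlet_words = [w for w, tags in records if tags.get("speaker") == "Hamlet"]
print(hamlet_words)
```

Once every word carries its tags, a search can be restricted to, say, one speaker or one episode before counting or concordancing.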

LEXA: Corpus Processing Software

 A set of programmes to process linguistically relevant data
 Divided into several groups which perform typical functions
 The first of these is lexical analysis
 Lexa allows one to tag and lemmatize any text or series of texts with a minimum of effort.
 The user specifies which (possible) words are to be assigned to which lemmas
 Flexibility in design is given highest priority

 Flexibility:
 The number of items is user-determinable
 The structure of each programme is user-friendly
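The idea of a user-specified assignment of word forms to lemmas can be sketched as a simple lookup table. This is an assumption-laden illustration, not Lexa's own file format or procedure; `lemma_table` and `lemmatize` are hypothetical names.

```python
import re

def lemmatize(text, lemma_table):
    """Pair each token with its user-assigned lemma; unlisted forms map to themselves."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [(tok, lemma_table.get(tok, tok)) for tok in tokens]

# The user decides which word forms belong to which lemma
lemma_table = {"went": "go", "goes": "go", "going": "go", "mice": "mouse"}

pairs = lemmatize("The mice went home", lemma_table)
for form, lemma in pairs:
    print(f"{form}\t{lemma}")
```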

CWB (IMS Open Corpus Workbench)

 A widely used architecture for corpus analysis
 originally designed at the IMS, University of Stuttgart
 consists of a set of tools for indexing, managing and querying very large corpora
with multiple layers of word-level annotation.
 CWB’s central component is the Corpus Query Processor (CQP)
 CQP is an extremely powerful and efficient concordance system implementing a flexible two-level search

 CQP allows complex query patterns to be specified
 at the level of an individual word or annotation
 at the level of a fully- or partially-specified pattern of tokens
 Several key improvements were made to the CWB core:
 (i) support for multiple character sets, including Unicode (in the form of UTF-8)
 (ii) support for powerful Perl-style regular expressions in CQP queries, based on the open-source PCRE library
 (iv) support for larger corpus sizes of up to 2 billion words on 64-bit
platforms.
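The two-level search described above (constraints on a single token's annotations, then a pattern over consecutive tokens) can be illustrated with a toy matcher. This is a sketch of the concept only: it uses Python's `re` module rather than PCRE, it is not CQP's query language, and the names `token_matches` and `find_sequences` are hypothetical.

```python
import re

def token_matches(token, constraint):
    """Level 1: every attribute regex must match the whole attribute value."""
    return all(
        re.fullmatch(pattern, token.get(attr, ""))
        for attr, pattern in constraint.items()
    )

def find_sequences(corpus, query):
    """Level 2: return start positions where consecutive tokens satisfy the query."""
    n = len(query)
    return [
        i
        for i in range(len(corpus) - n + 1)
        if all(token_matches(corpus[i + j], query[j]) for j in range(n))
    ]

# Each token carries several layers of word-level annotation
corpus = [
    {"word": "The", "pos": "DT"},
    {"word": "old", "pos": "JJ"},
    {"word": "house", "pos": "NN"},
    {"word": "and", "pos": "CC"},
    {"word": "a", "pos": "DT"},
    {"word": "new", "pos": "JJ"},
    {"word": "houses", "pos": "NNS"},
]

# An adjective followed by a noun whose word form is "house" or "houses"
query = [{"pos": "JJ"}, {"word": "houses?", "pos": "NN.*"}]
print(find_sequences(corpus, query))
```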

 CWB, the IMS Open Corpus Workbench, is somewhat misleadingly named, as it is not in any sense a comprehensive or general “workbench” for corpus linguistics
 Instead, it is a powerful and flexible system for indexing and searching corpus data
 CWB actually consists of three different software packages:
 (i) the CWB core, including the low-level Corpus Library (CL), the CWB
utilities, and the Corpus Query Processor (CQP)

 (ii) the CWB/Perl interface, itself divided into three separate Perl packages, namely CWB, CWB-CL and CWB-Web
 (iii) CQPweb, the most recent addition

MicroConcord

The type of computer-generated concordance produced by MicroConcord (the KWIC, or "keyword-in-context" index) evolved in the late 1950s.
MicroConcord searches the text of five plays in under a minute.
It is a concordance program developed specifically for the language teacher and learner.
MicroConcord is a well-designed basic concordancer.
It is useful for a variety of applications and is valued for its robustness and simplicity.
It is suitable for novices and for classroom use.

MicroConcord's user interface is simple and intuitive.
The user specifies the search word(s), a directory containing texts to be searched, and the text files, with an option to select up to 500 files from 963 directories.
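A keyword-in-context (KWIC) display of the kind discussed above can be sketched as follows. This is a minimal illustration rather than MicroConcord's implementation: it searches a single string instead of files in a directory, and the function name `kwic` and its `width` parameter are assumptions.

```python
import re

def kwic(text, keyword, width=20):
    """Return keyword-in-context lines with the keyword aligned in a central column."""
    lines = []
    pattern = r"\b" + re.escape(keyword) + r"\b"
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so every keyword starts in the same column
        lines.append(f"{left:>{width}} [{m.group(0)}] {right:<{width}}")
    return lines

sample = "To be, or not to be, that is the question."
for line in kwic(sample, "be", width=10):
    print(line)
```

Aligning the keyword in one column is what lets a reader scan the left and right contexts for recurring patterns.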
