Detailed presentation on various analytical tools widely used in Corpus Linguistics for corpus analysis, including WORDCRUNCHER, LEXA, CWB, TACT, MICROCONCORD, etc.
2. Prepared By
Mr. Jitendra B. Patil
Assistant Professor of English
Pratap College Amalner
Dist – Jalgaon (Maharashtra)
Pin-425401 Mob.- 919421655091
Email- jitendrapca@gmail.com
4. Used widely since 1980
Produced originally by Brigham Young University, Utah
Can provide fast retrieval of large corpora
Has two separate programs
WC Index batch process – to index a text file or corpus
Produces a series of annotated files
Runs on plain ASCII file
Early versions took about 20 minutes to index 100k files
5. WC View runs as a menu to locate pre-indexed data
Provides fast retrieval of all tokens of morphemes
WC can provide many options for the amount of context
From a single line to about fifty lines
Good for rapid exploration of text
Not flexible for sorting and formatting output of analyses
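The two-stage workflow described above (index once, then retrieve quickly with a chosen amount of context) can be sketched roughly as follows. This is only an illustration of the general idea, not WordCruncher's actual file format or algorithm; the function names build_index and view and the sample text are hypothetical.

```python
import re
from collections import defaultdict

def build_index(lines):
    """Index step: map each lowercased word to the line numbers where it occurs."""
    index = defaultdict(list)
    for line_no, line in enumerate(lines):
        # set() so a word repeated on one line is recorded once for that line
        for word in set(re.findall(r"[a-z']+", line.lower())):
            index[word].append(line_no)
    return index

def view(lines, index, word, context=1):
    """View step: return each hit with `context` lines before and after it."""
    hits = []
    for line_no in index.get(word.lower(), []):
        start = max(0, line_no - context)
        end = min(len(lines), line_no + context + 1)
        hits.append(lines[start:end])
    return hits

# Hypothetical sample text, one string per line of the corpus
text = [
    "The whale surfaced at dawn.",
    "Nobody on deck saw it.",
    "Then the whale dived again.",
    "The sea was calm.",
]
index = build_index(text)
for passage in view(text, index, "whale", context=1):
    print(" / ".join(passage))
```

Because the index is built once, each later lookup only touches the lines that contain the word, which is why pre-indexed retrieval is fast even on large corpora.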
7. Research-oriented software for corpus analyses
Developed at University of Toronto
First released in 1989
a system of 15 programs for MS-DOS
supports the extended ASCII character set of the IBM PC
The TACT system is multilingual
is designed to do text-retrieval and analysis on literary works
8. is used to retrieve occurrences of a word, word pattern, or word combination
Output: in the form of a concordance, a list, or a table
can do simple kinds of analysis, such as sorted frequencies of letters, words
or phrases, and type-token statistics
is intended for individual literary texts, or small to mid-size groups of such
texts
Processing a text with TACT normally begins with tagging or marking up an
ASCII copy of the text
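The simple analyses mentioned on this slide (sorted word frequencies and type-token statistics) can be illustrated with a short sketch. It does not reproduce TACT's own programs; the function name word_stats, the tokenisation rule, and the sample sentence are assumptions made for the example.

```python
import re
from collections import Counter

def word_stats(text):
    """Return word frequencies and the type-token ratio of a text."""
    # Assumed tokenisation: runs of letters and apostrophes, lowercased
    tokens = re.findall(r"[a-z']+", text.lower())
    freq = Counter(tokens)
    # Types are distinct word forms; tokens are all running words
    ttr = len(freq) / len(tokens) if tokens else 0.0
    return freq, ttr

sample = "the cat sat on the mat and the dog sat too"
freq, ttr = word_stats(sample)

# Sorted frequency list, most frequent first
for word, count in freq.most_common(3):
    print(word, count)
print(f"type-token ratio: {ttr:.2f}")
```

The sample has 11 tokens and 8 types, so the ratio is 8/11.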
9. a text editor is used to insert these tags, usually within diamond-bracket delimiters
mark-up helps one to refine word-selections
mark proper names (of people and places), episodes, date, location, audience,
narrative mode, theme, etc.
four programs can be used: Preproc, Makedct, Tagtext, and Satdct, to add tags
to each word of the ASCII text
with other font-editing tools, its capabilities can be extended to other modern
European languages, such as French, German, and Greek.
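How diamond-bracket mark-up can refine word selections may be easier to see in a small sketch. The tag layout used here, a name followed by a value inside angle brackets, is an assumption for illustration and is not taken from the TACT manual; tagged_words and the sample line are hypothetical.

```python
import re

# Assumed tag shape: <name value>, e.g. <speaker Hamlet>
TAG = re.compile(r"<(\w+)\s+([^>]+)>")

def tagged_words(text):
    """Return (word, context) pairs, where context holds the tags in force."""
    context = {}
    result = []
    for piece in re.findall(r"<[^>]+>|[A-Za-z']+", text):
        match = TAG.fullmatch(piece)
        if match:
            # A tag updates the running context for the words that follow it
            context[match.group(1)] = match.group(2).strip()
        elif not piece.startswith("<"):
            result.append((piece, dict(context)))
    return result

sample = "<speaker Hamlet> To be or not to be <speaker Ophelia> Good my lord"
words = tagged_words(sample)

# Refine the selection: only words spoken by one character
hamlet = [w for w, ctx in words if ctx.get("speaker") == "Hamlet"]
print(hamlet)
```

The same approach would extend to other tag names such as episode, date, or location, each narrowing the set of words a query returns.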
11. A set of programmes to process linguistically relevant data
is divided into several groups which perform typical functions
the first of these: lexical analysis
Lexa allows one to tag and lemmatize any text or series of texts with a
minimum of effort.
the user specifies what (possible) words are to be assigned to what lemmas
flexibility in design is given highest priority
12. flexibility:
the number of items is user-determinable
the structure of each programme is user-friendly
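The idea that the user specifies which word forms are assigned to which lemmas can be sketched as a simple lookup table. This is not LEXA's file format or interface; the table LEMMAS and the functions build_lookup and lemmatize are hypothetical names chosen for the example.

```python
# User-specified table: lemma -> word forms that belong to it (hypothetical data)
LEMMAS = {
    "go": ["go", "goes", "going", "went", "gone"],
    "be": ["be", "am", "is", "are", "was", "were", "been"],
}

def build_lookup(lemma_table):
    """Invert the table so each word form points to its lemma."""
    return {form: lemma for lemma, forms in lemma_table.items() for form in forms}

def lemmatize(tokens, lookup):
    """Pair each token with its lemma; unknown forms fall back to themselves."""
    return [(tok, lookup.get(tok.lower(), tok.lower())) for tok in tokens]

lookup = build_lookup(LEMMAS)
print(lemmatize("She went home and was gone".split(), lookup))
```

Keeping the table in the user's hands is what makes the approach flexible: the number of lemmas and forms is limited only by what the user chooses to list.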
14. a widely-used architecture for corpus analysis
originally designed at the IMS, University of Stuttgart
consists of a set of tools for indexing, managing and querying very large corpora
with multiple layers of word-level annotation.
CWB's central component: the Corpus Query Processor (CQP)
CQP: an extremely powerful and efficient concordance system implementing a
flexible two-level search
15. CQP allows complex query patterns to be specified
at the level of an individual word or annotation
at the level of a fully- or partially-specified pattern of tokens
Several key improvements were made to the CWB core:
(i) support for multiple character sets, including Unicode (in the form of UTF-8)
(ii) support for powerful Perl-style regular expressions in CQP queries, based
on the open-source PCRE library
16. (iv) support for larger corpus sizes of up to 2 billion words on 64-bit
platforms.
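The two-level search described for CQP (regular-expression constraints on a single token's annotations, combined into a pattern over a sequence of tokens) can be imitated in a few lines. This is a sketch of the concept only and does not use or reproduce CWB; the attribute names word and pos, the functions token_matches and find_sequences, and the tiny corpus are assumptions. In CQP's own notation a comparable query would look roughly like [pos="JJ"] [word="houses?" & pos="NN.*"], depending on which attributes the corpus actually defines.

```python
import re

def token_matches(token, constraints):
    """Level 1: every attribute of one token must fully match its regex."""
    return all(
        re.fullmatch(regex, token.get(attr, "")) is not None
        for attr, regex in constraints.items()
    )

def find_sequences(corpus, pattern):
    """Level 2: find start positions where consecutive tokens fit the pattern."""
    hits = []
    for start in range(len(corpus) - len(pattern) + 1):
        window = corpus[start:start + len(pattern)]
        if all(token_matches(tok, cons) for tok, cons in zip(window, pattern)):
            hits.append(start)
    return hits

# Hypothetical annotated corpus: one dict per token
corpus = [
    {"word": "the", "pos": "DT"},
    {"word": "old", "pos": "JJ"},
    {"word": "houses", "pos": "NNS"},
    {"word": "and", "pos": "CC"},
    {"word": "a", "pos": "DT"},
    {"word": "new", "pos": "JJ"},
    {"word": "house", "pos": "NN"},
]

# An adjective followed by "house" or "houses" tagged as a noun
pattern = [{"pos": "JJ"}, {"word": "houses?", "pos": "NN.*"}]
for start in find_sequences(corpus, pattern):
    print(" ".join(tok["word"] for tok in corpus[start:start + len(pattern)]))
```

This linear scan is fine for a toy corpus; a real system relies on its indexes rather than testing every position.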
17. CWB, the IMS Open Corpus Workbench, is somewhat misleadingly named
as it is not in any sense a comprehensive or general “workbench” for corpus
linguistics
Instead, it is a powerful and flexible system for indexing and searching corpus
data
CWB actually consists of three different software packages:
(i) the CWB core, including the low-level Corpus Library (CL), the CWB
utilities, and the Corpus Query Processor (CQP)
18. (ii) the CWB/Perl interface – itself divided into three separate Perl packages,
namely CWB, CWB-CL and CWB-Web
(iii) CQPweb, the most recent addition
20. The type of computer-generated concordance produced by MicroConcord (the
KWIC, or "keyword-in-context" index) evolved in the late 1950s
MicroConcord searches the text of five plays in under a minute
a concordance program which has been developed specifically for the language
teacher/learner.
MicroConcord is a well-designed basic concordancer
useful for a variety of applications, and valued for its robustness and simplicity
Suitable for novices and for classroom use.
21. MicroConcord's user interface is simple and intuitive
the user specifies search word(s), a directory containing texts to be searched, and
the text files, with an option to select up to 500 files from 963 directories
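A keyword-in-context (KWIC) display of the kind described above, produced from a search word and a set of texts, can be sketched as follows. This is not MicroConcord's implementation; the function kwic, the width parameter, and the file names and texts are hypothetical, and the texts are held in a dictionary rather than read from a directory so that the example stays self-contained.

```python
import re

def kwic(texts, keyword, width=20):
    """Return one aligned concordance line per occurrence of `keyword`."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for name, text in texts.items():
        flat = " ".join(text.split())  # collapse line breaks and extra spaces
        for match in pattern.finditer(flat):
            left = flat[max(0, match.start() - width):match.start()]
            right = flat[match.end():match.end() + width]
            # Right-align the left context so the keywords line up in a column
            lines.append(f"{name:<10} {left:>{width}} [{match.group()}] {right:<{width}}")
    return lines

# Hypothetical texts standing in for the files in a directory
texts = {
    "play1.txt": "To be, or not to be: that is the question.",
    "play2.txt": "The question was never asked.",
}
for line in kwic(texts, "question", width=15):
    print(line)
```

Aligning the keyword in a central column is what lets a teacher or learner scan many occurrences at a glance and compare the words that surround them.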