Introduction to Corpus Linguistics
By Karimli Vuqar
What is a Corpus?
• Definition
• Why are they used?
• What are they considered to be?
– Method vs Theory
• Types of corpora?
– Monolingual Vs. Multilingual
– Parallel Vs. Translated
Corpus Linguistics
• CL history
– 1960: 1st generation, e.g. Brown
– 1975: 2nd generation, e.g. Cobuild
– 1990: 3rd generation, e.g. BoE (Bank of English)
• Roots
– CL and Linguistics
• Comparative linguistics
• Syntax and semantics
– Chomskyan revolution
• Technology and the progress of CL
• Benefits of CL
• Problems of CL
Building the Corpora
• General corpora
– E.g. BNC, The Brown Corpus
• Specialized corpora
– How corpora are used (written, spoken)
– Materials for creating the corpora (newspapers, books, documents, etc.)
– General (social, science, art, etc.)
• Multilingual corpora – Parallel corpora
• Learner corpora (e.g. International Corpus of Learner English)
• Monitor Corpus (The Bank of English)
• Historical Corpus
Advantages and Disadvantages
• More reliable than intuition
• Language patterns are easily identified
• Deconstruct texts to discover patterns
• Track the development of specific features in the
history of English
• Test hypotheses on specific language features
empirically
• Follow language acquisition properly
• Draw conclusions from large amounts of linguistic data
• Not always a complete picture
• Shows frequency rather than possibility
CL terminology
• Concordance
– Where and in what context?
– Frequency
• Annotation
– Mark-up
• Tagging
– POS tagging
– Syntactic Treebank
– Semantic tagging
• Coding
• Metadata
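To make the concordance and frequency terms concrete, here is a minimal sketch of a keyword-in-context (KWIC) lookup and a word frequency count. The sample sentence, the simple regex tokenizer, and the function names are assumptions for illustration only; real concordancers handle tokenization and markup far more carefully.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def concordance(tokens, node, width=3):
    """Return KWIC tuples: up to `width` tokens of left and right context per hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Hypothetical sample text standing in for a corpus
sample = "The cat sat on the mat. The dog chased the cat, and the cat ran away."
tokens = tokenize(sample)

freq = Counter(tokens)             # frequency: how often each word occurs
kwic = concordance(tokens, "cat")  # concordance: where and in what context

for left, node, right in kwic:
    print(f"{left:>20} [{node}] {right}")
```

Aligning the node word in a column, as the print loop does, is what lets recurring patterns to its left and right stand out when scanning many lines.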
Famous Corpora
Credits: Nadja Nesselhauf
Corpora and Translation
• Corpus translation studies (CTS)
• Descriptive translation
• Equivalence
• Corpus-based translation
• The process Vs the product
• The third code
• Simplification Vs normalization
Methods of Research in CL
• Quantitative
• Qualitative
– Context
• Quantitative and Qualitative
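One common quantitative step is to normalise raw counts so that corpora of different sizes can be compared. The sketch below assumes a per-million-words base; the counts and corpus sizes are hypothetical and are not taken from any real corpus.

```python
def per_million(raw_count, corpus_size):
    """Normalised frequency: occurrences per one million tokens."""
    return raw_count * 1_000_000 / corpus_size

# Hypothetical figures: a word occurs 150 times in a 1-million-word corpus
# and 9,000 times in a 100-million-word corpus.
small_corpus_rate = per_million(150, 1_000_000)
large_corpus_rate = per_million(9_000, 100_000_000)

print(small_corpus_rate, large_corpus_rate)
```

Although the raw count is far higher in the larger corpus, the normalised rate is lower there, which is why raw counts alone can mislead.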
Corpus Software
• AntConc: a freeware concordance and text analysis program
• MICASE: Michigan Corpus of Academic
Spoken English
• TACT: Text Analysis Computing Tools
• TACTWeb: a concordance program based on
TACT but for the Web
• SARA: the concordance program written
specifically for the British National
Corpus
Corpus Software Continued
• BNCweb: a web-based client program for searching and retrieving lexical, grammatical and textual data from the
British National Corpus (BNC). It relies on the Corpus Query Processor (CQP) of the IMS Open Corpus Workbench
to provide a convenient interface between the user and the rich variety of annotated text in the 100-million-word
BNC in its most recent incarnation, the XML version.
• BNC Web Index: the web front end to David Lee's BNC Index spreadsheet. For an introduction to BNC Index, please
see David's web site.
• CLAWS: part-of-speech tagging software for English.
• Clustertool: allows you to perform Hierarchical Agglomerative Cluster Analysis on your own data.
• CQPweb: an extension of BNCweb but designed for use with any corpus.
• LL Calculator: calculates log-likelihood values from a 2x2 contingency table. LL is a more reliable alternative to the
standard Pearson's chi-squared test; see Dunning (1993).
• LWAC: a tool for constructing corpora from web data.
• Sentrick: stream-oriented Java library and a set of command line tools for high quality sentence boundary detection
(sentence segmentation / splitting / disambiguation). Currently has one model for German (trained on general text
and Wikipedia lynx dumps).
• SigTest: Flexible Significance Test System. Chi-squared test, log-likelihood test and Fisher exact test for any kind of
contingency table, using R.
• USAS: semantic tagger developed for English and extended to Finnish and Russian.
• VARD: Variant Detector software that facilitates the pre-processing of corpora for normalisation of spelling variation (e.g.
Early Modern English).
• Wmatrix: a corpus comparison and annotation tool incorporating CLAWS and USAS in a web front end.
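The log-likelihood calculation mentioned for the LL Calculator can be sketched as follows. This is a minimal illustration of the general G2 statistic summed over all four cells of a 2x2 contingency table; it is an assumption that this matches the general idea, and the actual tool may use a different or simplified formula. The function name and the example counts are hypothetical.

```python
import math

def log_likelihood(a, b, c, d):
    """G2 statistic for the 2x2 table [[a, b], [c, d]].

    For corpus comparison, one possible layout is:
    a = word count in corpus 1, b = word count in corpus 2,
    c = all other tokens in corpus 1, d = all other tokens in corpus 2.
    """
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            e = rows[i] * cols[j] / n  # expected count if there is no association
            if o > 0:                  # a zero cell contributes nothing to the sum
                total += o * math.log(o / e)
    return 2 * total

# Hypothetical counts: same relative frequency in both corpora
same_rate = log_likelihood(10, 20, 90, 180)
# Hypothetical counts: 50 vs 10 occurrences in two 1,000-token corpora
skewed = log_likelihood(50, 10, 950, 990)

print(same_rate, skewed)
```

With one degree of freedom, a value above 3.84 corresponds to p < 0.05, so the second table would count as a significant difference under these assumptions while the first would not.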
Additional Resources
• University of Lancaster Centre for Computer Corpus Research on Language (Summer
School) http://ucrel.lancs.ac.uk/
• McEnery, Tony, and Wilson, Andrew. Corpus Linguistics, 2nd ed. Edinburgh University Press,
2001.
• ESRC Centre for Corpus Approaches to Social Science (CASS) University of Lancaster
• Aston, Guy and Burnard, Lou. The BNC handbook: exploring the British National Corpus
with SARA. Edinburgh University Press, 1998.
• Biber, Douglas, Conrad, Susan, and Reppen, Randi. Corpus Linguistics: Investigating
Language Structure and Use. Cambridge University Press, 1998.
Questions/Comments
