Corpus! Thy Name is
Knowledge Discovery
Dr Zafar Ullah
zafarullah76@gmail.com
Scholarly Work 02. Code. 0090
1
Why?
★Close reading vs Distant Reading
★What to do with a million books?
★Big data
★Databases
★Josephine Miles, Roberto Busa
2
Corpus and Approaches
★Why “analyzing and computing patterns
of linguistic form, meaning” (p.195)
★qualitative and quantitative linguistic
analyses
★Corpus-based
★Corpus-driven
3
Types of Corpus
1. Monitor Corpus
(COCA, BOE)
2. Parallel Corpus (translation)
(Open Source Parallel Corpus; English-Norwegian
Parallel Corpus)
3. Comparable Corpus
(ICE International Corpus of English)
4. Diachronic Corpus
(Helsinki Corpus of English Texts; COCA, Time
Magazine)
4
Cont…
5. Specialized Corpus
(air traffic controller speech; student essays; Early
Modern English Tracts)
6. Multimedia Corpus
(SACODEYL)
7. Representative, Balanced Corpus
(e.g. the British National Corpus: 100 million words;
about 10 million spoken and 90 million written)
5
Files and Free Conversion
★ TXT, Word (DOC, DOCX), PDF, EPUB, CSV, JSON, Excel,
Mobi, MP3, FLV, JPG, ZIP, HTML, RTF, etc.
★ Free Online File Converter
https://www.online-convert.com/
6
Corpus, Linguist, Corpus Tool
7
Web Links of Corpora
• https://www.english-corpora.org/
8
Corpus (online access) | # words | Dialect | Time period | Genre(s)
iWeb: The Intelligent Web-based Corpus | 14 billion | 6 countries | 2017 | Web
News on the Web (NOW) | 11.1 billion+ | 20 countries | 2010-yesterday | Web: News
Global Web-Based English (GloWbE) | 1.9 billion | 20 countries | 2012-13 | Web (incl blogs)
Wikipedia Corpus | 1.9 billion | (Various) | 2014 | Wikipedia
Corpus of Contemporary American English (COCA) | 1.0 billion | American | 1990-2019 | Balanced
Coronavirus Corpus | 626 million+ | 20 countries | Jan 2020-yesterday | Web: News
Corpus of Historical American English (COHA) | 400 million | American | 1810-2009 | Balanced
The TV Corpus | 325 million | 6 countries | 1950-2018 | TV shows
The Movie Corpus | 200 million | 6 countries | 1930-2018 | Movies
Corpus of American Soap Operas | 100 million | American | 2001-2012 | TV shows
Some Other Corpora
Corpus | # words | Dialect | Time period | Genre(s)
Hansard Corpus | 1.6 billion | British | 1803-2005 | Parliament
Early English Books Online | 755 million | British | 1470s-1690s | (Various)
Corpus of US Supreme Court Opinions | 130 million | American | 1790s-present | Legal opinions
TIME Magazine Corpus | 100 million | American | 1923-2006 | Magazine
British National Corpus (BNC) * | 100 million | British | 1980s-1993 | Balanced
Strathy Corpus (Canada) | 50 million | Canadian | 1970s-2000s | Balanced
CORE Corpus | 50 million | 6 countries | 2014 | Web
From Google Books n-grams (compare):
American English | 155 billion | American | 1500s-2000s | (Various)
British English | 34 billion | British | 1500s-2000 | (Various)
9
Free Databases
★ Mendeley Data
https://data.mendeley.com/research-data/
★ Google Dataset Search
https://datasetsearch.research.google.com/
★ UCI Machine Learning Repository
https://archive.ics.uci.edu/ml/datasets.php
★ European data portal
https://data.europa.eu/en
★ Kaggle
https://www.kaggle.com/datasets
10
Traditional Corpus Tools
★ AntConc
https://www.laurenceanthony.net/software/antconc/
★ WordSmith Tools
https://www.lexically.net/wordsmith/
★ Sketch Engine
https://www.sketchengine.eu/
★ LancsBox
http://corpora.lancs.ac.uk/lancsbox/
11
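A core function shared by the tools on this slide is the concordance, or keyword-in-context (KWIC) view. The Python sketch below shows the idea only; it is not how any of these tools is implemented, and the function name `kwic`, the `width` parameter, and the sample sentence are assumptions made for illustration.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of keyword."""
    lines = []
    # Whole-word, case-insensitive match of the search term.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        # Take up to `width` characters on each side of the hit.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines

# Hypothetical sample text, not taken from any corpus.
sample = "The owl is wise in one culture. In another culture the owl is a bad omen."
for line in kwic(sample, "owl", width=20):
    print(line)
```

Aligning the keyword in a central column is what makes recurring left and right patterns visible at a glance.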
Knowledge Discovery Theory
1. “In active data mining paradigm,… rules are discovered, … we
describe the constructs for defining shapes, and discuss how the shape
predicates are used in a query construct” (Agrawal & Psaila, 1995).
The KDD process is a “set of various activities for making sense of data”
(Fayyad, Piatetsky-Shapiro, & Smyth, 1996, p. 82).
2. “The extraction of implicit, previously unknown and potentially
useful information from data” (Cabena, Hadjinian, Stadler, Verhees, &
Zanasi, 1998, p. 9; Witten, Frank, & Hall, 2011).
12
Literature and Corpus
● Voyant Tools: https://voyant-tools.org/
● Z-Library https://pk1lib.org/
● Project Gutenberg
https://www.gutenberg.org/
● HathiTrust https://www.hathitrust.org/
13
Corpus Application in Language
★ Voyant Tools: https://voyant-tools.org/
★ Themes, phrases (collocations, colligations,
and collostructions), knowledge graphs,
corpus summary, word sense disambiguation (WSD), genre analysis
★ POS Tagger Tool
https://parts-of-speech.info/
14
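Collocation, listed above, can be illustrated with a window-based count of the words that occur near a node word. This is a minimal sketch under stated assumptions: the function name `collocates`, the simple regex tokenizer, the `window` size, and the sample sentence are all hypothetical, and it reports raw co-occurrence counts only, not an association measure such as mutual information.

```python
import re
from collections import Counter

def collocates(text, node, window=2):
    """Count words found within `window` tokens to the left or right of `node`."""
    # Naive tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            start = max(0, i - window)
            # Left neighbours plus right neighbours, excluding the node itself.
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts

# Hypothetical sample sentence.
sample = "Strong tea and strong coffee are common; powerful tea is not."
print(collocates(sample, "tea", window=1).most_common(3))
```

On a real corpus the counts would normally be compared against each word's overall frequency before anything is called a collocation.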
cont…
★Hedges
★Simile vs verb (like)
★Stylistics
★Forensic linguistics
★Coinage, lexicon
★ Language change and shift
★ Register and semantic change
★Interjections
15
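Features such as hedges are usually compared across texts of different sizes by normalising counts, for example per million words. The sketch below illustrates that calculation; the hedge list `HEDGES`, the function name `per_million`, and the sample text are assumptions for illustration, and a real study would need to justify its own word list.

```python
import re

# Hypothetical, deliberately short hedge list.
HEDGES = {"perhaps", "possibly", "might", "may", "seems", "probably"}

def per_million(text, wordlist):
    """Return the frequency of wordlist items per million tokens of text."""
    # Naive tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in wordlist)
    return hits * 1_000_000 / len(tokens)

# Hypothetical sample text.
sample = "It seems the result may hold. Perhaps it does not."
print(per_million(sample, HEDGES))
```

Normalised figures from very short texts like this sample are unstable, so they are only meaningful on corpora of reasonable size.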
Knowledge Discovery in Culture
★ Google Ngram viewer and cultural diversity
https://books.google.com/ngrams
★ Cultural shades of owl, dog, donkey
★ Comparison of cultures
★ Sports, travel, religion, education, military terms
★ Swearing terms
★ Kinship terminologies
★ Feelings about nationalities, religions
★ Attitudes towards LGBT people in different cultures
16
Text Mining
★ Voyant
https://voyant-tools.org/
★ Gephi
https://gephi.org/
★ Orange
https://orangedatamining.com/
★ Sentiment Analysis Tool: Text2Data
https://text2data.com/
17
Image Mining
★ 3D modelling
★ 3D printing
★ ATLAS.ti tool (image mining)
https://atlasti.com/
18
Audio Mining and Phonetics
★ Dialectology
★ Speech to text
★ Speech to transcription
★ Praat tool
https://www.fon.hum.uva.nl/praat/
19
Video Mining
★ ATLAS.ti tool (audio, video, image, social
media): qualitative data analysis
https://atlasti.com/
★ Free trial
★ Register
★ Click: Select new project
★ Click: Documents
★ Analysis
★ Share / export
20
Geo-Spatial Data Mining
★ Instant Streetview
https://www.instantstreetview.com/
★ Blender GIS
https://sourceforge.net/projects/blender-gis.mirror/
★ ArcGIS
https://www.arcgis.com/index.html
21
Social Media Mining
Social Searcher tool
https://www.social-searcher.com/
22
Knowledge Discovery in Pakistani
Languages
E-Library Punjab
https://elibrary.punjab.gov.pk/e_books
REKHTA
SUFINAMA
https://sufinama.org/poets/khwaja-ghulam-farid/all
Voyant
https://voyant-tools.org/
23
Museology and Digital Humanities
★ Thousands of Virtual Museums
https://mcn.edu/a-guide-to-virtual-museum-resources/
24
Musicology and Corpus
★ Music Dataset: Lyrics and Metadata
from 1950 to 2019
https://data.mendeley.com/datasets/3t9vbwxgr5/2
★ Music Dataset : 1950 to 2019
https://www.kaggle.com/datasets/saurabhshahane/music-dataset-1950-to-2019
25
Humanoid Robotics and Corpus
26
Quantum Humanities
★ Speed
★ Complexity
★ Many variables
27
Access Video and Slides
• Video:
• Slides:
28
Thanks
29
Questions and Answers
30

Corpus Knowledge Discovery Techniques

  • 1. Corpus! Thy Name is Knowledge Discovery Dr Zafar Ullah zafarullah76@gmail.com Scholarly Work 02. Code. 0090 1
  • 2. Why? ★Close reading vs Distant Reading ★What to do with million books? ★Big data ★Databases ★Josephine Miles, Roberto Busa 2
  • 3. Corpus and Approaches ★Why “analyzing and computing patterns of linguistic form, meaning” (p.195) ★qualitative and quantitative linguistic analyses ★Corpus based ★Corpus driven 3
  • 4. Types of Corpus 1. Monitor Corpus (COCA, BOE) 2. Parallel Corpus (translation) (Open Source Parallel Corpus; English Norwegian Parallel Corpus) 3. Comparable Corpus (ICE International Corpus of English) 4. Diachronic Corpus (Helsinki Corpus of English Texts; COCA, Time Magazine) 4
  • 5. Cont… 5. Specialized Corpus (air controller traffic speech; student essays; Early Modern English Tracts) 6. Multimedia Corpus (SACODEYL) 7. Representative, Balanced Corpus (100 million words; 1 million spoken and 1 million written) 5
  • 6. Files and Free Conversion ★ txt, word, Pdf, epub, csv, json, excel, mobireader, mp3, FLV, JPG, Zip, docs, HTML, RTF etc ★ Free Online File Converter https://www.online-convert.com/ 6
  • 8. Web Links of Corpora: https://www.english-corpora.org/
    Corpus (online access) | # words | Dialect | Time period | Genre(s)
    iWeb: The Intelligent Web-based Corpus | 14 billion | 6 countries | 2017 | Web
    News on the Web (NOW) | 11.1 billion+ | 20 countries | 2010-yesterday | Web: News
    Global Web-Based English (GloWbE) | 1.9 billion | 20 countries | 2012-13 | Web (incl blogs)
    Wikipedia Corpus | 1.9 billion | (Various) | 2014 | Wikipedia
    Corpus of Contemporary American English (COCA) | 1.0 billion | American | 1990-2019 | Balanced
    Coronavirus Corpus | 626 million+ | 20 countries | Jan 2020-yesterday | Web: News
    Corpus of Historical American English (COHA) | 400 million | American | 1810-2009 | Balanced
    The TV Corpus | 325 million | 6 countries | 1950-2018 | TV shows
    The Movie Corpus | 200 million | 6 countries | 1930-2018 | Movies
    Corpus of American Soap Operas | 100 million | American | 2001-2012 | TV shows
  • 9. Some Other Corpora
    Corpus | # words | Dialect | Time period | Genre(s)
    Hansard Corpus | 1.6 billion | British | 1803-2005 | Parliament
    Early English Books Online | 755 million | British | 1470s-1690s | (Various)
    Corpus of US Supreme Court Opinions | 130 million | American | 1790s-present | Legal opinions
    TIME Magazine Corpus | 100 million | American | 1923-2006 | Magazine
    British National Corpus (BNC) | 100 million | British | 1980s-1993 | Balanced
    Strathy Corpus (Canada) | 50 million | Canadian | 1970s-2000s | Balanced
    CORE Corpus | 50 million | 6 countries | 2014 | Web
    From Google Books n-grams (compare):
    American English | 155 billion | American | 1500s-2000s | (Various)
    British English | 34 billion | British | 1500s-2000 | (Various)
  • 10. Free Databases ★ Mendeley Data: https://data.mendeley.com/research-data/ ★ Google Dataset Search: https://datasetsearch.research.google.com/ ★ UCI Library: https://archive.ics.uci.edu/ml/datasets.php ★ Europe Data: https://data.europa.eu/en ★ Kaggle: https://www.kaggle.com/datasets
  • 11. Traditional Corpus Tools ★ AntConc: https://www.laurenceanthony.net/software/antconc/ ★ WordSmith Tools: https://www.lexically.net/wordsmith/ ★ Sketch Engine: https://www.sketchengine.eu/ ★ #LancsBox: http://corpora.lancs.ac.uk/lancsbox/
  • 12. Knowledge Discovery Theory 1. “In active data mining paradigm,… rules are discovered, … we describe the constructs for defining shapes, and discuss how the shape predicates are used in a query construct” (Agrawal & Psaila, 1995). The KDD process was a “set of various activities for making sense of data” (Fayyad, Piatetsky-Shapiro, & Smyth, 1996, p. 82). 2. “The extraction of implicit, previously unknown and potentially useful information from data” (Cabena, Hadjinian, Stadler, Verhees, & Zanasi, 1998, p. 9; Witten, Frank, & Hall, 2011).
  • 13. Literature and Corpus ● Voyant Tools: https://voyant-tools.org/ ● Z-Library: https://pk1lib.org/ ● Project Gutenberg: https://www.gutenberg.org/ ● HathiTrust: https://www.hathitrust.org/
  • 14. Corpus Application in Language ★ Voyant Tools: https://voyant-tools.org/ ★ Themes, phrases (collocations, colligations, and collostructions), knowledge graphs, corpus summary, word sense disambiguation (WSD), genre analysis ★ POS Tagger Tool: https://parts-of-speech.info/
  • 15. Cont… ★ Hedges ★ Simile vs verb (like) ★ Stylistics ★ Forensic ★ Coinage, lexicon ★ Language change and shift ★ Register and semantic change ★ Interjections
  • 16. Knowledge Discovery in Culture ★ Google Ngram Viewer and cultural diversity: https://books.google.com/ngrams ★ Cultural shades of owl, dog, donkey ★ Comparison of cultures ★ Sports, travel, religion, education, military terms ★ Swearing terms ★ Kinship terminologies ★ Feelings about nationalities, religions ★ Attitudes towards LGBT people in different cultures
  • 17. Text Mining ★ Voyant: https://voyant-tools.org/ ★ Gephi: https://gephi.org/ ★ Orange: https://orangedatamining.com/ ★ Sentiment Analysis Tool, Text2Data: https://text2data.com/
  • 18. Image Mining ★ 3D modelling ★ 3D printing ★ ATLAS.ti tool (image mining): https://atlasti.com/
  • 19. Audio Mining and Phonetics ★ Dialectology ★ Speech to text ★ Speech to transcription ★ Praat tool: https://www.fon.hum.uva.nl/praat/
  • 20. Video Mining ★ ATLAS.ti tool (audio, video, image, social media) for qualitative data analysis: https://atlasti.com/ ★ Free trial ★ Register ★ Click: Select new project ★ Click: Documents ★ Analysis ★ Share/export
  • 21. Geo-Spatial Data Mining ★ Instant Streetview: https://www.instantstreetview.com/ ★ Blender GIS: https://sourceforge.net/projects/blender-gis.mirror/ ★ ArcGIS: https://www.arcgis.com/index.html
  • 22. Social Media Mining ★ Social Searcher tool: https://www.social-searcher.com/
  • 23. Knowledge Discovery in Pakistani Languages ★ E-Library Punjab: https://elibrary.punjab.gov.pk/e_books ★ Rekhta Sufinama: https://sufinama.org/poets/khwaja-ghulam-farid/all ★ Voyant: https://voyant-tools.org/
  • 24. Museology and Digital Humanities ★ Thousands of virtual museums: https://mcn.edu/a-guide-to-virtual-museum-resources/
  • 25. Musicology and Corpus ★ Music Dataset: Lyrics and Metadata from 1950 to 2019: https://data.mendeley.com/datasets/3t9vbwxgr5/2 ★ Music Dataset: 1950 to 2019: https://www.kaggle.com/datasets/saurabhshahane/music-dataset-1950-to-2019
  • 27. Quantum Humanities ★ Speed ★ Complexity ★ Many variables
  • 28. Access Video and Slides • Video: • Slides:
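
The slides name concordancing tools (AntConc, WordSmith Tools) and collocation analysis without showing what such tools compute. The following is a minimal Python sketch, not the method of any tool listed above: it builds keyword-in-context (KWIC) lines, counts the words that co-occur with a keyword inside a small window, and normalizes a raw count to a per-million-words rate so that corpora of different sizes can be compared. The function names (tokenize, kwic, collocates, per_million), the window size, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def kwic(tokens, keyword, window=3):
    """Return one keyword-in-context line per hit, with `window` tokens on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


def collocates(tokens, keyword, window=3):
    """Count the tokens that appear within `window` positions of the keyword."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts


def per_million(count, corpus_size):
    """Normalize a raw frequency to occurrences per million words."""
    return count * 1_000_000 / corpus_size


if __name__ == "__main__":
    # Hypothetical sample text; a real study would load a corpus file instead.
    sample = "The owl is wise. The owl hunts at night."
    toks = tokenize(sample)
    for line in kwic(toks, "owl", window=2):
        print(line)
    print(collocates(toks, "owl", window=2).most_common(3))
    print(per_million(1000, 100_000_000))
```

Raw co-occurrence counts favour frequent function words such as "the"; dedicated tools usually rank collocates with an association measure instead, which this sketch leaves out for brevity.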