Corpus Knowledge Discovery Techniques
1. Corpus! Thy Name is
Knowledge Discovery
Dr Zafar Ullah
zafarullah76@gmail.com
Scholarly Work 02. Code. 0090
2. Why?
★Close reading vs Distant Reading
★What to do with a million books?
★Big data
★Databases
★Josephine Miles, Roberto Busa
3. Corpus and Approaches
★Why “analyzing and computing patterns
of linguistic form, meaning” (p.195)
★qualitative and quantitative linguistic
analyses
★Corpus based
★Corpus driven
4. Types of Corpus
1. Monitor Corpus
(COCA, BOE)
2. Parallel Corpus (translation)
(Open Source Parallel Corpus; English-Norwegian
Parallel Corpus)
3. Comparable Corpus
(ICE: International Corpus of English)
4. Diachronic Corpus
(Helsinki Corpus of English Texts; COCA, Time
Magazine)
5. Cont…
5. Specialized Corpus
(air traffic controller speech; student essays; Early
Modern English Tracts)
6. Multimedia Corpus
(SACODEYL)
7. Representative, Balanced Corpus
(100 million words; 1 million spoken and 1 million
written)
8. Web Links of Corpora
• https://www.english-corpora.org/
Corpus (online access) | # words | Dialect | Time period | Genre(s)
iWeb: The Intelligent Web-based Corpus | 14 billion | 6 countries | 2017 | Web
News on the Web (NOW) | 11.1 billion+ | 20 countries | 2010-yesterday | Web: News
Global Web-Based English (GloWbE) | 1.9 billion | 20 countries | 2012-13 | Web (incl blogs)
Wikipedia Corpus | 1.9 billion | (Various) | 2014 | Wikipedia
Corpus of Contemporary American English (COCA) | 1.0 billion | American | 1990-2019 | Balanced
Coronavirus Corpus | 626 million+ | 20 countries | Jan 2020-yesterday | Web: News
Corpus of Historical American English (COHA) | 400 million | American | 1810-2009 | Balanced
The TV Corpus | 325 million | 6 countries | 1950-2018 | TV shows
The Movie Corpus | 200 million | 6 countries | 1930-2018 | Movies
Corpus of American Soap Operas | 100 million | American | 2001-2012 | TV shows
9. Some Other Corpora
Corpus | # words | Dialect | Time period | Genre(s)
Hansard Corpus | 1.6 billion | British | 1803-2005 | Parliament
Early English Books Online | 755 million | British | 1470s-1690s | (Various)
Corpus of US Supreme Court Opinions | 130 million | American | 1790s-present | Legal opinions
TIME Magazine Corpus | 100 million | American | 1923-2006 | Magazine
British National Corpus (BNC) * | 100 million | British | 1980s-1993 | Balanced
Strathy Corpus (Canada) | 50 million | Canadian | 1970s-2000s | Balanced
CORE Corpus | 50 million | 6 countries | 2014 | Web
From Google Books n-grams (compare)
American English | 155 billion | American | 1500s-2000s | (Various)
British English | 34 billion | British | 1500s-2000 | (Various)
10. Free Databases
★ Mendeley Data
https://data.mendeley.com/research-data/
★ Google dataset
https://datasetsearch.research.google.com/
★ UCI Library
https://archive.ics.uci.edu/ml/datasets.php
★ Europe Data
https://data.europa.eu/en
★ Kaggle
https://www.kaggle.com/datasets
11. Traditional Corpus Tools
★ AntConc
https://www.laurenceanthony.net/software/antconc/
★ WordSmith Tools
https://www.lexically.net/wordsmith/
★ Sketchengine
https://www.sketchengine.eu/
★ LancsBox
http://corpora.lancs.ac.uk/lancsbox/
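The tools listed above are typically used to produce concordances (keyword-in-context, or KWIC, views). As a rough illustration of the idea only, and not of how any of these tools is implemented, the following Python sketch builds a small KWIC listing. The function name `kwic`, the `width` parameter, and the sample sentence are all hypothetical.

```python
import re

def kwic(text, keyword, width=3):
    """Return keyword-in-context lines: up to `width` tokens on each side of a hit."""
    # Simplified tokenizer (an assumption): runs of letters and apostrophes.
    tokens = re.findall(r"[A-Za-z']+", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Hypothetical sample text, not taken from any corpus.
sample = "The owl is wise in one culture. In another culture the owl is a fool."
for line in kwic(sample, "owl"):
    print(line)
# The [owl] is wise in
# another culture the [owl] is a fool
```

Dedicated concordancers add sorting, regular-expression search, and file handling on top of this basic windowing step.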
12. Knowledge Discovery Theory
1. “In active data mining paradigm,… rules are discovered, … we
describe the constructs for defining shapes, and discuss how the shape
predicates are used in a query construct” (Agrawal & Psaila, 1995).
The KDD process was a “set of various activities for making sense of data”
(Fayyad, Piatetsky-Shapiro, & Smyth, 1996, p. 82).
2. “The extraction of implicit, previously unknown and potentially
useful information from data” (Cabena, Hadjinian, Stadler, Verhees, &
Zanasi, 1998, p. 9; Witten, Frank, & Hall, 2011).
13. Literature and Corpus
● Voyant Tools: https://voyant-tools.org/
● Z library https://pk1lib.org/
● Project Gutenberg
https://www.gutenberg.org/
● Hathitrust https://www.hathitrust.org/
14. Corpus Application in Language
★ Voyant Tools: https://voyant-tools.org/
★ Themes, phrases (collocations, colligations,
and collostructions), knowledge graphs,
corpus summary, word sense disambiguation (WSD), genre analysis
★ POS Tagger Tool
https://parts-of-speech.info/
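Collocation, mentioned above, is usually measured with an association score. The slide does not name one, so the sketch below assumes pointwise mutual information (PMI) over adjacent word pairs, which is one common choice among several. The function name `bigram_pmi`, the `min_count` threshold, and the sample text are hypothetical.

```python
import math
import re
from collections import Counter

def bigram_pmi(text, min_count=2):
    """Score adjacent word pairs by PMI = log2(P(w1, w2) / (P(w1) * P(w2)))."""
    tokens = re.findall(r"[a-z']+", text.lower())
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = max(n_tokens - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI values
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_tokens
        p_w2 = unigrams[w2] / n_tokens
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    # Highest-scoring pairs first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical sample text for illustration only.
sample = ("strong tea and strong coffee, but powerful computers "
          "and strong tea again")
for pair, score in bigram_pmi(sample):
    print(pair, round(score, 3))
```

On a real corpus the frequency threshold would be set much higher, since PMI tends to overrate pairs built from rare words.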
15. Cont…
★Hedges
★Simile vs verb (like)
★Stylistics
★Forensic
★Coinage, lexicon
★ Language change and shift
★ Register and semantic change
★Interjections
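Features such as hedges are often compared across texts using a normalized frequency (for example, hits per 1,000 words). The sketch below shows that calculation under simple assumptions: the `HEDGES` word list is a small illustrative set chosen here, not a list from the slides, and only single-word hedges are matched without any disambiguation.

```python
import re
from collections import Counter

# Illustrative hedge list (an assumption, not from the source).
HEDGES = {"perhaps", "possibly", "maybe", "might", "may", "could", "seems", "somewhat"}

def hedge_rate(text, per=1000):
    """Return (counts per hedge word, hedges per `per` tokens)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = Counter(t for t in tokens if t in HEDGES)
    rate = per * sum(hits.values()) / len(tokens) if tokens else 0.0
    return hits, rate

# Hypothetical sample sentence.
sample = "It seems the result might be valid. Perhaps more data could help."
hits, rate = hedge_rate(sample)
print(dict(hits))
print(round(rate, 1))  # 333.3 hedges per 1,000 tokens in this tiny sample
```

Ambiguous items such as "may" (modal verb or month) or "like" (simile or verb) would need part-of-speech tagging, which this sketch does not attempt.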
16. Knowledge Discovery in Culture
★ Google Ngram viewer and cultural diversity
https://books.google.com/ngrams
★ Cultural shades of owl, dog, donkey
★ Comparison of cultures
★ Sports, travel, religion, education, military terms
★ Swearing terms
★ Kinship terminologies
★ Feelings about nationalities, religions
★ Behaviour towards LGBT people in different cultures
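Comparisons of this kind rest on tracking how often a term occurs over time, relative to the amount of text in each period. The sketch below computes a per-million-words frequency by decade over a toy set of dated texts. It does not query the Google Ngram Viewer; the function name `per_million_by_decade` and the sample documents are hypothetical.

```python
import re
from collections import defaultdict

def per_million_by_decade(docs, term):
    """docs: iterable of (year, text). Returns {decade: hits per million tokens}."""
    totals = defaultdict(int)  # tokens seen per decade
    hits = defaultdict(int)    # occurrences of `term` per decade
    for year, text in docs:
        decade = (year // 10) * 10
        tokens = re.findall(r"[a-z']+", text.lower())
        totals[decade] += len(tokens)
        hits[decade] += tokens.count(term.lower())
    return {d: 1_000_000 * hits[d] / totals[d] for d in sorted(totals) if totals[d]}

# Hypothetical dated snippets standing in for a diachronic corpus.
docs = [
    (1953, "the dog is loyal and the owl is wise"),
    (1958, "a wise owl"),
    (2004, "the dog and the donkey went home"),
]
for decade, freq in per_million_by_decade(docs, "owl").items():
    print(decade, round(freq, 1))
```

Normalizing by the size of each period matters because later decades usually contain far more text than earlier ones.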
23. Knowledge Discovery in Pakistani
Languages
E-Library Punjab
https://elibrary.punjab.gov.pk/e_books
REKHTA
SUFINAMA
https://sufinama.org/poets/khwaja-ghulam-farid/all
Voyant
https://voyant-tools.org/
24. Museology and Digital Humanities
★ Thousands of Virtual Museums
https://mcn.edu/a-guide-to-virtual-museum-resources/
25. Musicology and Corpus
★ Music Dataset: Lyrics and Metadata
from 1950 to 2019
https://data.mendeley.com/datasets/3t9vbwxgr5/2
★ Music Dataset : 1950 to 2019
https://www.kaggle.com/datasets/saurabhshahane/music-dataset-1950-to-2019