Revisiting What Counts as a Word: The development of New Word Level Checker
Atsushi Mizumoto (Kansai University)
JAECS 2021 Spring Symposium
Corpus Tools and Statistical Methods (TASM) SIG
nwlc.pythonanywhere.com
Outline
• Word Lists


• Word Profilers


• How to Count Words


• Development of NWLC


• Conclusion
Word Lists
• General Service List (GSL) (West, 1953)


• BNC Lemmatized Frequency List


(Kilgarriff, 1996)


• Academic Word List (AWL) (Coxhead, 2000)


• BNC2000 (Nation, 2006)

• BNC/COCA2000 (Nation, 2012)
Word Lists
• New General Service List (new-GSL)


(Brezina & Gablasova, 2013)


• New General Service List (NGSL)


(Browne et al., 2014)


• New Academic Word List (NAWL)


(Browne et al., 2014) *Cambridge English Corpus


• Academic Vocabulary List (AVL)


(Gardner & Davies, 2014)
Outline
• Word Lists


• Word Profilers


• How to Count Words


• Development of NWLC


• Conclusion
2021-0509_JAECS2021_Spring
https://www.laurenceanthony.net/software/antquicktools/
http://dd.kyushu-u.ac.jp/~uchida/cvla.html
http://someya-net.com/wlc/
Outline
• Word Lists


• Word Profilers


• How to Count Words


• Development of NWLC


• Conclusion
How to Count Words
• Token (= total number)

e.g., A good wine is a wine that you like. (9 tokens)

• Type (= unique words)

e.g., A good wine is a wine that you like. (7 types)

type/token ratio (TTR)
= a measure of lexical richness
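The token and type counts for the example sentence can be reproduced with a short Python sketch. The tokenization rule here (lowercase, keep runs of letters and apostrophes) is an assumption made for illustration, not the rule any particular profiler uses.

```python
import re

def tokens_and_types(text):
    """Return (tokens, types) under a deliberately simple rule:
    lowercase the text and keep runs of letters and apostrophes."""
    tokens = re.findall(r"[a-z']+", text.lower())
    types = set(tokens)  # unique word forms
    return tokens, types

sentence = "A good wine is a wine that you like."
tokens, types = tokens_and_types(sentence)
ttr = len(types) / len(tokens)  # type/token ratio

print(len(tokens), len(types), round(ttr, 2))  # 9 7 0.78
```

"A" and "a" collapse into one type only because the text is lowercased first; a case-sensitive count would give 8 types.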
How Many Words?
Don't be trapped by dogma — which is living
with the results of other people’s thinking.


(A) 15


(B) 16


(C) 17


(D) 18
How Many Words?
Don't be trapped by dogma — which is living
with the results of other people’s thinking.


(A) 15 => MS Word


(B) 16 => Word Profilers (Do not & w/o ’s)


(C) 17


(D) 18
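A minimal sketch of how the answer shifts with the treatment of contractions and the possessive ’s. The function name `count_tokens` and its two flags are hypothetical, and the regular expression is one possible rule among many; it is not how MS Word or any specific profiler counts.

```python
import re

# A word = letters, optionally followed by apostrophe + letters (Don't, people’s).
WORD = re.compile(r"[A-Za-z]+(?:['’][A-Za-z]+)*")

def count_tokens(text, split_negation=False, split_possessive=False):
    """Count words, optionally counting n't and 's as separate tokens."""
    count = 0
    for word in WORD.findall(text):
        count += 1
        lower = word.lower().replace("’", "'")
        if split_negation and lower.endswith("n't"):
            count += 1  # "Don't" -> "Do" + "not"
        if split_possessive and lower.endswith("'s"):
            count += 1  # "people's" -> "people" + "'s"
    return count

sentence = ("Don't be trapped by dogma — which is living "
            "with the results of other people’s thinking.")

print(count_tokens(sentence))                                             # 15
print(count_tokens(sentence, split_negation=True))                        # 16
print(count_tokens(sentence, split_negation=True, split_possessive=True)) # 17
```

The `'s` check cannot tell a possessive from a contracted "is" (e.g., "it's"). Telling them apart would need POS information.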
How Many Words?
An Osaka-based "idol group" made up of
women whose average age is 66 has released
a rap-style music video in English.


21
- hyphenated words separated, w/o numbers: 22
- hyphenated words unseparated, w/o numbers: 20
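The three counts for the Osaka sentence can be checked with a small sketch. The function `count_words` and its options are hypothetical names introduced here, and the rule (split on whitespace, strip surrounding punctuation) is an assumption.

```python
PUNCT = "\"'“”.,;:!?()"

def count_words(text, split_hyphens=False, keep_numbers=True):
    """Count whitespace-delimited words, with two switchable decisions."""
    if split_hyphens:
        text = text.replace("-", " ")  # "Osaka-based" -> "Osaka based"
    chunks = [c.strip(PUNCT) for c in text.split()]
    chunks = [c for c in chunks if c]
    if not keep_numbers:
        chunks = [c for c in chunks if not c.isdigit()]  # drop "66"
    return len(chunks)

sentence = ('An Osaka-based "idol group" made up of women whose average '
            'age is 66 has released a rap-style music video in English.')

print(count_words(sentence))                                          # 21
print(count_words(sentence, split_hyphens=True, keep_numbers=False))  # 22
print(count_words(sentence, keep_numbers=False))                      # 20
```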
Token Definition Matters!
• word family
(base + inflected forms + derivatives w/o POS)
e.g., study, studies (n), study, studies, studied, studying (v), studied (j), studiously (r), studious (j), studying (n)


• lemma (base + inflected forms w/ POS)
e.g., study (n), study (v), studied (j), studiously (r), studious (j), studying (n)


• flemma (family lemma) (base + inflected forms w/o POS)
e.g., study (n, v), studied (j), studiously (r), studious (j)
Word Counting Unit
McLean (2018)
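A sketch of how the counting unit changes the number of "words" in the same data. The tagged sample is a simplified subset of the study example, and the `BASE` and `FAMILY` tables are hand-made and hypothetical; a real tool would take them from a (f)lemma list or a word-family list.

```python
# Hand-made, hypothetical lookup tables for illustration only.
BASE = {"study": "study", "studies": "study", "studied": "study",
        "studying": "study", "studious": "studious",
        "studiously": "studiously"}
FAMILY = {"study": "study", "studious": "study", "studiously": "study"}

# (word form, POS) pairs; POS tags are assumed to be given.
tagged = [("study", "n"), ("studies", "n"), ("study", "v"),
          ("studied", "v"), ("studying", "v"),
          ("studious", "j"), ("studiously", "r")]

def count_units(tagged_tokens, unit):
    """Count distinct units under the chosen word counting unit."""
    keys = set()
    for form, pos in tagged_tokens:
        base = BASE[form]
        if unit == "lemma":        # base form + POS
            keys.add((base, pos))
        elif unit == "flemma":     # base form, POS ignored
            keys.add(base)
        elif unit == "family":     # derivatives merged as well
            keys.add(FAMILY[base])
        else:
            raise ValueError(f"unknown unit: {unit}")
    return len(keys)

for unit in ("lemma", "flemma", "family"):
    print(unit, count_units(tagged, unit))
# lemma 4, flemma 3, family 1
```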
Tokenization
Referring to Existing Word List
Lemmatization
WF/Lemma
• abbreviate -> abbreviate, abbreviates, abbreviating, abbreviated

• abide -> abide, abode, abided, abides, abiding

• ability -> ability, abilities
(f)lemma list
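A (f)lemma list of this shape can be turned into a form-to-headword lookup in a few lines. This sketch assumes one entry per line in the `headword -> form, form, ...` notation shown on the slide; the actual file format of a given lemma list may differ.

```python
def parse_lemma_list(lines):
    """Build a {word form: headword} lookup from 'head -> f1, f2' lines."""
    lookup = {}
    for line in lines:
        if "->" not in line:
            continue  # skip blank or malformed lines
        head, forms = line.split("->", 1)
        head = head.strip()
        lookup.setdefault(head, head)
        for form in forms.split(","):
            form = form.strip()
            if form:
                # If a form appears under two headwords, the first one wins.
                lookup.setdefault(form, head)
    return lookup

entries = [
    "abbreviate -> abbreviate, abbreviates, abbreviating, abbreviated",
    "abide -> abide, abode, abided, abides, abiding",
    "ability -> ability, abilities",
]
lookup = parse_lemma_list(entries)

print(lookup["abode"])      # abide
print(lookup["abilities"])  # ability
```

Because a flemma ignores POS, "abode" maps to "abide" here even when it is used as a noun.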
Outline
• Word Lists


• Word Profilers


• How to Count Words


• Development of NWLC


• Conclusion
Word Lists on NWLC
1. New JACET8000 (JACET, 2016)


2. SVL12000 (ALC, 2001)


3. New General Service List (NGSL) + 3 Lists
(Browne et al., 2013)


4. CEFR-J (Tono, 2019)


5. SEWK-J (Pinchbeck, in preparation)
New JACET8000
• Updated version of JACET8000 (JACET, 2003)
• British National Corpus (BNC) and the Corpus
of Contemporary American English (COCA)
and other lists


• An educational word list for Japanese learners
of English, especially university students
SVL12000
• ALC Press, Inc. (2001)


• British National Corpus (BNC)


• Some subjective selection and ranking of words
• 12,000 words divided into 12 levels


• Many ALC materials use this list.
NGSL + 3 Lists
• Developed by Dr. Charles Browne and his colleagues (2013)
• Modern update of the General Service List


(West, 1953)


• 273 million-word subsection of the 2 billion word
Cambridge English Corpus (CEC)


• Approximately 2,800 high frequency words (NGSL)


• Covers about 90 percent of general texts of English
NGSL + 3 Lists
• 3 additional lists


- New Academic Word List (NAWL)


- TOEIC Service List (TSL)


- Business Service List (BSL)
NGSL + 3 Lists
5,621 words without overlapping words


Level 1: NGSL = 2,801 words


Level 2: NAWL/TOEIC/BSL = 183 words

Level 3: NAWL/TOEIC, NAWL/BSL, or TOEIC/BSL = 790 words


Level 4: Only in NAWL, TOEIC, or BSL = 1,847 words
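The level scheme can be read as a set-membership rule. The sketch below assumes that NGSL membership takes priority over the other lists (consistent with "without overlapping words"). The function name `assign_level` and the toy word sets are hypothetical placeholders, not actual list contents.

```python
def assign_level(word, ngsl, nawl, tsl, bsl):
    """Return 1-4 following the slide's levels, or None if in no list."""
    if word in ngsl:
        return 1
    hits = sum(word in s for s in (nawl, tsl, bsl))
    # in all three lists -> 2, in exactly two -> 3, in exactly one -> 4
    return {3: 2, 2: 3, 1: 4}.get(hits)

# Toy placeholder sets (w1, w2, ... are not real entries).
ngsl = {"w1"}
nawl = {"w1", "w2", "w3", "w5"}
tsl = {"w2", "w3"}
bsl = {"w2", "w4"}

for w in ("w1", "w2", "w3", "w4", "w5", "w6"):
    print(w, assign_level(w, ngsl, nawl, tsl, bsl))
# w1 1, w2 2, w3 3, w4 4, w5 4, w6 None

# The four level sizes on the slide add up to the stated total.
total = 2801 + 183 + 790 + 1847
print(total)  # 5621
```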
CEFR-J Wordlist
• “The textbook corpora consist of the major English
textbooks used at primary to secondary schools (Years
3 to 10) in China, Korea, and Taiwan.”


• “All the subcorpora were initially classified according to
the CEFR levels, based on the analysis of the Course of
Study provided in each country/region.”


• Compared with the English Vocabulary Profile.


• “All the words have part-of-speech information and
corresponding CEFR levels.”
http://www.cefr-j.org/download.html#cefrj_wordlist
SEWK-J
• Scale of English Word Knowledge - Japanese
• 74,810-word list created by Dr. Geoff Pinchbeck
(Paper in preparation)


• SEWK-J list estimates the likelihood that a word
is known to Japanese university students.


• “SEWK-J: Fine-grained” is for users interested in
a detailed understanding of lower-level (more
frequent) words’ profiling.
flemma List
• For New JACET8000 and SVL12000, AntBNC Lemma
List (by Dr. Laurence Anthony) is used.


• AntBNC Lemma List = All words in the BNC corpus with
a frequency greater than two.


• Modifications were manually made to match the
headwords of New JACET8000 and SVL12000.


e.g., "interesting" and "interested" in both lists


— They were excluded from the lemma entry "interest"


(interest = interest, interests)
flemma List
• Headwords with British spellings in New
JACET8000 and SVL12000 are included in the
revised lemma list


e.g., advertise = advertise, advertised, advertises, advertising,
advertize, advertizes, advertized, advertizing


• For NGSL and SEWK-J, the lemma lists are
provided by the word list developers.
• POS prediction is based on pretrained statistical
models (i.e., on examples the model has seen during
training).


• Accuracy of POS tagging = 97.05 %


• Much faster than other modules in Python


(e.g., nltk).
POS Tagging and Lemmatization
For CEFR-J
Examples of POS Tagging
Proper Nouns and Numerals
• In NWLC, proper nouns and numerals
(numbers) are first identified using spaCy.


• Those are treated as possibly “known” words
because they can be assumed to be
understood by learners.


• The possessive ’s (e.g., Todd’s dog) is also
put into this category.
A word
1. Proper noun / number? YES: Known. NO: go to 2.
2. Headword in the selected list? YES: Level. NO: go to 3.
3. In the (f)lemma list? YES: Level. NO: NA.
For CEFR-J, spaCy is used.
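The decision flow can be sketched as a single function. This is an illustration under stated assumptions, not NWLC's actual code. POS tags are assumed to be already assigned (the tag names `PROPN` and `NUM` follow the Universal POS labels that spaCy exposes), case handling is left out, and the level numbers and dictionaries below are hypothetical.

```python
def classify(form, pos, headword_levels, flemma_lookup):
    """Return 'Known', a level from the selected list, or 'NA'."""
    # Step 1: proper nouns, numerals, and the possessive 's count as known.
    if pos in {"PROPN", "NUM"} or form in {"'s", "’s"}:
        return "Known"
    # Step 2: the form itself is a headword in the selected list.
    if form in headword_levels:
        return headword_levels[form]
    # Step 3: map the form to its headword through the (f)lemma list.
    head = flemma_lookup.get(form)
    if head in headword_levels:
        return headword_levels[head]
    return "NA"

# Hypothetical data for illustration only.
headword_levels = {"ability": 1, "abide": 5}
flemma_lookup = {"abilities": "ability", "abode": "abide"}

samples = [("Osaka", "PROPN"), ("66", "NUM"), ("ability", "NOUN"),
           ("abilities", "NOUN"), ("abode", "VERB"), ("zzz", "NOUN")]
results = [classify(f, p, headword_levels, flemma_lookup) for f, p in samples]
print(results)  # ['Known', 'Known', 1, 1, 5, 'NA']
```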
Capitalized Letters
= The three word lists are case-sensitive.
Contracted Forms
Words with Periods (i.e., abbreviated words)
Hyphenated Words
Osaka-based = Osaka and based (2 words)
Compounds/Multi-word Units
Development and Deployment
• Python


• Flask (micro web framework)


• js (Bootstrap) + CSS
YAKE! (Yet Another Keyword Extractor)
https://github.com/LIAAD/yake
• Unsupervised approach

• Better performance than other approaches (Campos et al., 2020):
TF.IDF, KP-Miner, RAKE, TextRank, SingleRank,
ExpandRank, TopicRank, TopicalPageRank,
PositionRank, MultipartiteRank
Outline
• Word Lists


• Word Profilers


• How to Count Words


• Development of NWLC


• Conclusion
Why Is Counting a Word an Issue?
• Most stats are based on frequency (word
counts) in a corpus.


• Results may not be reproducible.


• “95–98% coverage is necessary for reading
and listening” — Do we really know which
words can be counted or not?
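A small sketch of why the counting rule matters for coverage figures. Coverage is taken here as the proportion of tokens found in a set of known words; the known-word set and the two tokenizations of the same phrase are hypothetical.

```python
def coverage(tokens, known_words):
    """Proportion of tokens that appear in the known-word set."""
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if t.lower() in known_words)
    return known / len(tokens)

known_words = {"osaka", "based", "idol", "group"}  # hypothetical

unseparated = ["Osaka-based", "idol", "group"]     # hyphenated word kept whole
separated = ["Osaka", "based", "idol", "group"]    # hyphenated word split

print(round(coverage(unseparated, known_words), 2))  # 0.67
print(round(coverage(separated, known_words), 2))    # 1.0
```

The same text and the same word list give two different coverage values, so a reported coverage is only interpretable together with the tokenization rule.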
Suggestions
• We should pay more attention to how we count
words, including the word counting unit.


• “replication crisis” (39% in Psychology, Open Science Collaboration, 2015)
• Transparency and openness in science
https://nobaproject.com/modules/the-replication-crisis-in-psychology
Open Science Attempt in Another Field
https://us.sagepub.com/en-us/nam/journal/research-politics#submission-guidelines
Suggestions (Cont.)
• Word list developers should establish consistent
rules for choosing headwords.


• Other considerations: Case-sensitive (e.g.,
March, May, US, IT), contracted forms, words
with periods, hyphenated words, compounds?
• They should also provide a (f)lemma list for their
word list (or specify the POS-tagger).
nwlc.pythonanywhere.com


2021-0509_JAECS2021_Spring

  • 1. Revisiting What Counts as a Word: The development of 
 New Word Level Checker Atsushi Mizumoto (Kansai University) JAECS 2021 Spring Symposium Corpus Tools and Statistical Methods (TASM) SIG
  • 3. Outline • Word Lists • Word Pro fi lers • How to Count Words • Development of NWLC • Conclusion
  • 4. Word Lists • General Service List (GSL) (West, 1953) • BNC Lemmatized Frequency List 
 (Kilgarriff, 1996) • Academic Word List (AWL) (Coxhead, 2000) • BNC2000 (Nation, 2006) • BNC/COCA2000 (Nation, 2012)
  • 5. Word Lists • New General Service List (new-GSL) 
 (Brezina & Gablasova, 2013) • New General Service List (NGSL) 
 (Browne et al., 2014) • New Academic Word List (NAWL) 
 (Browne et al., 2014) *Cambridge English Corpus • Academic Vocabulary List (AVL) 
 (Gardner & Davies, 2014)
  • 7. Outline • Word Lists • Word Profilers • How to Count Words • Development of NWLC • Conclusion
  • 13. Outline • Word Lists • Word Profilers • How to Count Words • Development of NWLC • Conclusion
  • 14. How to Count Words • Token (= total number) 
 e.g., A good wine is a wine that you like. (9 tokens) • Type (= unique words) 
 e.g., A good wine is a wine that you like. (7 types) • type/token ratio (TTR) = a measure of lexical richness
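The token and type counts on this slide can be reproduced with a few lines of Python. This is a minimal sketch, assuming a lowercase, letters-only tokenizer; the function name `tokens_and_types` is made up for illustration and is not part of NWLC.

```python
import re

def tokens_and_types(text):
    """Return (tokens, types) using a simple lowercase, letters-only tokenizer."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return tokens, set(tokens)

sentence = "A good wine is a wine that you like."
tokens, types = tokens_and_types(sentence)
ttr = len(types) / len(tokens)  # type/token ratio

print(len(tokens), len(types), round(ttr, 2))  # 9 7 0.78
```

Lowercasing is what makes "A" and "a" one type; a case-sensitive count would give 8 types instead of 7.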
  • 15. How Many Words? Don't be trapped by dogma — which is living with the results of other people’s thinking. (A) 15 
 (B) 16 
 (C) 17 
 (D) 18
  • 16. How Many Words? Don't be trapped by dogma — which is living with the results of other people’s thinking. (A) 15 => MS Word 
 (B) 16 => Word Profilers ("Don't" counted as "Do not"; possessive ’s not counted) 
 (C) 17 
 (D) 18 [slide annotation: the words of the sentence are numbered 1 to 15]
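The gap between 15 and 16 comes down to how contractions and the possessive are handled. The sketch below is an assumption about the two rules, not the actual logic of MS Word or of any word profiler: one function keeps "Don't" and "people’s" as single words, the other expands n't to "not" and drops ’s. Both ignore the dash.

```python
import re

SENTENCE = ("Don't be trapped by dogma \u2014 which is living "
            "with the results of other people\u2019s thinking.")

def count_contractions_whole(text):
    """Count runs of letters and apostrophes, so contractions stay one word."""
    text = text.replace("\u2019", "'")  # normalise the curly apostrophe
    return len(re.findall(r"[A-Za-z']+", text))

def count_contractions_split(text):
    """Expand n't to 'not' and drop possessive 's before counting.

    Simplification: this would turn "can't" into "ca not", so a real
    tool needs a proper contraction table.
    """
    text = text.replace("\u2019", "'")
    text = re.sub(r"n't\b", " not", text)
    text = re.sub(r"'s\b", "", text)
    return len(re.findall(r"[A-Za-z]+", text))

print(count_contractions_whole(SENTENCE))  # 15
print(count_contractions_split(SENTENCE))  # 16
```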
  • 17. How Many Words? An Osaka-based "idol group" made up of women whose average age is 66 has released a rap-style music video in English. • 21 words as numbered on the slide (1 to 21) • Hyphenated words separated, w/o numbers: 22 • Hyphenated words unseparated, w/o numbers: 20
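The three counts for this sentence follow from two switches: whether hyphenated words are split, and whether numerals are counted. A rough sketch, assuming a simple regular-expression tokenizer (the function `count_words` and its parameters are illustrative only):

```python
import re

SENTENCE = ('An Osaka-based "idol group" made up of women whose average age '
            'is 66 has released a rap-style music video in English.')

def count_words(text, split_hyphens, keep_numbers):
    """Count words under a given hyphen rule and numeral rule."""
    if split_hyphens:
        text = text.replace("-", " ")
    # A word is a run of letters/digits, optionally joined by internal hyphens.
    words = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)
    if not keep_numbers:
        words = [w for w in words if not w.isdigit()]
    return len(words)

print(count_words(SENTENCE, split_hyphens=False, keep_numbers=True))   # 21
print(count_words(SENTENCE, split_hyphens=True, keep_numbers=False))   # 22
print(count_words(SENTENCE, split_hyphens=False, keep_numbers=False))  # 20
```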
  • 19. Word Counting Unit • word family 
 (base + inflected forms + derivatives w/o POS) 
 e.g., study, studies (n), study, studies, studied, studying (v), studied (j), studiously (r), studious (j), studying (n) • lemma (base + inflected forms w/ POS) 
 e.g., study (n), study (v), studied (j), studiously (r), studious (j), studying (n) • flemma (family lemma) (base + inflected forms w/o POS) e.g., study (n, v), studied (j), studiously (r), studious (j)
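To see how the choice of counting unit changes a word count, here is a small sketch. The lookup tables `BASE` and `FAMILY` are hypothetical and hand-made for this one example; a real tool would take them from a lemma list or a tagger.

```python
# Hypothetical mini-lexicon: (surface form, POS) -> base form.
BASE = {
    ("study", "n"): "study", ("studies", "n"): "study",
    ("study", "v"): "study", ("studies", "v"): "study",
    ("studied", "v"): "study", ("studying", "v"): "study",
    ("studious", "j"): "studious", ("studiously", "r"): "studiously",
}
# Hypothetical family table: derivative -> family headword.
FAMILY = {"studious": "study", "studiously": "study"}

def count_units(tagged_tokens, unit):
    """Count distinct units in a POS-tagged text under one counting unit."""
    units = set()
    for form, pos in tagged_tokens:
        base = BASE[(form, pos)]
        if unit == "lemma":        # base + inflected forms, POS kept apart
            units.add((base, pos))
        elif unit == "flemma":     # base + inflected forms, POS ignored
            units.add(base)
        elif unit == "family":     # derivatives folded into the headword
            units.add(FAMILY.get(base, base))
        else:
            raise ValueError(f"unknown unit: {unit}")
    return len(units)

text = [("study", "n"), ("studies", "v"), ("studied", "v"),
        ("studious", "j"), ("studiously", "r")]

print(count_units(text, "lemma"))   # 4
print(count_units(text, "flemma"))  # 3
print(count_units(text, "family"))  # 1
```

The same five tokens count as four, three, or one "word" depending on the unit.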
  • 22. (f)lemma list • abbreviate -> 
 abbreviate, abbreviates, abbreviating, abbreviated • abide -> 
 abide, abode, abided, abides, abiding • ability -> 
 ability, abilities
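A (f)lemma list like this one is typically used in the reverse direction: given a form found in a text, find its headword. The sketch below builds that reverse index from the three entries on the slide; the names `build_form_index` and `headword_of` are illustrative, not taken from NWLC.

```python
# The three entries shown on the slide: headword -> member forms.
FLEMMA_LIST = {
    "abbreviate": ["abbreviate", "abbreviates", "abbreviating", "abbreviated"],
    "abide": ["abide", "abode", "abided", "abides", "abiding"],
    "ability": ["ability", "abilities"],
}

def build_form_index(flemma_list):
    """Invert headword -> forms into form -> headword."""
    index = {}
    for headword, forms in flemma_list.items():
        for form in forms:
            # If a form appears under two headwords, keep the first one seen.
            index.setdefault(form.lower(), headword)
    return index

FORM_TO_HEADWORD = build_form_index(FLEMMA_LIST)

def headword_of(word):
    """Return the headword for a form, or None if it is not in the list."""
    return FORM_TO_HEADWORD.get(word.lower())

print(headword_of("abode"))      # abide
print(headword_of("Abilities"))  # ability
print(headword_of("zebra"))      # None
```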
  • 24. Outline • Word Lists • Word Profilers • How to Count Words • Development of NWLC • Conclusion
  • 25. Word Lists on NWLC 1. New JACET8000 (JACET, 2016) 2. SVL12000 (ALC, 2001) 3. New General Service List (NGSL) + 3 Lists (Browne et al., 2013) 4. CEFR-J (Tono, 2019) 5. SEWK-J (Pinchbeck, in preparation)
  • 27. New JACET8000 • Updated version of JACET8000 (JACET, 2003) • Based on the British National Corpus (BNC), the Corpus of Contemporary American English (COCA), and other lists • An educational word list for Japanese learners of English, especially university students
  • 28. SVL12000 • ALC Press, Inc. (2001) • Based on the British National Corpus (BNC) • Some subjective selection and ranking of words • 12,000 words divided into 12 levels • Many ALC materials use this list.
  • 29. NGSL + 3 Lists • Developed by Dr. Charles Browne and his colleagues (2013) • Modern update of the General Service List 
 (West, 1953) • 273-million-word subsection of the 2-billion-word Cambridge English Corpus (CEC) • Approximately 2,800 high-frequency words (NGSL) • Covers about 90 percent of general English texts
  • 30. NGSL + 3 Lists • 3 additional lists 
 - New Academic Word List (NAWL) 
 - TOEIC Service List (TSL) 
 - Business Service List (BSL)
  • 31. NGSL + 3 Lists • 5,621 words in total, with overlapping words counted once • Level 1: NGSL = 2,801 words 
 Level 2: NAWL/TOEIC/BSL (in all three lists) = 183 words
 Level 3: NAWL/TOEIC, NAWL/BSL, or TOEIC/BSL (in two of the three lists) = 790 words 
 Level 4: Only in NAWL, TOEIC, or BSL (in one list only) = 1,847 words
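The four levels can be read as a membership rule over the four lists. The sketch below is one possible reading, not NWLC's code: it assumes that NGSL membership takes priority, and the tiny word sets are placeholders invented for the example, not real list contents. The last lines check that the four level sizes add up to the stated total.

```python
def assign_level(word, ngsl, nawl, tsl, bsl):
    """Return level 1-4, or None if the word is in none of the lists.

    Assumption: a word in the NGSL is Level 1 regardless of the other lists.
    """
    if word in ngsl:
        return 1
    hits = sum(word in lst for lst in (nawl, tsl, bsl))
    # In all three lists -> 2, in two -> 3, in exactly one -> 4.
    return {3: 2, 2: 3, 1: 4}.get(hits)

# Placeholder sets (hypothetical contents, for illustration only).
ngsl = {"the", "study"}
nawl = {"alpha", "beta", "gamma"}
tsl = {"alpha", "beta", "delta"}
bsl = {"alpha", "epsilon"}

for w in ["study", "alpha", "beta", "gamma", "zeta"]:
    print(w, assign_level(w, ngsl, nawl, tsl, bsl))

# The level sizes on the slide sum to the stated total.
level_sizes = [2801, 183, 790, 1847]
print(sum(level_sizes))  # 5621
```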
  • 32. CEFR-J Wordlist • “The textbook corpora consist of the major English textbooks used at primary to secondary schools (Years 3 to 10) in China, Korea, and Taiwan.” • “All the subcorpora were initially classified according to the CEFR levels, based on the analysis of the Course of Study provided in each country/region.” • Compared with the English Vocabulary Profile. • “All the words have part-of-speech information and corresponding CEFR levels.” http://www.cefr-j.org/download.html#cefrj_wordlist
  • 34. SEWK-J • Scale of English Word Knowledge - Japanese • 74,810-word list created by Dr. Geoff Pinchbeck (Paper in preparation) • The SEWK-J list estimates the likelihood that a word is known to Japanese university students. • “SEWK-J: Fine-grained” is for users interested in detailed profiling of lower-level (more frequent) words.
  • 36. flemma List • For New JACET8000 and SVL12000, AntBNC Lemma List (by Dr. Laurence Anthony) is used. • AntBNC Lemma List = All words in the BNC corpus with a frequency greater than two. • Modifications were manually made to match the headwords of New JACET8000 and SVL12000. 
 e.g., "interesting" and "interested" are headwords in both lists, 
 so they were excluded from the lemma entry "interest" 
 (interest = interest, interests)
  • 37. flemma List • Headwords with British spellings in New JACET8000 and SVL12000 are included in the revised lemma list 
 e.g., advertise = advertise, advertised, advertises, advertising, advertize, advertizes, advertized, advertizing • For NGSL and SEWK-J, the lemma lists are provided by the word list developers.
  • 38. POS Tagging and Lemmatization for CEFR-J • POS prediction is based on pretrained statistical models (i.e., on examples the model has seen during training). • Accuracy of POS tagging = 97.05% • Much faster than other Python modules 
 (e.g., nltk).
  • 39. Examples of POS Tagging
  • 40. Proper Nouns and Numerals • In NWLC, proper nouns and numerals (numbers) are first identified using spaCy. • Those are treated as possibly “known” words because they can be assumed to be understood by learners. • The possessive ’s (e.g., Todd’s dog) is also put into this category.
  • 41. [Flowchart] A word -> Proper noun / number? YES: Known. NO -> Headword in the selected list? YES: Level. NO -> In the (f)lemma list? YES: Level. NO: NA. • For CEFR-J, spaCy is used.
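The decision flow on this slide can be written as a short function. This is a sketch under stated assumptions rather than NWLC's implementation: the proper-noun/number test is passed in as a function (the slides say spaCy does this step; the stand-in below is only a crude capital-letter or digit check), and the level values in `headword_levels` are hypothetical.

```python
def classify(word, is_proper_or_number, headword_levels, form_to_headword):
    """Follow the flowchart: Known -> headword level -> (f)lemma level -> NA."""
    if is_proper_or_number(word):
        return "Known"
    if word in headword_levels:                # headword in the selected list?
        return headword_levels[word]
    headword = form_to_headword.get(word)      # in the (f)lemma list?
    if headword in headword_levels:
        return headword_levels[headword]
    return "NA"

def crude_check(word):
    """Crude stand-in for a tagger: capitalised words and digit strings."""
    return word[:1].isupper() or word.isdigit()

headword_levels = {"ability": 1, "abide": 3}   # hypothetical levels
form_to_headword = {"abilities": "ability", "abode": "abide"}

for w in ["Osaka", "66", "ability", "abode", "xyzzy"]:
    print(w, classify(w, crude_check, headword_levels, form_to_headword))
```

Passing the check in as a parameter keeps the flow itself independent of whichever tagger is used.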
  • 43. Capitalized Letters = The three word lists are case-sensitive.
  • 44. Contracted Forms • Words with Periods (i.e., abbreviated words)
  • 45. Hyphenated Words • Osaka-based = Osaka and based (2 words)
  • 47. Development and Deployment • Python 
 • Flask 
 (micro web framework) 
 • js (Bootstrap) 
 + CSS
  • 51. YAKE! (Yet Another Keyword Extractor) https://github.com/LIAAD/yake • Unsupervised approach • Better performance than other approaches
 (Campos et al., 2020): 
 TF.IDF, KP-Miner, RAKE, TextRank, SingleRank, ExpandRank, TopicRank, TopicalPageRank, PositionRank, MultipartiteRank
  • 53. Outline • Word Lists • Word Profilers • How to Count Words • Development of NWLC • Conclusion
  • 54. Why Is Counting Words an Issue? • Most stats are based on frequency (word counts) in a corpus. • Results may not be reproducible. • “95–98% coverage is necessary for reading and listening”: do we really know which words are counted and which are not?
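Coverage figures such as 95–98% are ratios of known tokens to all tokens, so they move whenever the counting rules move. A minimal sketch, assuming coverage is computed over running tokens with case-insensitive lookup (the `known` set is a made-up example, not a real word list):

```python
def coverage(tokens, known_words):
    """Share of running tokens that appear in the known-word set."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.lower() in known_words)
    return hits / len(tokens)

tokens = ["A", "good", "wine", "is", "a", "wine", "that", "you", "like"]
known = {"a", "good", "is", "that", "you", "like"}  # hypothetical known words

print(round(coverage(tokens, known), 3))             # 0.778
print(round(coverage(tokens, known | {"wine"}), 3))  # 1.0
```

Whether one repeated word is treated as known shifts the result from about 78% to 100% here, which is why the counting rules need to be reported alongside any coverage figure.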
  • 55. Suggestions • We should pay more attention to how we count words, including the word counting unit. • “Replication crisis” (only 39% of studies replicated in psychology; Open Science Collaboration, 2015) • Transparency and openness in science https://nobaproject.com/modules/the-replication-crisis-in-psychology
  • 56. Open Science Attempt in Another Field https://us.sagepub.com/en-us/nam/journal/research-politics#submission-guidelines
  • 57. Suggestions (Cont.) • Word list developers should establish consistent rules for choosing headwords. • Other considerations: Case-sensitive (e.g., March, May, US, IT), contracted forms, words with periods, hyphenated words, compounds? • They should also provide a (f)lemma list for their word list (or specify the POS-tagger).