Dr. Iztok Kosem - Innovations in Slovenian (e-)lexicography: from (semi-)automatic data extraction to crowdsourcing and beyond

Faculty of Arts, University of Ljubljana &
Centre for Applied Linguistics, Trojina Institute

Lexicographical process (Klosa, 2013)

Born-digital dictionaries
• ANW (Dictionary of Contemporary Dutch)
• 51079 entries (incl. partly complete entries)
• Innovative features (e.g. semagrams)
• Great Dictionary of Polish
• A great deal of manual work included (Żmigrodzki 2014)
• Immediate release of final entries
• 15,000 entries in 5 years (not many examples!)
• Estonian collocations dictionary (Kallas et al. 2015)
• Starting point: automatically extracted data
• Problems: examples extracted using a very general
configuration; missing collocation clustering etc.
• Publication of the entire dictionary at the end

Dictionary situation in Slovenia
• Last comprehensive dictionary of Slovene published in 1991
(with many entries older, from the 1970s and 1980s)
• Based on material from late 19th century to 1970s
• dictionary database not accessible (also question marks about its
usefulness)
• Second edition published in 2014
• minor updates to the first edition (also opposing the conceptual
framework of the first version; Krek 2014; Ahlin et al. 2014)
• online version requires a purchase of a printed version
• database is not available
• Dictionary publishing in general:
• Commercial publishers closing dictionary departments (no new
projects)
• General monolingual projects publicly funded

Dictionary of Contemporary
Slovene Language
• Challenges:
• Compiling a corpus-based dictionary from scratch, using
state-of-the-art lexicographic methods and theoretical
underpinnings
• Meeting needs of dictionary users (digital natives)
• Meeting the needs of NLP and language technology
communities
• Communication in Slovene (2008-2013)
• Gigafida corpus (1.2 billion words)
• New POS-tagger, parser and lexicon of word forms
• Slovene Lexical Database (Gantar et al. 2016)
• Testing new methods and approaches

Lexicography and automation
• Which parts of a dictionary entry can be
(semi-)automatically extracted:
• List of words (e.g. terms)
• New words (Cook et al. 2013)
• Definitions (e.g. Pearson 1998; Pollak 2014)
• Some types of labels (Rundell & Kilgarriff 2011)
• Grammatical relations, collocations, multi-word
expressions (PARSEME COST Action)
• Corpus examples (Kosem et al. 2013; Gantar et al. 2016;
Cook et al. 2014)

authority (“manual” Sketch Grammar): 35 gramrels
authority (automatic Sketch Grammar): 39 gramrels
19 gramrels with 92 multi-word links (separate page)

“it is more efficient to edit out the
computer’s errors than to go through
the whole data-selection process from
the beginning”
(Rundell & Kilgarriff, 2011)
“too many choices early in the data-
selection process leave more room for
error”
(Kosem, Gantar & Krek, 2013)

Main (unproven) criticisms
• Automatic tools cannot replace lexicographers
• Important information can be missed
• Analysis is not as detailed and reliable as with the
manual approach
• Etc.
• Evaluation (Kosem et al. 2015)

SLD entries   Coverage of syntactic structures   Coverage of collocates under structures
nouns         82.40%                             72.79%
adjectives    94.33%                             75.80%
adverbs       92.78%                             78.32%

• 100% coverage of all collocates:
• 12% of noun entries
• 8.4% of verb entries
• 16.4% of adjective entries
• 25% of adverb entries
• 100% coverage of collocates under syntactic structures:
• 9.7% of noun entries
• 22.5% of adverb entries
• 100% coverage of syntactic structures
• 35.4% of noun entries
• 82.5% of adverb entries.
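
The coverage figures above can be read as set overlaps between the manually compiled SLD entries and the automatically extracted data. The slides do not give the exact procedure used in the evaluation (Kosem et al. 2015), so the following is only a minimal sketch under that assumption; the function names and the toy data are hypothetical.

```python
def coverage(manual: set, extracted: set) -> float:
    """Percentage of manually recorded items (structures or collocates)
    that also appear in the automatically extracted data."""
    if not manual:
        return 100.0  # nothing to cover (an assumption for empty entries)
    return 100.0 * len(manual & extracted) / len(manual)


def share_fully_covered(entries) -> float:
    """Percentage of entries whose manual items are all found in the
    extracted data. `entries` is a list of (manual_set, extracted_set)."""
    if not entries:
        return 0.0
    full = sum(1 for manual, extracted in entries if manual <= extracted)
    return 100.0 * full / len(entries)


# Toy data, invented for illustration only
manual_collocates = {"velik", "majhen", "nov", "star"}
extracted_collocates = {"velik", "nov", "lep"}
print(coverage(manual_collocates, extracted_collocates))
```

Extra items found only by the automatic extraction ("lep" in the toy data) do not lower the score in this reading; they correspond to the additional collocates and structures reported on the next slide.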

Why not always 100%?
11.8.2015 Herstmonceux castle, eLex 2015
• Errors in SLD – a small amount (e.g. typos, wrong case
of collocate under certain syntactic structure)
• Different corpora and sketch grammars used
• Parameters for automatic extraction quite strict
• E.g. structure not exported if no collocates match the
minimum criteria → structure marked as not found by ADE
• On the other hand:
• Five to six times more collocates extracted
• Several syntactic structures in automatically extracted data,
which were not detected by lexicographers
• Several (good) examples match (more examples analysed)
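
The effect of strict extraction parameters can be illustrated with a small filter: a syntactic structure is exported only if at least one of its collocates passes the minimum criteria. The thresholds, field layout and names below are assumptions made for illustration, not the actual settings used for the extraction.

```python
MIN_FREQ = 5      # hypothetical minimum collocate frequency
MIN_SCORE = 7.0   # hypothetical minimum salience score


def export_structures(candidates):
    """Keep only collocates that meet both thresholds and drop any
    structure left without collocates.

    `candidates` maps a structure name to a list of
    (collocate, frequency, score) tuples.
    """
    exported = {}
    for structure, collocates in candidates.items():
        kept = [c for c, freq, score in collocates
                if freq >= MIN_FREQ and score >= MIN_SCORE]
        if kept:  # a structure with no surviving collocates is not exported
            exported[structure] = kept
    return exported


# Toy data, invented for illustration only
candidates = {
    "adjective + noun": [("velik", 12, 8.1), ("majhen", 2, 9.0)],
    "noun + noun (genitive)": [("hiša", 3, 6.0)],
}
print(export_structures(candidates))
```

In this toy run the second structure disappears from the export even though a lexicographer might have recorded it, which is one way a structure can end up marked as not found.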

Post-processing
• Tasks that are automated:
• Converting extracted data into the correct form (lemma
+ collocate)
• Removing duplicate examples
• Cleaning examples of noise (e.g. removing any extra
spaces before full stops and commas)
• Assigning IDs of lemmas from the lexicon of word forms
• Other issues:
• False collocates (e.g. tagging problems)
• Incorrect examples (i.e. where the collocation does not
match the grammatical relation it belongs to)
• Grouping collocates, attributing them under senses, etc.
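
Two of the automated tasks listed above, cleaning examples of noise and removing duplicate examples, can be sketched as follows. This is a minimal illustration rather than the project's actual post-processing code; the function names are hypothetical, and treating examples that differ only in letter case as duplicates is an assumption.

```python
import re


def clean_example(sentence: str) -> str:
    """Remove spaces before full stops and commas and collapse
    repeated whitespace."""
    sentence = re.sub(r"\s+([.,])", r"\1", sentence)
    return re.sub(r"\s+", " ", sentence).strip()


def dedupe_examples(examples):
    """Clean each example and drop duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for example in examples:
        cleaned = clean_example(example)
        key = cleaned.casefold()  # assumption: ignore case when comparing
        if key not in seen:
            seen.add(key)
            result.append(cleaned)
    return result


print(dedupe_examples(["To je  primer , ki deluje .", "To je primer, ki deluje."]))
```

Cleaning before deduplication matters here: the two toy sentences only become identical once the stray spaces are removed.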

"Crowdsourcing" in lexicography:
(improving) the final product
(Abel & Meyer, 2013)

Crowdsourcing – dividing a complex
task into a series of simple ones
• Why is crowdsourcing needed in lexicography:
• challenges:
• lexicographers are facing increasing time constraints
& amounts of data
• lexicographers are overqualified for routine post-
editing of automatic procedures
• potential:
• non-expert individuals are talented, creative &
productive enough to solve such tasks
• modern technology makes using the potential of the
crowd simple, affordable & effective

Crowdsourcing - caveats
• estimate of the required investment with respect to
time, money & personnel is crucial
(should not take up more time &
resources than conventional methods)
• if fully integrated in the project,
microtasks can be designed according to
the same principles, use the same pre- &
post-processing chains & platforms
(economizing the initial investment)

Lessons learned
• Instructions must be clearly formulated and simple,
answers must not allow grading (only YES, NO, I
DON’T KNOW)
• not all automatically extracted data is suitable for
crowdsourcing:
• e.g. some grammatical relations are too complex for
evaluation
• users need to focus on some other objective:
competition, credits, money (micro payments)
• Gamification:
• examples: language games such as ESP Game (von Ahn,
2006) and Phrase Detectives (Chamberlain et al., 2008)
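
Restricting answers to YES, NO and I DON'T KNOW makes crowd judgements easy to combine. The slides do not say how answers were aggregated, so the majority rule, the thresholds and the names below are assumptions, shown only to illustrate why ungraded answers are convenient.

```python
from collections import Counter


def aggregate(votes, min_votes=3, min_agreement=0.6):
    """Combine crowd answers for one microtask.

    "DONT_KNOW" answers are ignored. The item stays "UNDECIDED" if too
    few decisive votes were cast or if agreement is too low.
    """
    counted = Counter(v for v in votes if v in ("YES", "NO"))
    total = sum(counted.values())
    if total < min_votes:
        return "UNDECIDED"
    answer, n = counted.most_common(1)[0]
    return answer if n / total >= min_agreement else "UNDECIDED"


print(aggregate(["YES", "YES", "NO", "DONT_KNOW"]))
```

Items that remain undecided could be sent to more contributors or passed on to a lexicographer; which option is cheaper depends on the project.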

Lexicographical process of DCSL

DCSL – implementation and
future
• Meeting the needs of users
• Release of entries at each stage (thus, the dictionary is
available from the start)
• Making the database available to NLP community,
researchers etc.
• A parallel project for testing and improving the first
stages of the procedure: Collocations dictionary of
Slovene

Thank you!
• Funded by the Slovenian Research Agency project:
Koncept madžarsko-slovenskega slovarja: od
jezikovnega vira do uporabnika (V6-1509)
[Concept of a Hungarian-Slovene dictionary: from language resource to user]
