Innovations in Slovenian
(e-)lexicography:
from (semi-)automatic data
extraction to crowdsourcing
and beyond
Dr Iztok Kosem
Faculty of Arts, University of Ljubljana &
Centre for Applied Linguistics, Trojina Institute
Lexicographical process (Klosa, 2013)
Born-digital dictionaries
• ANW (Dictionary of Contemporary Dutch)
• 51079 entries (incl. partly complete entries)
• Innovative features (e.g. semagrams)
• Great Dictionary of Polish
• A great deal of manual work included (Żmigrodzki 2014)
• Immediate release of final entries
• 15,000 entries in 5 years (not many examples!)
• Estonian collocations dictionary (Kallas et al. 2015)
• Starting point: automatically extracted data
• Problems: examples extracted using a very general
configuration; missing collocation clustering etc.
• Publication of the entire dictionary at the end
Dictionary situation in Slovenia
• Last comprehensive dictionary of Slovene published in 1991
(with many entries older, from the 1970s and 1980s)
• Based on material from late 19th century to 1970s
• dictionary database not accessible (also question marks about its
usefulness)
• Second edition published in 2014
• minor updates to the first edition (also opposing the conceptual
framework of the first version; Krek 2014; Ahlin et al. 2014)
• online version requires a purchase of a printed version
• database is not available
• Dictionary publishing in general:
• Commercial publishers closing dictionary departments (no new
projects)
• General monolingual projects publicly funded
Dictionary of Contemporary
Slovene Language
• Challenges:
• Compiling a corpus-based dictionary from scratch, using
state-of-the-art lexicographic methods and theoretical
underpinnings
• Meeting the needs of dictionary users (digital natives)
• Meeting the needs of NLP and language technology
communities
• Communication in Slovene (2008-2013)
• Gigafida corpus (1.2 billion words)
• New POS-tagger, parser and lexicon of word forms
• Slovene Lexical Database (Gantar et al. 2016)
• Testing new methods and approaches
Lexicography and automation
• Which parts of a dictionary entry can be
(semi-)automatically extracted:
• List of words (e.g. terms)
• New words (Cook et al. 2013)
• Definitions (e.g. Pearson 1998; Pollak 2014)
• Some types of labels (Rundell & Kilgarriff 2011)
• Grammatical relations, collocations, multi-word
expressions (PARSEME COST Action)
• Corpus examples (Kosem et al. 2013; Gantar et al. 2016;
Cook et al. 2014)
authority (“manual” Sketch Grammar)
35 gramrels
authority (automatic Sketch Grammar)
39 gramrels
19 gramrels with 92 multi-word links
(separate page)
“it is more efficient to edit out the
computer’s errors than to go through
the whole data-selection process from
the beginning”
(Rundell & Kilgarriff, 2011)
“too many choices early in the data-
selection process leave more room for
error”
(Kosem, Gantar & Krek, 2013)
Main (unproven) criticisms
• Automatic tools cannot replace lexicographers
• Important information can be missed
• Analysis is not as detailed and reliable as with the
manual approach
• Etc.
• Evaluation (Kosem et al. 2015)
SLD entries    Coverage of syntactic structures    Coverage of collocates under structures
nouns          82.40%                              72.79%
adjectives     94.33%                              75.80%
adverbs        92.78%                              78.32%
• 100% coverage of all collocates:
• 12% of noun entries
• 8.4% of verb entries
• 16.4% of adjective entries
• 25% of adverb entries
• 100% coverage of collocates under syntactic structures:
• 9.7% of noun entries
• 18.5% of adjective entries
• 22.5% of adverb entries
• 100% coverage of syntactic structures
• 35.4% of noun entries
• 81.1% of adjective entries
• 82.5% of adverb entries.
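A coverage figure of this kind can be sketched as follows. This is only an illustration, assuming that "coverage" means the share of items recorded in a manually compiled SLD entry that also appear in the automatically extracted data; the function name and the toy collocates are hypothetical and not taken from the evaluation itself.

```python
def coverage(manual_items, extracted_items):
    """Percentage of manually recorded items also found in the extracted data."""
    manual = set(manual_items)
    if not manual:
        # Assumption: an entry with nothing to cover counts as fully covered.
        return 100.0
    return 100.0 * len(manual & set(extracted_items)) / len(manual)


# Hypothetical toy entry: collocates recorded by a lexicographer
# versus collocates returned by automatic extraction.
manual_collocates = {"strong", "local", "tax", "public"}
extracted_collocates = {"strong", "local", "public", "central", "relevant"}

entry_coverage = coverage(manual_collocates, extracted_collocates)
print(f"{entry_coverage:.1f}%")
```

An entry counts towards the "100% coverage" groups above only when this value reaches 100 for that entry; the extra extracted collocates ("central", "relevant") do not lower the score.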
Why not always 100%?
11.8.2015 Herstmonceux castle, eLex 2015
• Errors in SLD – a small number (e.g. typos, wrong case
of a collocate under a certain syntactic structure)
• Different corpora and sketch grammars used
• Parameters for automatic extraction quite strict
• E.g. structure not exported if no collocates match the
minimum criteria → structure marked as not found by ADE
• On the other hand:
• Five to six times more collocates extracted
• Several syntactic structures in automatically extracted data,
which were not detected by lexicographers
• Several (good) examples match (more examples analysed)
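The effect of strict extraction parameters can be illustrated with a small sketch. The thresholds (min_freq, min_score), the field names and the sample data are assumptions made for the example; the slides do not state which criteria or values were actually used.

```python
def export_structures(candidates, min_freq=5, min_score=2.0):
    """Keep collocates that meet the minimum criteria; drop empty structures."""
    exported = {}
    for structure, collocates in candidates.items():
        kept = [
            c for c in collocates
            if c["freq"] >= min_freq and c["score"] >= min_score
        ]
        # A structure with no surviving collocates is not exported at all,
        # so a later comparison will report it as "not found".
        if kept:
            exported[structure] = kept
    return exported


# Hypothetical candidates for one headword, grouped by syntactic structure.
candidates = {
    "adjective + noun": [
        {"collocate": "strong", "freq": 12, "score": 7.1},
        {"collocate": "odd", "freq": 2, "score": 1.0},
    ],
    "noun + of + noun": [
        {"collocate": "rare", "freq": 1, "score": 0.5},
    ],
}

exported = export_structures(candidates)
print(sorted(exported))
```

With these made-up values the second structure disappears from the output entirely, even though a lexicographer might have recorded it, which is one way coverage of syntactic structures can fall below 100%.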
Post-processing
• Tasks that are automated:
• Converting extracted data into the correct form (lemma
+ collocate)
• Removing duplicate examples
• Cleaning examples of noise (e.g. removing any extra
spaces before full stops and commas)
• Assigning IDs of lemmas from the lexicon of word forms
• Other issues:
• False collocates (e.g. tagging problems)
• Incorrect examples (i.e. where the collocation does not
match the grammatical relation it belongs to)
• Grouping collocates, attributing them under senses, etc.
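Two of the automated tasks listed above, cleaning spaces before punctuation and removing duplicate examples, might look roughly like this. It is a minimal sketch with hypothetical function names, not the project's actual post-processing code.

```python
import re


def clean_example(text):
    """Remove whitespace that precedes a full stop or comma."""
    return re.sub(r"\s+([.,])", r"\1", text).strip()


def remove_duplicates(examples):
    """Clean each example and keep only its first occurrence, in order."""
    seen = set()
    unique = []
    for example in examples:
        cleaned = clean_example(example)
        if cleaned not in seen:
            seen.add(cleaned)
            unique.append(cleaned)
    return unique


# Hypothetical raw corpus examples with tokenisation noise.
raw_examples = [
    "This is an example , with noise .",
    "This is an example, with noise.",
    "Another example .",
]
cleaned_examples = remove_duplicates(raw_examples)
print(cleaned_examples)
```

Cleaning before comparing matters here: the first two examples differ only in spacing, so they collapse into one once the noise is removed.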
"Crowdsourcing" in lexicography:
(improving) the final product
(Abel & Meyer, 2013)
Crowdsourcing – dividing a complex
task into a series of simple ones
• Why is crowdsourcing needed in lexicography:
• challenges:
• lexicographers are facing increasing time constraints
& amounts of data
• lexicographers are overqualified for routine post-
editing of automatic procedures
• potential:
• non-expert individuals are talented, creative &
productive enough to solve such tasks
• modern technology makes using the potential of the
crowd simple, affordable & effective
Crowdsourcing - caveats
• estimate of the required investment with respect to
time, money & personnel is crucial
(should not take up more time &
resources than conventional methods)
• if fully integrated in the project,
microtasks can be designed according to
the same principles, use the same pre- &
post-processing chains & platforms
(economizing the initial investment)
Lessons learned
• Instructions must be clearly formulated and simple,
answers must not allow grading (only YES, NO, I
DON’T KNOW)
• not all automatically extracted data is suitable for
crowdsourcing:
• e.g. some grammatical relations are too complex for
evaluation
• users need to focus on some other objective:
competition, credits, money (micro payments)
• Gamification:
• examples: language games such as ESP Game (von Ahn,
2006) and Phrase Detectives (Chamberlain et al., 2008)
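Restricting answers to YES, NO and I DON'T KNOW also makes the crowd's judgements easy to combine. The sketch below shows one possible way to do so, by majority vote with an agreement threshold; the threshold value, the "UNDECIDED" label and the function name are assumptions for illustration and are not described in the slides.

```python
from collections import Counter

VALID_ANSWERS = {"YES", "NO", "I DON'T KNOW"}


def aggregate(answers, min_agreement=0.6):
    """Return YES or NO if enough decisive answers agree, else UNDECIDED."""
    # Ignore anything outside the three permitted answers.
    votes = [a for a in answers if a in VALID_ANSWERS]
    # "I DON'T KNOW" carries no evidence either way, so it is set aside.
    decisive = [a for a in votes if a != "I DON'T KNOW"]
    if not decisive:
        return "UNDECIDED"
    label, count = Counter(decisive).most_common(1)[0]
    if count / len(decisive) >= min_agreement:
        return label
    return "UNDECIDED"


# Hypothetical judgements on whether a collocate candidate is acceptable.
print(aggregate(["YES", "YES", "NO", "I DON'T KNOW"]))
print(aggregate(["YES", "NO"]))
```

Items that come back as UNDECIDED could then be passed on to a lexicographer, so expert time is spent only on the cases the crowd cannot settle.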
Lexicographical process of DCSL
DCSL – implementation and
future
• Meeting the needs of users
• Release of entries at each stage (thus, the dictionary is
available from the start)
• Making the database available to NLP community,
researchers etc.
• A parallel project for testing and improving the first
stages of the procedure: Collocations dictionary of
Slovene
Thank you!
• Funded by Slovenian Research Agency project:
Koncept madžarsko-slovenskega slovarja: od
jezikovnega vira do uporabnika (V6-1509)
