Aligning open linguistic data through crowdsourcing to build a broad multilingual lexicon. Presentation at the 5th International Conference on Language Documentation and Conservation (ICLDC), Honolulu, 2017
1. Martin Benjamin, Sina Mansour, Karl Aberer
DUCKS in a Row:
Aligning open linguistic data through crowdsourcing to build a broad multilingual lexicon
Vital Voices: Linking Language and Wellbeing
5th International Conference on Language Documentation and Conservation
Honolulu, Hawaii - March 5, 2017
4. Goal: A complete matrix of human expression across time and space
• As a knowledge resource
• As a data resource
5. In service since 1994 (originally at Yale Council on African Studies)
International NGO since 2009
• Registered non-profit in USA and Switzerland
Academic Home since 2013:
EPFL - Swiss Federal Institute of Technology in Lausanne
LSIR - Distributed Information Systems Laboratory
6. White House Big Data Initiative:
Launch Partner for Building the Data Innovation Ecosystem
Networking and Information Technology R&D Program
Office of Science and Technology Policy
39. equivalence
• Parallel
• Similar
• Explanatory
mkono (Swahili) = hand + arm (English)
?: might be transitive across languages
mkono (Swahili) = lima (Hawaiian)
Diagram labels: translations, difference
40. equivalence
• Parallel
• Similar
• Explanatory (Lexical Gaps)
hand (English) = 10.2 cm (most languages)
✗: not transitive across languages
translations
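The transitivity notes on the two equivalence slides can be sketched as a small rule for chaining translation links. This is only an illustration: the names `Transitivity`, `ASSUMED_TRANSITIVITY`, and `chain_transitivity` are hypothetical, the "might be" case is read here as applying to similar equivalence, and the assumption that parallel equivalence is safely transitive is not stated on the slides shown.

```python
from enum import IntEnum


class Transitivity(IntEnum):
    """How far a translation link can be trusted when chained across languages."""
    NO = 0     # not transitive (explanatory equivalence, lexical gaps)
    MAYBE = 1  # might be transitive (similar equivalence, e.g. mkono = hand + arm)
    YES = 2    # ASSUMPTION: parallel equivalence is treated as transitive


# Hypothetical mapping from equivalence type to transitivity.
ASSUMED_TRANSITIVITY = {
    "parallel": Transitivity.YES,
    "similar": Transitivity.MAYBE,
    "explanatory": Transitivity.NO,
}


def chain_transitivity(kinds):
    """Return the weakest transitivity among the links in a chain."""
    if not kinds:
        raise ValueError("a chain needs at least one link")
    return min(ASSUMED_TRANSITIVITY[kind] for kind in kinds)


# mkono (Swahili) -> hand + arm (English) is a similar link, so chaining it
# with a parallel link still only yields a "maybe".
mixed = chain_transitivity(["similar", "parallel"])
blocked = chain_transitivity(["parallel", "explanatory"])
```

Taking the minimum means a single explanatory link anywhere in the chain blocks the inferred translation, which matches the "not transitive" note for lexical gaps.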
42. • over 100,000 English-defined concepts from Princeton Wordnet
• Heavy Anglo-American bias
• James Cook yes, Kalaniʻōpuʻu-a-Kaiamamao no
• about 60 languages aligned via Global Wordnet
• over 800,000 English concepts from Wiktionary (in process)
• Wiktionary translations to many languages (highly problematic)
• other languages (Spanish already) can be pivots when:
• aligned to DUCKS
• entries have definitions
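The two pivot conditions above can be written as a simple check on an entry. This is a minimal sketch, not the project's data model: the `Entry` class, its field names, and the `can_pivot` function are hypothetical, and the concept identifier shown is a placeholder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    """A lexicon entry in a candidate pivot language (hypothetical structure)."""
    term: str
    language: str
    definition: Optional[str] = None
    concept_id: Optional[str] = None  # set once the entry is aligned to a concept


def can_pivot(entry: Entry) -> bool:
    """An entry can serve as a pivot only if it is aligned and has a definition."""
    is_aligned = entry.concept_id is not None
    has_definition = bool(entry.definition and entry.definition.strip())
    return is_aligned and has_definition


aligned = Entry("mano", "Spanish", definition="parte del cuerpo", concept_id="concept-0001")
undefined = Entry("mano", "Spanish", concept_id="concept-0001")
unaligned = Entry("mano", "Spanish", definition="parte del cuerpo")
```

Here only `aligned` satisfies both conditions; the other two entries would have to wait for a definition or an alignment before players could match against them.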
43. • Players match a term on the left with a concept on the right
• Multiple matches possible
• Bad definitions can be flagged
• Null matches possible for indigenous concepts
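One way player responses from such a game could be aggregated is sketched below. The slides do not describe a consensus rule, so the vote threshold `min_votes`, the function name `tally`, the response tuple layout, and the concept labels are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def tally(responses, min_votes=2):
    """Aggregate (term, concept, bad_definition) responses from players.

    concept is None when the player reports a null match.
    Returns (accepted, flags): accepted maps each term to every concept that
    reached min_votes, and flags counts bad-definition reports per concept.
    """
    votes = defaultdict(Counter)
    flags = Counter()
    for term, concept, bad_definition in responses:
        if bad_definition:
            flags[concept] += 1  # a flag is not counted as a match
            continue
        votes[term][concept] += 1

    accepted = {}
    for term, counter in votes.items():
        winners = [c for c, n in counter.items() if n >= min_votes]
        # Sort named concepts alphabetically and put the null match last.
        accepted[term] = sorted(winners, key=lambda c: (c is None, c or ""))
    return accepted, flags


responses = [
    ("mkono", "hand", False),
    ("mkono", "hand", False),
    ("mkono", "arm", False),
    ("mkono", "arm", False),
    ("mkono", "leg", False),
    ("term_without_match", None, False),
    ("term_without_match", None, False),
    ("mkono", "hand", True),
]
accepted, flags = tally(responses)
```

Because every concept that reaches the threshold is kept, a term such as mkono can end up matched to more than one concept, and a term whose players agree on a null match is recorded as such rather than forced onto an English concept.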
46. Switch flippable:
• About 60 wordlists and datasets: data prepared and
permissions granted
• SignTyp set for ~20 sign languages
• Comparative African Wordlist – pre-aligned, no need for
the tool
• Any lexicon in a usable digital format whose copyright
status permits reuse
47. What we can work with:
• Word lists
• part of speech is helpful
• Electronic versions of print dictionaries
• parse and play
• Digital dictionaries
• FLEx, etc
49. Fula (West Africa)
ABADA Ar
abada, abadaa, abadan DFZ Z<->
never(F) (with negation); ever(F); long ago
jamais(D) (avec la négation)(Z); jamais; il y a longtemps
Abada mi yahaali. (F): I have never gone. ; Je ne suis jamais allé.
abada pati (F): don't ever ; ne faîtes jamais (qqch)
gila abada (F): since long ago, forever ; depuis longtemps, toujours
11,000 Fula senses
• 332 clearly computable matches
• ~7500 matches for DUCKS
• ~3000 null matches for manual follow-up
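As a quick check on the Fula figures above, the three buckets can be added up and expressed as shares of the stated total. The numbers are the rounded ones from the slide, so the result is approximate; the variable names are illustrative only.

```python
# Rounded figures from the slide; "~" values are approximations.
total_senses = 11_000
buckets = {
    "clearly computable matches": 332,
    "matches for DUCKS": 7_500,
    "null matches for manual follow-up": 3_000,
}

# Sum of the three buckets, to compare against the stated total.
accounted = sum(buckets.values())

# Share of the stated total for each bucket, in percent.
shares = {name: round(100 * count / total_senses, 1) for name, count in buckets.items()}

for name, share in shares.items():
    print(f"{name}: {share}%")
print(f"accounted for: {accounted} of about {total_senses}")
```

The buckets sum to 10,832, consistent with a total of roughly 11,000 senses, with only about 3% matched automatically and roughly two thirds left for players to align.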