DUCKS in a Row

EPFL (École polytechnique fédérale de Lausanne)
Senior Scientist at EPFL and Executive Director at Kamusi Project
Martin Benjamin, Sina Mansour, Karl Aberer
DUCKS in a Row:
Aligning open linguistic data through crowdsourcing to build a broad multilingual lexicon
Vital Voices: Linking Language and Wellbeing
5th International Conference on Language Documentation and Conservation
Honolulu, Hawaii - March 5, 2017
1. Multilingual lexicography:
problems and principles
2. DUCKS: Data Unified Concept
Knowledge Sets
a. The tool
b. The data
3. Your language in DUCKS
kamusi is Swahili for dictionary
Goal: A complete matrix of human expression across time and space
• As a knowledge resource
• As a data resource
In service since 1994 (originally at Yale Council on African Studies)
International NGO since 2009
• Registered non-profit in USA and Switzerland
Academic Home since 2013:
EPFL - Swiss Federal Institute of Technology in Lausanne
LSIR - Distributed Information Systems Laboratory
White House Big Data Initiative:
Launch Partner for Building the Data Innovation Ecosystem
Networking and Information Technology R&D Program
Office of Science and Technology Policy
1. Multilingual lexicography:
problems and principles
2. DUCKS: Data Unified Concept
Knowledge Sets
a. The tool
b. The data
3. Your language in DUCKS
why multilingual dictionaries were impossible
light
lumineux, léger, allégé, léger
WOLF 02121424-a: léger, lumière
WOLF 01186408-a: léger
WOLF 00993117-a: léger, allégé, lumière
light
WOLF 00269989-a: lumière, lumineux, clair
PWN (English Wordnet):
• light x 47
WOLF (French Wordnet):
• light = lumière x 44
• light = léger x 37
• léger = lumière x 36
light
fr: lumineux, léger, allégé, léger
th: ที่แคลอรี่ต่ำ · fi: kaloriton · sw: pungufu
th: เบา · fi: kevyt · sw: -epesi
th: สว่าง · fi: valoisa · sw: -enye mwanga
th: ซึ่งไร้สาระ · fi: tyhjänpäiväinen · sw: -a kuchekesha
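A minimal sketch of the problem these slides illustrate, assuming a word-level bilingual list that glosses every term below as English "light" and keeps no sense information. The names `GLOSSED_AS_LIGHT`, `TRUE_FR_FI`, and `pivot_candidates` are hypothetical, and the "true" French-Finnish pairs follow the sense groupings shown elsewhere in the deck.

```python
from itertools import product

# Terms that a word-level bilingual list would each gloss as English "light"
# (taken from the slides; the dictionary layout itself is an assumption).
GLOSSED_AS_LIGHT = {
    "fr": ["lumineux", "léger", "allégé"],
    "fi": ["valoisa", "kevyt", "kaloriton", "tyhjänpäiväinen"],
    "sw": ["-enye mwanga", "-epesi", "pungufu", "-a kuchekesha"],
}

# Pairs that actually share a sense. "léger" appears twice because it
# covers both "not heavy" and "not serious".
TRUE_FR_FI = {
    ("lumineux", "valoisa"),
    ("léger", "kevyt"),
    ("léger", "tyhjänpäiväinen"),
    ("allégé", "kaloriton"),
}


def pivot_candidates(src, dst):
    """Pair every src term with every dst term that shares the English gloss."""
    return set(product(GLOSSED_AS_LIGHT[src], GLOSSED_AS_LIGHT[dst]))


candidates = pivot_candidates("fr", "fi")
spurious = candidates - TRUE_FR_FI
print(len(candidates), "candidate pairs,", len(spurious), "of them spurious")
# prints: 12 candidate pairs, 8 of them spurious
```

With only one polysemous English word as the pivot, two thirds of the French-Finnish candidates are already wrong (for example lumineux = kaloriton), and each added language multiplies the candidates again.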
how Kamusi makes a multilingual dictionary possible
light (not serious): fr: léger · th: ซึ่งไร้สาระ · fi: tyhjänpäiväinen / hölynpöly · sw: -a kuchekesha
light (not fattening): fr: allégé · th: ที่แคลอรี่ต่ำ · fi: kaloriton · sw: pungufu
light (not heavy): fr: léger · th: เบา · fi: kevyt · sw: -epesi
light (not dark): fr: lumineux · th: สว่าง · fi: valoisa · sw: -enye mwanga
light (not heavy): fr: léger · th: เบา · fi: kevyt · sw: -epesi
French léger in turn has further senses of its own:
• fr: léger (sandy)
• fr: léger (low alcohol)
• fr: léger (without much luggage)
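The sense-by-sense alignment above can be sketched as a lexicon keyed by concept rather than by spelling. This is an illustration under stated assumptions, not Kamusi's actual data model: the concept ids, the `LEXICON` layout, and the `translate` helper are all hypothetical. The terms come from the slides.

```python
# Hypothetical concept ids mapped to the terms that express them per language.
LEXICON = {
    "light/not-dark": {
        "en": ["light"], "fr": ["lumineux"], "th": ["สว่าง"],
        "fi": ["valoisa"], "sw": ["-enye mwanga"],
    },
    "light/not-heavy": {
        "en": ["light"], "fr": ["léger"], "th": ["เบา"],
        "fi": ["kevyt"], "sw": ["-epesi"],
    },
    "light/not-serious": {
        "en": ["light"], "fr": ["léger"], "th": ["ซึ่งไร้สาระ"],
        "fi": ["tyhjänpäiväinen"], "sw": ["-a kuchekesha"],
    },
    "light/not-fattening": {
        "en": ["light"], "fr": ["allégé"], "th": ["ที่แคลอรี่ต่ำ"],
        "fi": ["kaloriton"], "sw": ["pungufu"],
    },
}


def translate(term, src, dst):
    """Return {concept_id: [dst terms]} for each concept that term expresses in src."""
    return {
        concept_id: terms_by_lang.get(dst, [])
        for concept_id, terms_by_lang in LEXICON.items()
        if term in terms_by_lang.get(src, [])
    }


# French "léger" resolves to two separate concepts, each with its own Finnish term.
print(translate("léger", "fr", "fi"))
# French "lumineux" resolves to exactly one Swahili term without passing through English.
print(translate("lumineux", "fr", "sw"))
```

Because each row is a concept, a new sense such as léger (low alcohol) would be added as its own concept id instead of being attached to the spelling "léger".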
Side by Side Vocabulary Translation:
KamusiGOLD vs. Google Translate
translation equivalence
• Parallel
  hand (English) = main (French)
  ✓: transitive across languages
• Similar
  mkono (Swahili) = hand + arm (English)
  mkono (Swahili) = lima (Hawaiian)
  ⁇: might be transitive across languages
• Explanatory (Lexical Gaps)
  hand (English) = 10.2 cm (most languages)
  ✗: not transitive across languages
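A minimal sketch of how the three equivalence types could govern chaining a translation through a third language. The slides state only that parallel links are transitive, similar links might be, and explanatory links are not. The rule for mixed parallel and similar chains, and the names `Equivalence` and `chain`, are assumptions made for illustration.

```python
from enum import Enum
from typing import Optional


class Equivalence(Enum):
    PARALLEL = "parallel"        # e.g. hand (English) = main (French)
    SIMILAR = "similar"          # e.g. mkono (Swahili) = hand + arm (English)
    EXPLANATORY = "explanatory"  # e.g. hand (English) = 10.2 cm


def chain(first: Equivalence, second: Equivalence) -> Optional[Equivalence]:
    """Equivalence to propose for X~Z given X~Y (first) and Y~Z (second).

    Returns None when no link should be inferred.
    """
    if Equivalence.EXPLANATORY in (first, second):
        return None  # explanatory links are not transitive
    if first is Equivalence.PARALLEL and second is Equivalence.PARALLEL:
        return Equivalence.PARALLEL  # parallel links are transitive
    # Assumption: any chain involving a similar link yields at most a
    # similar link, which would still need human confirmation.
    return Equivalence.SIMILAR


print(chain(Equivalence.PARALLEL, Equivalence.PARALLEL))
print(chain(Equivalence.SIMILAR, Equivalence.PARALLEL))
print(chain(Equivalence.EXPLANATORY, Equivalence.PARALLEL))
```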
1. Multilingual lexicography:
problems and principles
2. DUCKS: Data Unified Concept
Knowledge Sets
a. The tool
b. The data
3. Your language in DUCKS
• over 100,000 English-defined concepts from Princeton Wordnet
• Heavy Anglo-American bias
• James Cook yes, Kalaniʻōpuʻu-a-Kaiamamao no
• about 60 languages aligned via Global Wordnet
• over 800,000 English concepts from Wiktionary (in process)
• Wiktionary translations to many languages (highly problematic)
• other languages (Spanish already) can be pivots when:
  • aligned to DUCKS
  • entries have definitions
• Players match a term from the left with a concept on the right
• Multiple matches possible
• Bad definitions can be flagged
• Null matches: on indigenous concepts:
Still to come:
• Multiplayer – validation by consensus
• Manual term searching
• DUCKS for FLEx
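Validation by consensus is listed as still to come, so the following is only a hypothetical sketch of what such a rule might look like, not the DUCKS implementation. It assumes each player casts one vote per term (a concept id, or a null match), and that a match is accepted once enough players agree. `NULL_MATCH`, `min_votes`, and `threshold` are invented for the example.

```python
from collections import Counter

NULL_MATCH = "<null>"  # hypothetical marker for "no concept on the right fits"


def consensus(votes, min_votes=3, threshold=0.7):
    """Decide the status of one term from the concept ids players chose.

    Returns (status, choice), where status is "pending", "accepted",
    or "disputed", and choice is set only when status is "accepted".
    """
    if len(votes) < min_votes:
        return "pending", None
    choice, count = Counter(votes).most_common(1)[0]
    if count / len(votes) >= threshold:
        return "accepted", choice
    return "disputed", None


print(consensus(["c1", "c1", "c1", "c2"]))            # 3 of 4 agree
print(consensus(["c1", "c2"]))                        # too few votes
print(consensus(["c1", "c2", "c1", "c2"]))            # split vote
print(consensus([NULL_MATCH, NULL_MATCH, NULL_MATCH]))  # agreed null match
```

An accepted null match would flag the term for manual follow-up, for example as a concept with no counterpart in the English-derived set.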
1. Multilingual lexicography:
problems and principles
2. DUCKS: Data Unified Concept
Knowledge Sets
a. The tool
b. The data
3. Your language in DUCKS
Switch flippable:
• About 60 wordlists and datasets, with data prepared and permissions granted
• SignTyp set for ~20 sign languages
• Comparative African Wordlist – pre-aligned, no need for the tool
• Any lexicon in a usable digital format whose copyright permits reuse
What we can work with:
• Word lists
  • part of speech is helpful
• Electronic versions of print dictionaries
  • parse and play
• Digital dictionaries
  • FLEx, etc.
Kemedzung (Cameroon) from FLEx
Fula (West Africa)
ABADA Ar
abada, abadaa, abadan DFZ Z<->
never(F) (with negation); ever(F); long ago
jamais(D) (avec la négation)(Z); jamais; il y a longtemps
Abada mi yahaali. (F): I have never gone. ; Je ne suis jamais allé.
abada pati (F): don't ever ; ne faîtes jamais (qqch)
gila abada (F): since long ago, forever ; depuis longtemps, toujours
11,000 Fula senses
• 332 clearly computable matches
• ~7500 matches for DUCKS
• ~3000 null matches for manual follow-up