Martin Haase: Linguistic Hacking [24c3]

326 views
272 views

Published on

A presentation by Martin Haase at the 24th Chaos Communication Congress, Berlin, December 2007.

http://events.ccc.de/congress/2007/Fahrplan/events/2284.en.html
http://www.youtube.com/watch?v=M3TYMmKxbqA
http://lanyrd.com/scgypk

Published in: Education, Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
326
On SlideShare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
3
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Martin Haase: Linguistic Hacking [24c3]

  1. 1. Linguistic HackingMartin.Haase@uni-bamberg.demaha@jabber.ccc.deHow to know what a text in an unknownlanguage is about?1
  2. 2. Contents• how to identify the language of a written text• in traditional ways,• with the help of computer technology.• how to get at least some information out ofan unknown text.2
  3. 3. “the intellectual challenge of creatively overcoming orcircumventing limitations.”HackingEric Raymond (1996):The New HackerʼsDictionary.3
  4. 4. Spoken texts?multi-language corpus of telephone calls4
  5. 5. Writing Systems• Roman (thousands of languages)• Cyrillic (> 60 languages)• Arabic (> 20 languages)• Devana̅garī (> 10 languages, notcounting derivative writing systems)• Hebrew (~ 3 – 5 languages)5
  6. 6. Devanāgarī!वनागरी6
  7. 7. 7
  8. 8. 8
  9. 9. 9
  10. 10. 10
  11. 11. Hebrew• Old and Modern Hebrew,• Ladino (with different varieties),• Judeo-Arabic,• Yiddish.11
  12. 12. 12
  13. 13. 13
  14. 14. Norman C. Ingle (1980):Language IdentificationTable. London: TechnicalTranslation International.14
  15. 15. 15
  16. 16. Computer-aidedidentification• frequencies of unique characters andcharacter strings• common words recognition• n-gram analysis16
  17. 17. “Text”17
  18. 18. ( TE), (TEX), (EXT), (XT )18
  19. 19. 19
  20. 20. variant of the unique character string approach20
  21. 21. compression efficiency21
  22. 22. referencemodel22
  23. 23. referencemodeltext in languageto be identified+23
  24. 24. referencemodeltext in languageto be identified+gzip24
  25. 25. referencemodeltext in languageto be identified+gzipcompression efficiency25
  26. 26. Interestingapplications• measuring linguistic difference> language families• determining types of text• spam detection?26
  27. 27. • TextCat (http://odur.let.rug.nl/vannoord/TextCat/Demo/), n-grambased, 76 languages, usable as a web application,• Languid (http://languid.cantbedone.org/), downloadableprogram, web application not running,• Langid (http://complingone.georgetown.edu/∼langid/), n-grambased, 65 languages, web application,• LanguageGuesser (http://www.xrce.xerox.com/cgi-bin/mltt/LanguageGuesser), frequency tests on characters and charactersequences, about 40 languages, web application,• Polyglot 3000 (http://www.polyglot3000.com/), corpora, methodunknown, currently 441 languages, closed-source Windowsfreeware. :-(27
  28. 28. approaching “content analysis”28
  29. 29. Hackerʼs approach• numbers, dates, words from anotherlanguage• typographic hints:• bold or italic print,• colored or underlined text chunks,• capital letters29
  30. 30. Zipfʼs lawVery frequent words are shorter and contain lesslexical information, whereas infrequent words arelonger and contain more lexical information.30
  31. 31. less lexical information implies more grammaticalinformation and vice versa31
  32. 32. most interesting for us:words with more specific lexical information32
  33. 33. Ignore all short words!(even if they reiterate throughout the text)33
  34. 34. Ua salalau lenei gagana i le lalolagi atoa. ‘O lenei fo‘igagana, ‘ua ‘avea ma gagana lona lua a le tele otagata ‘o le vasa Pasefika, e pei ‘o Samoa. E iai lemanatu, ‘o le gagana fa‘aperetania, ‘ua matua̅ talitonui ai le tele o tagata Samoa e fa‘apea ‘o le gagana emaua ai le atamai ma le poto. ‘E talitonu fo‘i nisi o ilatou, ‘e le̅ aoga la latou gagana. E le̅ sa‘o lea ta̅ofi,‘aua̅ e ‘avatu le gagana fa‘aperetania i Samoa, ‘ualeva ona atamamai ma popoto tagata Samoa e fai lolatou soifua ma lo latou lalolagi.34
  35. 35. Ua salalau lenei gagana i le lalolagi atoa. ‘O lenei fo‘igagana, ‘ua ‘avea ma gagana lona lua a le tele otagata ‘o le vasa Pasefika, e pei ‘o Samoa. E iai lemanatu, ‘o le gagana fa‘aperetania, ‘ua matua̅ talitonui ai le tele o tagata Samoa e fa‘apea ‘o le gagana emaua ai le atamai ma le poto. ‘E talitonu fo‘i nisi o ilatou, ‘e le̅ aoga la latou gagana. E le̅ sa‘o lea ta̅ofi,‘aua̅ e ‘avatu le gagana fa‘aperetania i Samoa, ‘ualeva ona atamamai ma popoto tagata Samoa e fai lolatou soifua ma lo latou lalolagi.35
  36. 36. Ua salalau lenei gagana i le lalolagi atoa. ‘O lenei fo‘igagana, ‘ua ‘avea ma gagana lona lua a le tele otagata ‘o le vasa Pasefika, e pei ‘o Samoa. E iai lemanatu, ‘o le gagana fa‘aperetania, ‘ua matua̅ talitonui ai le tele o tagata Samoa e fa‘apea ‘o le gagana emaua ai le atamai ma le poto. ‘E talitonu fo‘i nisi o ilatou, ‘e le̅ aoga la latou gagana. E le̅ sa‘o lea ta̅ofi,‘aua̅ e ‘avatu le gagana fa‘aperetania i Samoa, ‘ualeva ona atamamai ma popoto tagata Samoa e fai lolatou soifua ma lo latou lalolagi.36
  37. 37. Ua salalau lenei gagana i le lalolagi atoa. ‘O lenei fo‘igagana, ‘ua ‘avea ma gagana lona lua a le tele otagata ‘o le vasa Pasefika, e pei ‘o Samoa. E iai lemanatu, ‘o le gagana fa‘aperetania, ‘ua matua̅ talitonui ai le tele o tagata Samoa e fa‘apea ‘o le gagana emaua ai le atamai ma le poto. ‘E talitonu fo‘i nisi o ilatou, ‘e le̅ aoga la latou gagana. E le̅ sa‘o lea ta̅ofi,‘aua̅ e ‘avatu le gagana fa‘aperetania i Samoa, ‘ualeva ona atamamai ma popoto tagata Samoa e fai lolatou soifua ma lo latou lalolagi.37

×