Linguistic Hacking
Martin.Haase@uni-bamberg.de
maha@jabber.ccc.de
How to know what a text in an unknown
language is about?
1
Contents
• how to identify the language of a written text
  • in traditional ways,
  • with the help of computer technology,
• how to get at least some information out of an unknown text.
2
Hacking
“the intellectual challenge of creatively overcoming or circumventing limitations.”
Eric Raymond (1996): The New Hackerʼs Dictionary.
3
Spoken texts?
multi-language corpus of telephone calls
4
Writing Systems
• Roman (thousands of languages)
• Cyrillic (> 60 languages)
• Arabic (> 20 languages)
• Devanāgarī (> 10 languages, not counting derivative writing systems)
• Hebrew (~ 3 – 5 languages)
5
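As a rough illustration of how the writing system of a text might be determined by machine, the sketch below reads the script from the Unicode character names in Python's standard `unicodedata` module. The function name `dominant_script` is an assumption made for this example, and Unicode calls the Roman script "LATIN".

```python
import unicodedata
from collections import Counter


def dominant_script(text: str) -> str:
    """Return the most frequent script prefix among the letters of `text`.

    Hypothetical helper: it relies on the convention that Unicode character
    names start with the script, e.g. "CYRILLIC SMALL LETTER A".
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, spaces, punctuation and combining marks
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


if __name__ == "__main__":
    for sample in ["Text", "Привет мир", "देवनागरी", "שלום"]:
        print(sample, "->", dominant_script(sample))
```

The script only narrows the search: as the list above shows, a single writing system can serve anything from a handful to thousands of languages.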
Devanāgarī
देवनागरी
6
7
8
9
10
Hebrew
• Old and Modern Hebrew,
• Ladino (with different varieties),
• Judeo-Arabic,
• Yiddish.
11
12
13
Norman C. Ingle (1980):
Language Identification
Table. London: Technical
Translation International.
14
15
Computer-aided
identification
• frequencies of unique characters and character strings
• recognition of common words
• n-gram analysis
16
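A minimal sketch of common-word recognition could look like the following. The word lists are deliberately tiny and chosen only for illustration, and the names `COMMON_WORDS` and `guess_by_common_words` are assumptions of this example, not part of any of the tools discussed in the talk.

```python
import re

# Hypothetical, very small lists of frequent function words per language.
COMMON_WORDS = {
    "English": {"the", "and", "of", "to", "in", "is", "that"},
    "German": {"der", "die", "und", "das", "ist", "nicht", "ein"},
    "French": {"le", "la", "et", "les", "des", "est", "une"},
}


def guess_by_common_words(text: str) -> str:
    """Pick the language whose common words occur most often in `text`."""
    words = re.findall(r"\w+", text.lower())
    scores = {
        lang: sum(word in vocab for word in words)
        for lang, vocab in COMMON_WORDS.items()
    }
    best = max(scores, key=scores.get)
    # Without a single hit there is no evidence for any language.
    return best if scores[best] > 0 else "unknown"


if __name__ == "__main__":
    print(guess_by_common_words("The cat is in the garden"))
```

A realistic word list would need to be much longer, and short inputs may contain no listed word at all, which is one reason n-gram methods are attractive.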
“Text”
17
( TE), (TEX), (EXT), (XT )
18
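The trigrams shown for “Text” can be reproduced with a few lines of Python. This is only a sketch: padding the word with one space on each side is an assumption read off the slide, and `char_ngrams` and `ngram_profile` are hypothetical names.

```python
from collections import Counter


def char_ngrams(word: str, n: int = 3) -> list[str]:
    """Return the character n-grams of `word`, padded with one space per side."""
    padded = f" {word} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def ngram_profile(text: str, n: int = 3) -> Counter:
    """Count the n-grams of every whitespace-separated word in `text`."""
    profile = Counter()
    for word in text.split():
        profile.update(char_ngrams(word, n))
    return profile


if __name__ == "__main__":
    print(char_ngrams("TEXT"))
```

An identifier built on this idea would compare the n-gram profile of the unknown text with profiles computed from reference texts in known languages.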
19
variant of the unique character string approach
20
compression efficiency
21
reference model + text in language to be identified → gzip → compression efficiency
25
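The compression idea can be sketched as follows, assuming that "compression efficiency" is measured as the number of extra bytes the unknown text costs once it is appended to a reference text. The example uses Python's `zlib` module, which implements the same DEFLATE algorithm as gzip; the function names and the toy reference texts are assumptions of this example.

```python
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of `text` after DEFLATE compression at the highest level."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_bytes(reference: str, unknown: str) -> int:
    """Bytes added when `unknown` is appended to `reference` before compressing."""
    return compressed_size(reference + unknown) - compressed_size(reference)


def guess_language(unknown: str, references: dict[str, str]) -> str:
    """Return the language whose reference text makes `unknown` cheapest to add."""
    return min(references, key=lambda lang: extra_bytes(references[lang], unknown))


if __name__ == "__main__":
    # Toy reference texts; real use would need much larger samples.
    refs = {
        "English": "the quick brown fox jumps over the lazy dog and runs through the forest " * 3,
        "German": "der schnelle braune fuchs springt über den faulen hund und läuft durch den wald " * 3,
    }
    print(guess_language("the quick brown fox jumps over the lazy dog", refs))
```

The fewer extra bytes are needed, the more the unknown text resembles the reference, since the compressor can reuse strings it has already seen.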
Interesting
applications
• measuring linguistic difference
> language families
• determining types of text
• spam detection?
26
• TextCat (http://odur.let.rug.nl/vannoord/TextCat/Demo/), n-gram based, 76 languages, usable as a web application,
• Languid (http://languid.cantbedone.org/), downloadable program, web application not running,
• Langid (http://complingone.georgetown.edu/~langid/), n-gram based, 65 languages, web application,
• LanguageGuesser (http://www.xrce.xerox.com/cgi-bin/mltt/LanguageGuesser), frequency tests on characters and character sequences, about 40 languages, web application,
• Polyglot 3000 (http://www.polyglot3000.com/), corpora, method unknown, currently 441 languages, closed-source Windows freeware. :-(
27
approaching “content analysis”
28
Hackerʼs approach
• numbers, dates, words from another language
• typographic hints:
  • bold or italic print,
  • colored or underlined text chunks,
  • capital letters
29
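Some of these surface clues can be collected automatically from plain text. The sketch below pulls out numbers (including date-like digit groups) and capitalised words that do not start a sentence. The function name `surface_clues`, the regular expressions and the sample sentence are assumptions made for illustration, and typographic hints such as bold print cannot be recovered from plain text.

```python
import re


def surface_clues(text: str) -> dict[str, list[str]]:
    """Collect numbers and non-sentence-initial capitalised words from `text`."""
    # Digit groups joined by typical date/time/number separators.
    numbers = re.findall(r"\d+(?:[.,:/-]\d+)*", text)
    capitalised = []
    for sentence in re.split(r"[.!?]+\s+", text):
        words = re.findall(r"[^\W\d_]+", sentence)  # runs of letters only
        # The first word is capitalised anyway, so it carries no information.
        capitalised += [w for w in words[1:] if w[0].isupper()]
    return {"numbers": numbers, "capitalised": capitalised}


if __name__ == "__main__":
    # Made-up sample sentence, used only to show the output format.
    print(surface_clues("The meeting on 27.12.2007 took place in Berlin. It lasted 60 minutes."))
```

In languages that capitalise all nouns, such as German, the capitalised words are far less specific, so this clue has to be weighed against what is known about the writing conventions.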
Zipfʼs law (of abbreviation)
Very frequent words tend to be shorter and carry less lexical information, whereas infrequent words tend to be longer and carry more lexical information.
30
less lexical information implies more grammatical
information and vice versa
31
most interesting for us:
words with more specific lexical information
32
Ignore all short words!
(even if they recur throughout the text)
33
Ua salalau lenei gagana i le lalolagi atoa. ‘O lenei fo‘i
gagana, ‘ua ‘avea ma gagana lona lua a le tele o
tagata ‘o le vasa Pasefika, e pei ‘o Samoa. E iai le
manatu, ‘o le gagana fa‘aperetania, ‘ua matuā talitonu
i ai le tele o tagata Samoa e fa‘apea ‘o le gagana e
maua ai le atamai ma le poto. ‘E talitonu fo‘i nisi o i
latou, ‘e lē aoga la latou gagana. E lē sa‘o lea tāofi,
‘auā e ‘avatu le gagana fa‘aperetania i Samoa, ‘ua
leva ona atamamai ma popoto tagata Samoa e fai lo
latou soifua ma lo latou lalolagi.
34
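The advice to ignore short words can be tried out on the first two sentences of the text above. This is a sketch under assumptions: the length threshold of five characters is arbitrary, the name `content_word_candidates` is hypothetical, and the glottal stop sign (‘) is treated as part of a word only when it stands between letters.

```python
import re
from collections import Counter

# First two sentences of the sample text shown above.
SAMPLE = (
    "Ua salalau lenei gagana i le lalolagi atoa. ‘O lenei fo‘i gagana, "
    "‘ua ‘avea ma gagana lona lua a le tele o tagata ‘o le vasa Pasefika, "
    "e pei ‘o Samoa."
)


def content_word_candidates(text: str, min_length: int = 5, top: int = 5):
    """Return the most frequent words with at least `min_length` characters."""
    # Letter runs, optionally joined by an apostrophe-like sign inside the word.
    words = re.findall(r"[^\W\d_]+(?:[‘'ʻ][^\W\d_]+)*", text.lower())
    long_words = [w for w in words if len(w) >= min_length]
    return Counter(long_words).most_common(top)


if __name__ == "__main__":
    for word, count in content_word_candidates(SAMPLE):
        print(word, count)
```

On this excerpt the most frequent remaining word is "gagana", which makes it a good first candidate to look up. A fixed length threshold is crude, so the result should be read as a list of candidates rather than a summary.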

Martin Haase: Linguistic Hacking [24c3]
