Natural Language Processing (SupStat Inc)

Is that Dothraki or Valyrian?
and other NLP tasks with Python and NLTK
Charlie Redmon | SupStat, Inc.
August 18, 2014

Importing raw text
# Python 2, as in the original slides
import codecs
dothraki_f = codecs.open(
    "/home/cr/Python/westeros/dothraki.txt",
    encoding='utf-8')
dothraki_raw = dothraki_f.read()
print dothraki_raw
Athchomar chomakaan, [zhey] khal vezhven. Azha
anhaan asshilat ... Itte oakah! Jadi, zhey Jora
Andahli. Khal vezhven. Ajjalan anha zalat vitiherat
yer hatif. Kash qoy qoyi thira disse. Hash shafka
zali addrivat mae, zhey Khaleesi? Ishish chare
...
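For readers on Python 3, the same read step might look like the sketch below. It assumes only that the file is UTF-8 encoded; the helper name read_utf8 is made up for illustration and does not appear in the slides.

```python
import io


def read_utf8(path):
    """Return the whole file at `path` as a Unicode string."""
    # io.open accepts an explicit encoding on both Python 2 and Python 3,
    # and the with-block closes the file even if reading fails.
    with io.open(path, encoding="utf-8") as handle:
        return handle.read()
```

Calling read_utf8 on the Dothraki file would play the role of dothraki_raw above.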

Text processing: Cleaning
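import re
# Punctuation to strip: ASCII marks, em dash, right single quote,
# ellipsis character, and square brackets
punct_re = re.compile(
    ur'[.,;:?!\u2014\u2019\u2026\[\]]',
    re.UNICODE)
dothraki_proc = punct_re.sub('', dothraki_raw)
dothraki_proc = dothraki_proc.lower()
print dothraki_proc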
athchomar chomakaan zhey khal vezhven azha anhaan
asshilat itte oakah jadi zhey jora andahli khal
vezhven ajjalan anha zalat vitiherat yer hatif kash
qoy qoyi thira disse
...
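A minimal Python 3 sketch of the cleaning step, using the same character class as the slide. The function name clean_text is an illustrative choice, and the test sentence is the first sentence of the raw text shown earlier.

```python
import re

# Same character class as the slide: ASCII punctuation, em dash,
# right single quotation mark, ellipsis character, and square brackets.
PUNCT_RE = re.compile(r"[.,;:?!\u2014\u2019\u2026\[\]]")


def clean_text(raw):
    """Remove punctuation and lowercase the text."""
    return PUNCT_RE.sub("", raw).lower()


print(clean_text("Athchomar chomakaan, [zhey] khal vezhven."))
# athchomar chomakaan zhey khal vezhven
```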

Text processing: Tokenizing
dothraki_tokens = re.split(ur'\s+', dothraki_proc)
dothraki_types = set(dothraki_tokens)
print dothraki_types
set([u'izzi', u'ale', u'morea', u'vesazhao',
u'yeri', u'ishish', u'dalen', u'vesazhae', u'yera',
u'afisi', u'rhae', u'mawizzi', u'vee', u'arrisse',
u'ti', u'ven', u'rizh', u'afichak', u'gache',
u'zigerek', u'zigereo', u'drivoe', u'maaz',
u'zigeree', u'ayyeyoon', u'maan', u'mahrazhi',
u'ma', u'vos', u'movekkhi', u'mahrazhis',
u'meshafka', u'qisi', u'sani', u'ville', u'vikeesi',
u'ifak', u'javrathi', u'zisa', u'chek', u'nem',
...
])
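One detail worth checking when splitting on whitespace: if the text begins or ends with whitespace, re.split returns empty strings at the ends, and the empty string then shows up as a spurious type. The Python 3 sketch below uses str.split() instead; the helper name tokenize and the padded sample string are illustrative, not from the slides.

```python
import re


def tokenize(text):
    """Split on runs of whitespace, ignoring leading and trailing whitespace."""
    # str.split() with no argument never yields empty strings.
    return text.split()


padded = " khal vezhven khal \n"
print(re.split(r"\s+", padded))  # ['', 'khal', 'vezhven', 'khal', '']
tokens = tokenize(padded)
print(tokens)                    # ['khal', 'vezhven', 'khal']
print(sorted(set(tokens)))       # ['khal', 'vezhven']
```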

Inspecting the lexical distribution in a text
from nltk import FreqDist
dothraki_freqdist = FreqDist(dothraki_tokens)
print dothraki_freqdist
<FreqDist: u'anha': 50, u'vos': 40, u'me': 39,
u'ma': 38, u'zhey': 29, u'mae': 27, u'anni': 26,
u'hash': 23, u'yer': 23, u'khal': 16,
u'khaleesi': 16, u'mori': 15, u'jin': 13,
u'kisha': 12, u'nem': 11, u'vo': 11, u'che': 10,
u'jini': 10, u'she': 10, ...>
dothraki_freqdist.plot(20, cumulative=True)

CFD of Dothraki words
Top 10: anha, vos, me, ma, zhey, mae, anni, hash, yer, khal
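The numbers behind a cumulative plot can also be computed without NLTK. The Python 3 sketch below uses collections.Counter and expresses the running total as a share of all tokens, which is a variation on the slide's plot rather than a reproduction of it. The function name cumulative_shares and the six-token sample are illustrative only.

```python
from collections import Counter


def cumulative_shares(tokens, n):
    """Return (word, cumulative share of all tokens) for the n most common words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    running = 0
    shares = []
    for word, count in counts.most_common(n):
        running += count
        shares.append((word, running / total))
    return shares


# Toy token list, not the corpus counts from the slides.
sample = ["anha", "vos", "anha", "me", "anha", "vos"]
for word, share in cumulative_shares(sample, 2):
    print(word, round(share, 2))
# anha 0.5
# vos 0.83
```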

Valyrian vocabulary distribution
Astapori Valyrian (Top 10):
ji, me, do, espo, si, mysa, eji, ez, ivetrá, sa
High Valyrian (Top 10):
daor, se, issa, syt, ziry, hen, jemēle, lue, yne, avy

Feature 1: Consonant proportion
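def c_prop(word):
    # Proportion of the word's letters that are consonants
    c_num = 0
    for letter in u'bcdfgjklmnpqrstvxz\u00f1':
        c_num += word.count(letter)
    # float() avoids integer division under Python 2
    return c_num / float(len(word))
c_prop(u'z\u016bgusy')
0.5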

Word-internal consonant proportions across languages

Feature 2: Obstruent proportion
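def obstruent_prop(word):
    # Proportion of the word's letters that are obstruents
    obstruent_num = 0
    for letter in u'bcdfgjkpqstvxz':
        obstruent_num += word.count(letter)
    # float() avoids integer division under Python 2
    return obstruent_num / float(len(word))
obstruent_prop(u'\u012blvi')
0.25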
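The two proportion features differ only in the letter set, so they can share one helper. This is a Python 3 sketch; the name letter_proportion and the choice to return 0.0 for an empty string are assumptions made here, while the two letter sets are copied from the slides.

```python
CONSONANTS = "bcdfgjklmnpqrstvxz\u00f1"  # letter set from the consonant slide
OBSTRUENTS = "bcdfgjkpqstvxz"            # letter set from the obstruent slide


def letter_proportion(word, letters):
    """Share of the characters in `word` that belong to `letters`."""
    if not word:
        return 0.0  # assumption: an empty string scores 0 rather than raising
    return sum(1 for ch in word if ch in letters) / len(word)


print(letter_proportion("z\u016bgusy", CONSONANTS))  # 0.5
print(letter_proportion("\u012blvi", OBSTRUENTS))    # 0.25
```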

Word-internal obstruent proportions across languages

Feature 3: Coda presence
def c_coda(word):
    if word[-1] in u'bcdfgjklmnpqrstvxz\u00f1':
        return 1
    else:
        return 0
def obstruent_coda(word):
    if word[-1] in u'bcdfgjkpqstvxz':
        return 1
    else:
        return 0
c_coda(u'lysoon')
1
obstruent_coda(u'lysoon')
0

Mean coda consonant presence across languages

Mean coda obstruent presence across languages

Feature 4: Consonant clusters
regex = (ur'[bcdfghjklmnpqrstvxz\u00f1]'
         ur'[bcdfghjklmnpqrstvxz\u00f1]+')
def c_cluster(word):
    # Count runs of two or more consecutive consonants
    cc_set = re.findall(regex, word, re.UNICODE)
    return len(cc_set)
c_cluster(u'avvirsosh')
3

Mean consonant cluster frequency across languages

Feature 5: Obstruent clusters
regex1 = ur'[bcdfghjkpqstvxz][bcdfghjkpqstvxz]+'
def obs_cluster(word):
    # Count runs of two or more consecutive obstruents
    oo_set = re.findall(regex1, word, re.UNICODE)
    return len(oo_set)
obs_cluster(u'avvirsosh')
2
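To see which clusters produce the counts above, it can help to print the matches themselves. In this Python 3 sketch the pattern [X]{2,} is used as an equivalent spelling of [X][X]+; the constant names and the helper clusters are illustrative.

```python
import re

CONSONANT_RUN = re.compile("[bcdfghjklmnpqrstvxz\u00f1]{2,}")
OBSTRUENT_RUN = re.compile("[bcdfghjkpqstvxz]{2,}")


def clusters(word, pattern):
    """Return the non-overlapping runs of two or more letters matched by `pattern`."""
    return pattern.findall(word)


print(clusters("avvirsosh", CONSONANT_RUN))  # ['vv', 'rs', 'sh']
print(clusters("avvirsosh", OBSTRUENT_RUN))  # ['vv', 'sh']
```

The sequence rs counts as a consonant cluster but not as an obstruent cluster, because r is absent from the obstruent set.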

Mean obstruent cluster frequency across languages

Feature 6: Vowel clusters
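regex2 = ur'[bcdfghjklmnpqrstvxz\u00f1]+'
def v_cluster(word):
    # Split on consonant runs; pieces longer than one letter are vowel clusters.
    # flags must be a keyword: the third positional argument of re.split is maxsplit
    v_set = re.split(regex2, word, flags=re.UNICODE)
    vv_set = [v for v in v_set if len(v) > 1]
    return len(vv_set)
v_cluster(u'haeshi')
1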

Mean vowel cluster frequency across languages
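The per-language means shown in these charts amount to averaging a per-word feature over a word list. The Python 3 sketch below shows one way to do that; mean_feature is a hypothetical helper, and the two word lists are only the top-10 lists from the frequency slides, so the printed values are not the figures' values.

```python
import re

CONSONANT_RUN = re.compile("[bcdfghjklmnpqrstvxz\u00f1]+")


def v_cluster(word):
    """Count vowel sequences longer than one letter, as on the slide."""
    return sum(1 for piece in CONSONANT_RUN.split(word) if len(piece) > 1)


def mean_feature(words, feature):
    """Average a per-word feature function over a list of words."""
    return sum(feature(w) for w in words) / len(words)


# Top-10 lists copied from the frequency slides.
dothraki = ["anha", "vos", "me", "ma", "zhey", "mae", "anni", "hash", "yer", "khal"]
high_valyrian = ["daor", "se", "issa", "syt", "ziry", "hen",
                 "jem\u0113le", "lue", "yne", "avy"]

print(mean_feature(dothraki, v_cluster))
print(mean_feature(high_valyrian, v_cluster))
```

Any of the other feature functions can be passed in place of v_cluster.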

Assamese corpus files
import os
directory = "/home/cr/Documents/NLPwP_pres/TDIL_assamese_corpus_data"
os.listdir(directory)
['subj_art2.txt', 'subj_politics1.txt', 'lit3.txt',
'drama.txt', 'religion2.txt', 'criticism2.txt',
'criticism1.txt', 'subj_science3.txt',
'ref_encyclopaedia-entry.txt', 'subj_science2.txt',
'subj_social-studies.txt', 'music.txt', 'subj_art1.txt
'subj_science1.txt', 'subj_art3.txt', 'news.txt',
'subj_sociology.txt', 'criticism3.txt', 'lit8.txt',
'subj_history1.txt', 'lit4.txt', 'lit6.txt', 'religion
'subj_law.txt', 'lit7.txt', 'religion1.txt', 'criticis
'lit5.txt', 'subj_math.txt', 'lit1.txt', 'subj_science
'subj_science_5.txt', 'subj_history2.txt', 'lit2.txt',
'subj_science4.txt', 'letter.txt']
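The slides do not show how the corpus files are read into memory. One possible way, sketched in Python 3 and assuming the files are UTF-8 encoded, is to load every .txt file into a dictionary keyed by file name. The function name load_corpus is hypothetical.

```python
import io
import os


def load_corpus(directory):
    """Map each .txt file name in `directory` to its decoded contents."""
    texts = {}
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".txt"):
            continue
        path = os.path.join(directory, name)
        # Assumption: the corpus files are UTF-8 encoded.
        with io.open(path, encoding="utf-8") as handle:
            texts[name] = handle.read()
    return texts
```

Under that assumption, load_corpus(directory)['lit5.txt'] would give the kind of string that the later slides call assamese_sample_raw.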

Assamese sample: ‘lit5.txt’

Frequency of the sound /x/ in ’lit5.txt’
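# All three letters together
len(re.findall(ur'[\u09b6\u09b7\u09b8]',
    assamese_sample_raw, re.UNICODE))
1313
# Each letter separately; the three counts sum to 1313
len(re.findall(ur'\u09b6', assamese_sample_raw,
    re.UNICODE))
298
len(re.findall(ur'\u09b7', assamese_sample_raw,
    re.UNICODE))
195
len(re.findall(ur'\u09b8', assamese_sample_raw,
    re.UNICODE))
820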
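The per-letter counts can also be gathered in a single pass, which makes it easy to confirm that they add up to the combined count. This is a Python 3 sketch; letter_counts is an illustrative name and the sample is a toy two-word string, not corpus data.

```python
from collections import Counter

SIBILANT_LETTERS = "\u09b6\u09b7\u09b8"  # the three letters from the slide


def letter_counts(text, letters=SIBILANT_LETTERS):
    """Count each of `letters` in `text` in a single pass."""
    counts = Counter(ch for ch in text if ch in letters)
    return {letter: counts[letter] for letter in letters}


# Toy two-word string, not corpus data.
sample = "\u09b8\u09be\u0997\u09f0 \u0986\u0995\u09be\u09b6"
counts = letter_counts(sample)
print(counts["\u09b8"], counts["\u09b6"], counts["\u09b7"], sum(counts.values()))
# 1 1 0 2
```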

Positional restrictions
Beginning a word:
len(re.findall(ur'\b[\u09b6\u09b7\u09b8]',
    assamese_sample_raw, re.UNICODE))
1129
Ending a word:
len(re.findall(ur'[\u09b6\u09b7\u09b8]\b',
    assamese_sample_raw, re.UNICODE))
895
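A possible caveat with \b in this script: Python's re does not treat dependent vowel signs such as U+09BE as word characters, so a word boundary is also found between a vowel sign and the following consonant inside a word. That may be part of why the word-initial and word-final counts together exceed the 1313 total. The Python 3 sketch below contrasts \b with a whitespace-based test; both function names are illustrative, and the whitespace test is only one alternative (it ignores punctuation).

```python
import re

SIBILANTS = "\u09b6\u09b7\u09b8"


def initial_by_boundary(text):
    """Count sibilant letters preceded by a regex word boundary, as on the slide."""
    return len(re.findall(r"\b[%s]" % SIBILANTS, text))


def initial_by_whitespace(text):
    """Count sibilant letters at the start of the text or right after whitespace."""
    return len(re.findall(r"(?<!\S)[%s]" % SIBILANTS, text))


# Toy word whose only sibilant is its last letter, after a vowel sign.
word = "\u0986\u0995\u09be\u09b6"
print(initial_by_boundary(word), initial_by_whitespace(word))
# 1 0
```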

Positional restrictions
Following /a/:
len(re.findall(ur'\u09be[\u09b6\u09b7\u09b8]',
    assamese_sample_raw, re.UNICODE))
57
Following /i/:
len(re.findall(ur'[\u09bf\u09c0][\u09b6\u09b7\u09b8]',
    assamese_sample_raw, re.UNICODE))
70
Following /u/:
len(re.findall(ur'[\u09c1\u09c2][\u09b6\u09b7\u09b8]',
    assamese_sample_raw, re.UNICODE))
10
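The three "following" queries share one shape, so they can be driven from a small table of vowel signs. This Python 3 sketch uses the code points from the slides; count_following and VOWEL_SIGNS are illustrative names, and the sample is a toy word rather than corpus data. Like the slide patterns, it only sees dependent vowel signs, not independent vowel letters.

```python
import re

SIBILANTS = "\u09b6\u09b7\u09b8"

# Dependent vowel signs used in the slide patterns.
VOWEL_SIGNS = {
    "a": "\u09be",
    "i": "\u09bf\u09c0",
    "u": "\u09c1\u09c2",
}


def count_following(text, preceding, targets=SIBILANTS):
    """Count target letters that come directly after any character in `preceding`."""
    pattern = "[%s][%s]" % (preceding, targets)
    return len(re.findall(pattern, text))


sample = "\u0986\u0995\u09be\u09b6"  # toy word, not corpus data
print({v: count_following(sample, signs) for v, signs in VOWEL_SIGNS.items()})
# {'a': 1, 'i': 0, 'u': 0}
```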

Further work
Incorporate segmental parameters into classifier (fix Unicode issues with NLTK's classify module)
Use classifier to predict assignment of random words from Westeros to Dothraki, Astapori Valyrian, and High Valyrian languages
Isolate most important word-internal parameters in classification model (log-likelihood ranking in Naive Bayes model)
Use full distributional account of select Assamese consonants as priors in acoustic classification model
