Is that Dothraki or Valyrian?
and other NLP tasks with Python and NLTK
Charlie Redmon | SupStat, Inc.
August 18, 2014
Dothraki
Astapori Valyrian
High Valyrian
Importing raw text
import codecs

dothraki_f = codecs.open(
    "/home/cr/Python/westeros/dothraki.txt",
    encoding='utf-8')
dothraki_raw = dothraki_f.read()
print dothraki_raw
Athchomar chomakaan, [zhey] khal vezhven. Azha
anhaan asshilat ... Itte oakah! Jadi, zhey Jora
Andahli. Khal vezhven. Ajjalan anha zalat vitiherat
yer hatif. Kash qoy qoyi thira disse. Hash shafka
zali addrivat mae, zhey Khaleesi? Ishish chare
...
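The slide code is Python 2. As a minimal sketch of the same step that also runs on Python 3, `io.open` can stand in for `codecs.open`; the helper name `read_text` is an assumption for illustration, not part of the talk.

```python
import io

def read_text(path, encoding="utf-8"):
    """Return the full contents of a text file as a Unicode string."""
    # io.open accepts an encoding on both Python 2.7 and Python 3,
    # and the with-block closes the file even if reading fails.
    with io.open(path, encoding=encoding) as handle:
        return handle.read()
```

Calling `read_text("/home/cr/Python/westeros/dothraki.txt")` would then play the role of `dothraki_raw` above.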
Text processing: Cleaning
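import re

# Remove ASCII punctuation, em dash, right single quote,
# ellipsis character, and square brackets
punct_re = re.compile(
    ur'[.,;:?!\u2014\u2019\u2026\[\]]',
    re.UNICODE)
dothraki_proc = punct_re.sub('', dothraki_raw)
dothraki_proc = dothraki_proc.lower()
print dothraki_proc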
athchomar chomakaan zhey khal vezhven azha anhaan
asshilat itte oakah jadi zhey jora andahli khal
vezhven ajjalan anha zalat vitiherat yer hatif kash
qoy qoyi thira disse
...
Text processing: Tokenizing
# Split on runs of whitespace; the set of tokens gives the word types
dothraki_tokens = re.split(ur'\s+', dothraki_proc)
dothraki_types = set(dothraki_tokens)
print dothraki_types
set([u'izzi', u'ale', u'morea', u'vesazhao',
u'yeri', u'ishish', u'dalen', u'vesazhae', u'yera',
u'afisi', u'rhae', u'mawizzi', u'vee', u'arrisse',
u'ti', u'ven', u'rizh', u'afichak', u'gache',
u'zigerek', u'zigereo', u'drivoe', u'maaz',
u'zigeree', u'ayyeyoon', u'maan', u'mahrazhi',
u'ma', u'vos', u'movekkhi', u'mahrazhis',
u'meshafka', u'qisi', u'sani', u'ville', u'vikeesi',
u'ifak', u'javrathi', u'zisa', u'chek', u'nem',
...
])
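The cleaning and tokenizing steps can be combined into one function. The sketch below is a Python 3 adaptation, not the code from the talk: the name `tokenize` is hypothetical, and it uses `str.split()` instead of `re.split` so that leading or trailing whitespace cannot produce empty tokens.

```python
import re

# Same punctuation set as the cleaning slide: ASCII marks, em dash,
# right single quote, ellipsis character, and square brackets.
PUNCT_RE = re.compile(r"[.,;:?!\u2014\u2019\u2026\[\]]")

def tokenize(raw):
    """Strip punctuation, lowercase, and split on whitespace."""
    cleaned = PUNCT_RE.sub("", raw).lower()
    # split() with no argument collapses whitespace runs and never
    # returns empty strings.
    return cleaned.split()

# Opening of the Dothraki sample shown on the import slide.
sample = ("Athchomar chomakaan, [zhey] khal vezhven. "
          "Azha anhaan asshilat ... Itte oakah!")
tokens = tokenize(sample)
types = set(tokens)
```

Empty tokens matter later on: the proportion features divide by `len(word)`, so an empty string would raise a `ZeroDivisionError`.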
Inspecting the lexical distribution in a text
from nltk import FreqDist

dothraki_freqdist = FreqDist(dothraki_tokens)
print dothraki_freqdist
<FreqDist: u'anha': 50, u'vos': 40, u'me': 39,
u'ma': 38, u'zhey': 29, u'mae': 27, u'anni': 26,
u'hash': 23, u'yer': 23, u'khal': 16,
u'khaleesi': 16, u'mori': 15, u'jin': 13,
u'kisha': 12, u'nem': 11, u'vo': 11, u'che': 10,
u'jini': 10, u'she': 10, ... >
# Plot the cumulative counts of the 20 most frequent words
dothraki_freqdist.plot(20, cumulative=True)
Cumulative frequency distribution (CFD) of Dothraki words
Top 10: anha, vos, me, ma, zhey, mae, anni, hash, yer, khal
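For readers without NLTK, the numbers behind a cumulative frequency plot can be approximated with the standard library. This is a stand-in sketch rather than NLTK's implementation; `cumulative_share` is a hypothetical name, and the toy token list reuses words from the slide with made-up counts.

```python
from collections import Counter
from itertools import accumulate

def cumulative_share(tokens, n):
    """Return (word, cumulative share of all tokens) for the n most frequent words."""
    counts = Counter(tokens)
    top = counts.most_common(n)
    total = sum(counts.values())
    # Running total of the counts, in descending frequency order.
    running = accumulate(count for _, count in top)
    return [(word, cum / total) for (word, _), cum in zip(top, running)]

# Toy input for illustration only; these are not the corpus counts.
toy_tokens = "anha vos anha me anha vos".split()
shares = cumulative_share(toy_tokens, 3)
```

Each pair gives a word and the fraction of the text covered by that word and all more frequent ones, which is the curve the cumulative plot draws.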
Valyrian vocabulary distribution
Astapori Valyrian (Top 10):
ji, me, do, espo, si, mysa, eji, ez, ivetrá, sa
High Valyrian (Top 10):
daor, se, issa, syt, ziry, hen, jemēle, lue, yne, avy
Feature 1: Consonant proportion
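def c_prop(word):
    # Proportion of the letters in a word that are consonants
    c_num = 0
    for letter in u'bcdfgjklmnpqrstvxz\u00f1':
        c_num += word.count(letter)
    # float() avoids integer division under Python 2
    return float(c_num) / len(word)

c_prop(u'z\u016bgusy')
0.5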
Word-internal consonant proportions across languages
Feature 2: Obstruent proportion
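def obstruent_prop(word):
    # Proportion of the letters in a word that are obstruents
    obstruent_num = 0
    for letter in u'bcdfgjkpqstvxz':
        obstruent_num += word.count(letter)
    # float() avoids integer division under Python 2
    return float(obstruent_num) / len(word)

obstruent_prop(u'\u012blvi')
0.25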
Word-internal obstruent proportions across languages
Feature 3: Coda presence
def c_coda(word):
    # 1 if the word ends in a consonant, else 0
    if word[-1] in u'bcdfgjklmnpqrstvxz\u00f1':
        return 1
    else:
        return 0

def obstruent_coda(word):
    # 1 if the word ends in an obstruent, else 0
    if word[-1] in u'bcdfgjkpqstvxz':
        return 1
    else:
        return 0

c_coda(u'lysoon')
1
obstruent_coda(u'lysoon')
0
Mean coda consonant presence across languages
Mean coda obstruent presence across languages
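Features 1 to 3 differ only in the letter class they test, so they can share two small helpers. The sketch below is a Python 3 restatement under that assumption; the names `proportion`, `has_coda`, and `segment_features` are hypothetical, and the letter classes are copied from the slides as written (they contain no `h`, `w`, or `y`).

```python
# Letter classes as they appear on the Feature 1-3 slides.
CONSONANTS = "bcdfgjklmnpqrstvxz\u00f1"
OBSTRUENTS = "bcdfgjkpqstvxz"

def proportion(word, letters):
    """Fraction of the characters in `word` that belong to `letters`."""
    # Python 3 division is true division, so no float() cast is needed.
    return sum(ch in letters for ch in word) / len(word)

def has_coda(word, letters):
    """1 if the final character of `word` is in `letters`, else 0."""
    return int(word[-1] in letters)

def segment_features(word):
    """Bundle the four word-level values into one dictionary."""
    return {
        "c_prop": proportion(word, CONSONANTS),
        "obstruent_prop": proportion(word, OBSTRUENTS),
        "c_coda": has_coda(word, CONSONANTS),
        "obstruent_coda": has_coda(word, OBSTRUENTS),
    }
```

A dictionary of named values is the shape NLTK's classifiers expect for a feature set, which is one reason to bundle the features this way. Both helpers assume a non-empty word.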
Feature 4: Consonant clusters
# A consonant followed by one or more consonants
regex = ur'[bcdfghjklmnpqrstvxz\u00f1][bcdfghjklmnpqrstvxz\u00f1]+'

def c_cluster(word):
    # Number of consonant clusters in the word
    cc_set = re.findall(regex, word, re.UNICODE)
    return len(cc_set)

c_cluster(u'avvirsosh')
3
Mean consonant cluster frequency across languages
Feature 5: Obstruent clusters
# An obstruent followed by one or more obstruents
regex1 = ur'[bcdfghjkpqstvxz][bcdfghjkpqstvxz]+'

def obs_cluster(word):
    # Number of obstruent clusters in the word
    oo_set = re.findall(regex1, word, re.UNICODE)
    return len(oo_set)

obs_cluster(u'avvirsosh')
2
Mean obstruent cluster frequency across languages
Feature 6: Vowel clusters
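# One or more consonants; splitting on this leaves the vowel sequences
regex2 = ur'[bcdfghjklmnpqrstvxz\u00f1]+'

def v_cluster(word):
    # flags must be passed by keyword: the third positional
    # argument of re.split is maxsplit, not flags
    v_set = re.split(regex2, word, flags=re.UNICODE)
    vv_set = [v for v in v_set if len(v) > 1]
    return len(vv_set)

v_cluster(u'haeshi')
1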
Mean vowel cluster frequency across languages
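The three cluster features follow one pattern: count maximal runs of two or more characters from some class. A Python 3 sketch of that idea follows; `count_clusters` and `count_vowel_clusters` are hypothetical names, and the letter classes are the ones from the cluster slides, which include `h`, unlike Features 1 to 3.

```python
import re

# Letter classes as they appear on the Feature 4-6 slides.
CONSONANTS = "bcdfghjklmnpqrstvxz\u00f1"
OBSTRUENTS = "bcdfghjkpqstvxz"

def count_clusters(word, letters):
    """Count non-overlapping runs of two or more characters from `letters`."""
    # "[...]{2,}" matches the same strings as "[...][...]+" on the slides.
    pattern = "[" + letters + "]{2,}"
    return len(re.findall(pattern, word))

def count_vowel_clusters(word):
    """Count stretches of two or more non-consonant characters."""
    # Splitting on consonant runs leaves the material between them;
    # anything not in CONSONANTS is treated as a vowel here.
    pieces = re.split("[" + CONSONANTS + "]+", word)
    return sum(len(piece) > 1 for piece in pieces)
```

Because vowels are defined negatively, characters such as `y`, `w`, or an apostrophe count toward vowel clusters; whether that is desirable depends on the romanization.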
Data from real languages
TDIL Assamese Corpus
Assamese corpus files
import os

directory = "/home/cr/Documents/NLPwP_pres/TDIL_assamese_corpus_data"
os.listdir(directory)
['subj_art2.txt', 'subj_politics1.txt', 'lit3.txt',
'drama.txt', 'religion2.txt', 'criticism2.txt',
'criticism1.txt', 'subj_science3.txt',
'ref_encyclopaedia-entry.txt', 'subj_science2.txt',
'subj_social-studies.txt', 'music.txt', 'subj_art1.txt
'subj_science1.txt', 'subj_art3.txt', 'news.txt',
'subj_sociology.txt', 'criticism3.txt', 'lit8.txt',
'subj_history1.txt', 'lit4.txt', 'lit6.txt', 'religion
'subj_law.txt', 'lit7.txt', 'religion1.txt', 'criticis
'lit5.txt', 'subj_math.txt', 'lit1.txt', 'subj_science
'subj_science_5.txt', 'subj_history2.txt', 'lit2.txt',
'subj_science4.txt', 'letter.txt']
Assamese sample: ‘lit5.txt’
Frequency of the sound /x/ in 'lit5.txt'
# All three sibilant letters (U+09B6, U+09B7, U+09B8) together
len(re.findall(ur'[\u09b6\u09b7\u09b8]',
    assamese_sample_raw, re.UNICODE))
1313
len(re.findall(ur'\u09b6', assamese_sample_raw,
    re.UNICODE))
298
len(re.findall(ur'\u09b7', assamese_sample_raw,
    re.UNICODE))
195
len(re.findall(ur'\u09b8', assamese_sample_raw,
    re.UNICODE))
820
Positional restrictions
Beginning a word:
len(re.findall(ur'\b[\u09b6\u09b7\u09b8]',
    assamese_sample_raw, re.UNICODE))
1129
Ending a word:
len(re.findall(ur'[\u09b6\u09b7\u09b8]\b',
    assamese_sample_raw, re.UNICODE))
895
Positional restrictions
Following /a/:
len(re.findall(ur'\u09be[\u09b6\u09b7\u09b8]',
    assamese_sample_raw, re.UNICODE))
57
Following /i/:
len(re.findall(ur'[\u09bf\u09c0][\u09b6\u09b7\u09b8]',
    assamese_sample_raw, re.UNICODE))
70
Following /u/:
len(re.findall(ur'[\u09c1\u09c2][\u09b6\u09b7\u09b8]',
    assamese_sample_raw, re.UNICODE))
10
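One caveat worth checking with the word-boundary counts: Python's `\b` is defined in terms of `\w`, and the dependent vowel signs of the Bengali-Assamese script (for example U+09BE) are combining marks that `\w` does not match. A letter next to a vowel sign in the middle of a word can therefore look like it sits at a word edge. The Python 3 sketch below compares `\b` against plain whitespace splitting on four short words chosen only for illustration; `positional_counts` is a hypothetical helper, and this is not the procedure used in the talk.

```python
import re

SIBILANTS = "\u09b6\u09b7\u09b8"  # the three letters counted on the slides

def positional_counts(text):
    """Count words that begin / end with a sibilant letter, splitting on whitespace."""
    words = text.split()
    initial = sum(w[0] in SIBILANTS for w in words)
    final = sum(w[-1] in SIBILANTS for w in words)
    return initial, final

# Illustrative sample, written with escapes to avoid encoding problems.
sample = " ".join([
    "\u09b8\u09be\u0997\u09f0",  # 'sea': sibilant first, then a vowel sign
    "\u09ad\u09be\u09b7\u09be",  # 'language': sibilant between two vowel signs
    "\u09a6\u09c7\u09b6",        # 'country': sibilant last, after a vowel sign
    "\u0985\u09b8\u09ae",        # 'Assam': sibilant between two full letters
])

split_initial, split_final = positional_counts(sample)

# The same questions asked with \b, as on the slides.
regex_initial = len(re.findall(r"\b[" + SIBILANTS + "]", sample))
regex_final = len(re.findall("[" + SIBILANTS + r"]\b", sample))
```

On this sample the `\b` counts come out higher than the whitespace-based ones, because a vowel sign on either side of a sibilant is treated as a boundary. Real corpus text would also need punctuation such as the danda (U+0964) stripped before splitting.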
Further work
Incorporate segmental parameters into classifier (fix Unicode issues with NLTK's classify module)
Use classifier to predict assignment of random words from Westeros to Dothraki, Astapori Valyrian, and High Valyrian languages
Isolate most important word-internal parameters in classification model (log-likelihood ranking in Naive Bayes model)
Use full distributional account of select Assamese consonants as priors in acoustic classification model
Thank you
