Natural Language Processing (SupStat Inc)

SupStat Inc, Natural Language Processing, NYC data science academy


  1. Is that Dothraki or Valyrian? And other NLP tasks with Python and NLTK. Charlie Redmon | SupStat, Inc. | August 18, 2014
  2. Dothraki
  3. Astapori Valyrian
  4. High Valyrian
  5. Importing raw text

         import codecs

         # Read the UTF-8 source file into a unicode string (Python 2)
         dothraki_f = codecs.open(
             "/home/cr/Python/westeros/dothraki.txt", encoding='utf-8')
         dothraki_raw = dothraki_f.read()
         print dothraki_raw

     Output:

         Athchomar chomakaan, [zhey] khal vezhven. Azha anhaan asshilat...
         Itte oakah! Jadi, zhey Jora Andahli. Khal vezhven. Ajjalan anha
         zalat vitiherat yer hatif. Kash qoy qoyi thira disse. Hash shafka
         zali addrivat mae, zhey Khaleesi? Ishish chare ...
  6. Text processing: Cleaning

         import re

         # Strip punctuation, including U+2014, U+2019, U+2026 and square brackets
         punct_re = re.compile(ur'[.,;:?!\u2014\u2019\u2026\[\]]', re.UNICODE)
         dothraki_proc = punct_re.sub('', dothraki_raw)
         dothraki_proc = dothraki_proc.lower()
         print dothraki_proc

     Output:

         athchomar chomakaan zhey khal vezhven azha anhaan asshilat itte oakah
         jadi zhey jora andahli khal vezhven ajjalan anha zalat vitiherat yer
         hatif kash qoy qoyi thira disse ...
  7. Text processing: Tokenizing

         # Split on runs of whitespace; the set of tokens gives the word types
         dothraki_tokens = re.split(ur'\s+', dothraki_proc)
         dothraki_types = set(dothraki_tokens)
         print dothraki_types

     Output:

         set([u'izzi', u'ale', u'morea', u'vesazhao', u'yeri', u'ishish',
         u'dalen', u'vesazhae', u'yera', u'afisi', u'rhae', u'mawizzi',
         u'vee', u'arrisse', u'ti', u'ven', u'rizh', u'afichak', u'gache',
         u'zigerek', u'zigereo', u'drivoe', u'maaz', u'zigeree', u'ayyeyoon',
         u'maan', u'mahrazhi', u'ma', u'vos', u'movekkhi', u'mahrazhis',
         u'meshafka', u'qisi', u'sani', u'ville', u'vikeesi', u'ifak',
         u'javrathi', u'zisa', u'chek', u'nem', ... ])
  8. Inspecting the lexical distribution in a text

         from nltk import FreqDist

         # Count how often each token occurs
         dothraki_freqdist = FreqDist(dothraki_tokens)
         print dothraki_freqdist

     Output:

         <FreqDist: u'anha': 50, u'vos': 40, u'me': 39, u'ma': 38,
         u'zhey': 29, u'mae': 27, u'anni': 26, u'hash': 23, u'yer': 23,
         u'khal': 16, u'khaleesi': 16, u'mori': 15, u'jin': 13, u'kisha': 12,
         u'nem': 11, u'vo': 11, u'che': 10, u'jini': 10, u'she': 10, ... >

     Plotting the cumulative counts of the 20 most frequent tokens:

         dothraki_freqdist.plot(20, cumulative=True)
  9. Cumulative frequency distribution (CFD) of Dothraki words
     Top 10: anha, vos, me, ma, zhey, mae, anni, hash, yer, khal
  10. Valyrian vocabulary distribution
      Astapori Valyrian (Top 10): ji, me, do, espo, si, mysa, eji, ez, ivetrá, sa
      High Valyrian (Top 10): daor, se, issa, syt, ziry, hen, jemēle, lue, yne, avy
  11. Feature 1: Consonant proportion

          def c_prop(word):
              # Count every consonant letter in the word
              c_num = 0
              for letter in u'bcdfgjklmnpqrstvxz\u00f1':
                  c_num += word.count(letter)
              # float() avoids Python 2 integer division
              return float(c_num) / len(word)

          c_prop(u'z\u016bgusy')
          0.5
  12. Word-internal consonant proportions across languages
  13. Feature 2: Obstruent proportion

          def obstruent_prop(word):
              # Count every obstruent letter in the word
              obstruent_num = 0
              for letter in u'bcdfgjkpqstvxz':
                  obstruent_num += word.count(letter)
              # float() avoids Python 2 integer division
              return float(obstruent_num) / len(word)

          obstruent_prop(u'\u012blvi')
          0.25
  14. Word-internal obstruent proportions across languages
  15. Feature 3: Coda presence

          def c_coda(word):
              # 1 if the word ends in a consonant, else 0
              if word[-1] in u'bcdfgjklmnpqrstvxz\u00f1':
                  return 1
              else:
                  return 0

          def obstruent_coda(word):
              # 1 if the word ends in an obstruent, else 0
              if word[-1] in u'bcdfgjkpqstvxz':
                  return 1
              else:
                  return 0

          c_coda(u'lysoon')
          1
          obstruent_coda(u'lysoon')
          0
  16. Mean coda consonant presence across languages
  17. Mean coda obstruent presence across languages
  18. Feature 4: Consonant clusters

          regex = ur'[bcdfghjklmnpqrstvxz\u00f1][bcdfghjklmnpqrstvxz\u00f1]+'

          def c_cluster(word):
              # Each match is a run of two or more consonants
              cc_set = re.findall(regex, word, re.UNICODE)
              return len(cc_set)

          c_cluster(u'avvirsosh')
          3
  19. Mean consonant cluster frequency across languages
  20. Feature 5: Obstruent clusters

          regex1 = ur'[bcdfghjkpqstvxz][bcdfghjkpqstvxz]+'

          def obs_cluster(word):
              # Each match is a run of two or more obstruents
              oo_set = re.findall(regex1, word, re.UNICODE)
              return len(oo_set)

          obs_cluster(u'avvirsosh')
          2
  21. Mean obstruent cluster frequency across languages
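  22. Feature 6: Vowel clusters

          regex2 = ur'[bcdfghjklmnpqrstvxz\u00f1]+'

          def v_cluster(word):
              # Split on consonant runs; what remains are the vowel runs.
              # flags= is required: the third positional argument of
              # re.split is maxsplit, not flags.
              v_set = re.split(regex2, word, flags=re.UNICODE)
              vv_set = [v for v in v_set if len(v) > 1]
              return len(vv_set)

          v_cluster(u'haeshi')
          1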
  23. Mean vowel cluster frequency across languages
  24. Data from real languages
  25. TDIL Assamese Corpus
  26. TDIL Assamese Corpus
  27. Assamese corpus files

          import os

          directory = "/home/cr/Documents/NLPwP_pres/TDIL_assamese_corpus_data"
          os.listdir(directory)

      Output:

          ['subj_art2.txt', 'subj_politics1.txt', 'lit3.txt', 'drama.txt', 'religion2.txt', 'criticism2.txt', 'criticism1.txt', 'subj_science3.txt', 'ref_encyclopaedia-entry.txt', 'subj_science2.txt', 'subj_social-studies.txt', 'music.txt', 'subj_art1.txt
          'subj_science1.txt', 'subj_art3.txt', 'news.txt', 'subj_sociology.txt', 'criticism3.txt', 'lit8.txt', 'subj_history1.txt', 'lit4.txt', 'lit6.txt', 'religion
          'subj_law.txt', 'lit7.txt', 'religion1.txt', 'criticis
          'lit5.txt', 'subj_math.txt', 'lit1.txt', 'subj_science
          'subj_science_5.txt', 'subj_history2.txt', 'lit2.txt', 'subj_science4.txt', 'letter.txt']
  28. Assamese sample: 'lit5.txt'
  29. Frequency of the sound /x/ in 'lit5.txt'

      All three letters together:

          len(re.findall(ur'[\u09b6\u09b7\u09b8]', assamese_sample_raw, re.UNICODE))
          1313

      Each letter separately:

          len(re.findall(ur'\u09b6', assamese_sample_raw, re.UNICODE))
          298
          len(re.findall(ur'\u09b7', assamese_sample_raw, re.UNICODE))
          195
          len(re.findall(ur'\u09b8', assamese_sample_raw, re.UNICODE))
          820
  30. Positional restrictions

      Beginning a word:

          len(re.findall(ur'\b[\u09b6\u09b7\u09b8]', assamese_sample_raw, re.UNICODE))
          1129

      Ending a word:

          len(re.findall(ur'[\u09b6\u09b7\u09b8]\b', assamese_sample_raw, re.UNICODE))
          895
  31. Positional restrictions

      Following /a/:

          len(re.findall(ur'\u09be[\u09b6\u09b7\u09b8]', assamese_sample_raw, re.UNICODE))
          57

      Following /i/:

          len(re.findall(ur'[\u09bf\u09c0][\u09b6\u09b7\u09b8]', assamese_sample_raw, re.UNICODE))
          70

      Following /u/:

          len(re.findall(ur'[\u09c1\u09c2][\u09b6\u09b7\u09b8]', assamese_sample_raw, re.UNICODE))
          10
  32. Further work
      • Incorporate segmental parameters into the classifier (fix Unicode issues with NLTK's classify module)
      • Use the classifier to predict assignment of random words from Westeros to the Dothraki, Astapori Valyrian, and High Valyrian languages
      • Isolate the most important word-internal parameters in the classification model (log-likelihood ranking in a Naive Bayes model)
      • Use a full distributional account of select Assamese consonants as priors in an acoustic classification model
  33. Thank you
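
The word-level features from slides 11 to 22 could be bundled into a single feature dictionary per word, which is roughly what the first "Further work" item would need before a classifier can be trained. The block below is a minimal sketch of that idea, not the author's code. It is written for Python 3 (the slides use Python 2), and the names `proportion` and `word_features` are hypothetical helpers introduced here. The letter classes are copied from the slides, including the fact that the proportion and coda features leave out "h" while the cluster patterns include it.

```python
import re

# Letter classes as they appear on the slides.
CONSONANTS = "bcdfgjklmnpqrstvxz\u00f1"          # features 1 and 3
OBSTRUENTS = "bcdfgjkpqstvxz"                    # features 2 and 3
CLUSTER_CONSONANTS = "bcdfghjklmnpqrstvxz\u00f1"  # features 4 and 6
CLUSTER_OBSTRUENTS = "bcdfghjkpqstvxz"           # feature 5

# A cluster is a run of two or more letters from the class.
CC_RE = re.compile("[%s]{2,}" % CLUSTER_CONSONANTS)
OO_RE = re.compile("[%s]{2,}" % CLUSTER_OBSTRUENTS)
# Any consonant run; splitting on it leaves the vowel runs.
C_RUN_RE = re.compile("[%s]+" % CLUSTER_CONSONANTS)


def proportion(word, letters):
    """Share of the characters in `word` that belong to `letters`."""
    return sum(word.count(ch) for ch in letters) / len(word)


def word_features(word):
    """Hypothetical helper: collect the slide features for one word.

    Assumes `word` is a non-empty, lower-cased string.
    """
    vowel_runs = C_RUN_RE.split(word)
    return {
        "c_prop": proportion(word, CONSONANTS),
        "obstruent_prop": proportion(word, OBSTRUENTS),
        "c_coda": int(word[-1] in CONSONANTS),
        "obstruent_coda": int(word[-1] in OBSTRUENTS),
        "c_clusters": len(CC_RE.findall(word)),
        "obstruent_clusters": len(OO_RE.findall(word)),
        "v_clusters": sum(1 for run in vowel_runs if len(run) > 1),
    }
```

Calling `word_features("avvirsosh")` reproduces the cluster counts shown on slides 18 and 20 (3 consonant clusters, 2 obstruent clusters). Empty tokens, which whitespace splitting can produce at the edges of a text, would need to be filtered out first because the proportions divide by the word length.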