KiwiPyCon 2014 talk - Understanding human language with Python

Introduction to Natural Language Processing:
- Fiction vs Reality
- Complexities of NLP
- NLP with Python: NLTK, Gensim, TextBlob
(stopword removal, part-of-speech tagging, TF-IDF, text categorization, sentiment analysis)
- What's next

Published in: Software

  1. 1. Understanding human language with Python Alyona Medelyan
  2. 2. Who am I? Alyona Medelyan aka @zelandiya ▪ In Natural Language Processing since 2000 ▪ PhD in NLP & Machine Learning from Waikato ▪ Author of the open source keyword extraction algorithm Maui ▪ Author of the most-cited 2009 journal survey “Mining Meaning with Wikipedia” ▪ Past: Chief Research Officer at Pingar ▪ Now: Founder of Entopix, NLP consultancy & software development
  3. 3. Agenda ▪ State of NLP: Recap on fiction vs reality: Are we there yet? ▪ NLP Complexities: Why is understanding language so complex? ▪ NLP using Python: NLTK, Gensim, TextBlob & Co ▪ Building NLP applications: A little bit of data science ▪ Other NLP areas: And what’s coming next
  4. 4. State of NLP Fiction versus Reality
  5. 5. He (KITT) “always had an ego that was easy to bruise and displayed a very sensitive, but kind and dryly humorous personality.” - Wikipedia
  6. 6. Android Auto: “hands-free operation through voice commands will be emphasized to ensure safe driving”
  7. 7. “by putting this into one's ear one can instantly understand anything said in any language” (Hitchhiker Wiki)
  8. 8. Word Lens: “augmented reality translation”
  9. 9. Two girls use Google Translate to call a real Indian restaurant and order in Hindi… How did it go? www.youtube.com/watch?v=wxDRburxwz8
  10. 10. The LCARS (or simply library computer) … used sophisticated artificial intelligence routines to understand and execute vocal natural language commands (From Memory Alpha Wiki)
  11. 11. Let’s try out Google
  12. 12. “Samantha [the OS] proves to be constantly available, always curious and interested, supportive and undemanding”
  13. 13. Siri doesn’t seem to be as “available”
  14. 14. NLP Complexities Why is understanding language so complex?
  15. 15. Word segmentation complexities ▪ 广大发展中国家一致支持这个目标,并提出了各自的期望细节。 ▪ 广大发展中国家一致支持这个目标,并提出了各自的期望细节。 ▪ The first hot dogs were sold by Charles Feltman on Coney Island in 1870. ▪ The first hot dogs were sold by Charles Feltman on Coney Island in 1870.
  16. 16. Disambiguation complexities Flying planes can be dangerous
  17. 17. NLP using Python NLTK, Gensim, TextBlob & Co
  18. 18. What can we do with text? [Diagram: text in; out come sentiment, keywords, tags, genre, categories, taxonomy terms, entities, names, patterns, biochemical entities, …]
  19. 19. NLTK Python platform for NLP
  20. 20. How to get to the core words? Remove Stopwords with NLTK
      even the acting in transcendence is solid , with the dreamy depp turning in a typically strong performance
      i think that transcendence has a pretty solid acting, with the dreamy depp turning in a strong performance as he usually does
      >>> from nltk.corpus import stopwords
      >>> stop = stopwords.words('english')
      >>> words = ['the', 'acting', 'in', 'transcendence', 'is', 'solid', 'with', 'the', 'dreamy', 'depp']
      >>> print([word for word in words if word not in stop])
      ['acting', 'transcendence', 'solid', 'dreamy', 'depp']
  21. 21. Getting closer to the meaning: Part of Speech tagging with NLTK
      Flying planes can be dangerous ✓
      >>> import nltk
      >>> from nltk.tokenize import word_tokenize
      >>> nltk.pos_tag(word_tokenize("Flying planes can be dangerous"))
      [('Flying', 'VBG'), ('planes', 'NNS'), ('can', 'MD'), ('be', 'VB'), ('dangerous', 'JJ')]
  22. 22. Keyword scoring: TFxIDF ▪ TF: the relative frequency of a term t in a document d ▪ IDF: the inverse proportion of documents d in collection D that mention term t
  23. 23. TFxIDF with Gensim
      from nltk.corpus import movie_reviews
      from gensim import corpora, models
      # Collect each review as a list of word tokens
      texts = []
      for fileid in movie_reviews.fileids():
          texts.append(list(movie_reviews.words(fileid)))
      # Map tokens to integer ids, then convert each review to a bag of words
      dictionary = corpora.Dictionary(texts)
      corpus = [dictionary.doc2bow(text) for text in texts]
      tfidf = models.TfidfModel(corpus)
  24. 24. TFxIDF with Gensim (Results)
      for word in ['film', 'movie', 'comedy', 'violence', 'jolie']:
          my_id = dictionary.token2id.get(word)
          print(word, '\t', tfidf.idfs[my_id])
      film 0.190174003903
      movie 0.364013496254
      comedy 1.98564470702
      violence 3.2108967825
      jolie 6.96578428466
  25. 25. Where does this text belong? Text Categorization with NLTK
      TVNZ: “Obama and Hangover star trade insults in interview” (Entertainment or Politics?)
      >>> train_set = [(document_features(d), c) for (d,c) in categorized_documents]
      >>> classifier = nltk.NaiveBayesClassifier.train(train_set)
      >>> doc_features = document_features(new_document)
      >>> category = classifier.classify(doc_features)
  26. 26. Sentiment analysis with TextBlob
      >>> from textblob import TextBlob
      >>> blob = TextBlob("I love this library")
      >>> blob.sentiment
      Sentiment(polarity=0.5, subjectivity=0.6)
      for review in transcendence:
          blob = TextBlob(open(review).read())
          print(review, blob.sentiment.polarity)
      ../data/transcendence_1star.txt 0.0170799124247
      ../data/transcendence_5star.txt 0.0874591503268
      ../data/transcendence_8star.txt 0.256845238095
      ../data/transcendence_10star.txt 0.304310344828
  27. 27. Building NLP applications A little bit of data science
  28. 28. Keyword extraction in 3h: Understanding a movie review …four of the biggest directors in hollywood : quentin tarantino , robert rodriguez , … were all directing one big film with a big and popular cast ...the second room ( jennifer beals ) was better , but lacking in plot ... the bumbling and mumbling bellboy … ruins every joke in the film … Keywords: bellboy jennifer beals four rooms beals rooms tarantino madonna antonio banderas valeria golino github.com/zelandiya/KiwiPyCon-NLP-tutorial
  29. 29. Keyword extraction on 2000 movie reviews: What makes a successful movie?
      Negative: van damme zeta – jones smith batman de palma eddie murphy killer tommy lee jones wild west mars murphy ship space brothers de bont ...
      Positive: star wars disney war de niro jackie alien jackie chan private ryan truman show ben stiller cameron science fiction cameron diaz fiction jack ...
  30. 30. How can NLP help a beer drinker? Sweaty Horse Blanket: Processing the Natural Language of Beer by Ben Fields vimeo.com/96809735
  31. 31. Other NLP areas What’s coming next?
  32. 32. Filling the gaps in machine understanding … Jack Ruby, who killed J.F.Kennedy's assassin Lee Harvey Oswald. … /m/0d3k14 /m/044sb /m/0d3k14 Freebase
  33. 33. What’s next? Vs.
  34. 34. Conclusions: Understanding human language with Python. Are we there yet? ▪ NLTK: nltk.org ▪ radimrehurek.com/gensim ▪ scikit-learn.org/stable ▪ deeplearning.net/software/theano ▪ textblob.readthedocs.org ▪ @zelandiya #nlproc
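
The TFxIDF definition on slide 22 can be worked through by hand on a toy collection. The following is a minimal sketch, not Gensim's implementation: the function names and the four tiny documents are made up for illustration, and the base-2 logarithm is an assumption chosen to resemble the magnitudes shown on slide 24.

```python
import math

def tf(term, doc):
    """Relative frequency of `term` in `doc` (a list of tokens)."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log2 of (number of documents / number of documents mentioning `term`)."""
    containing = sum(1 for doc in docs if term in doc)
    if containing == 0:
        return 0.0  # term never seen: no evidence either way (a simplification)
    return math.log2(len(docs) / containing)

def tf_idf(term, doc, docs):
    """Score of `term` for one document within the collection `docs`."""
    return tf(term, doc) * idf(term, docs)

# Hypothetical toy collection, loosely modelled on movie-review vocabulary
docs = [
    ["the", "film", "is", "a", "comedy"],
    ["the", "film", "shows", "violence"],
    ["the", "film", "stars", "jolie"],
    ["a", "comedy", "film"],
]

for word in ["film", "comedy", "jolie"]:
    print(word, idf(word, docs))
```

As on slide 24, a word that appears in every document ("film") gets the lowest score, and a word confined to a single document ("jolie") gets the highest.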

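Slide 25 calls a `document_features` helper that the transcript never defines. One plausible shape for it is a bag-of-words presence check, sketched below under stated assumptions: the `VOCABULARY` list, the whitespace tokenization, and the two training sentences are hypothetical and do not come from the talk.

```python
# Hypothetical vocabulary; a real application would derive it from the training corpus
VOCABULARY = ["president", "election", "actor", "film"]

def document_features(document):
    """Map a raw text to a dict of boolean 'contains(word)' features."""
    words = set(w.lower() for w in document.split())
    return {"contains(%s)" % w: (w in words) for w in VOCABULARY}

# Hypothetical labelled examples in the (document, category) form used on slide 25
categorized_documents = [
    ("The president gave a speech on the election", "Politics"),
    ("The actor stars in a new comedy film", "Entertainment"),
]

train_set = [(document_features(d), c) for (d, c) in categorized_documents]
print(train_set[0])
```

Each element of `train_set` is a (feature dict, label) pair, which is the input format that `nltk.NaiveBayesClassifier.train` on slide 25 expects.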