Text Analysis with Python
Vijay Ramachandran
Fools Rush In?
"The fact is, that to do anything in the world worth doing, we must not stand back shivering and thinking of the cold and danger, but jump in and scramble through as well as we can." - Robert Cushing
Motivation
• Machine Learning everywhere
• Users' expectations of a "standard experience"
• Many resources!
Text Mining
• Extract high-quality information from text
• Typically, trends and patterns are analysed using statistical methods – Machine Learning
• Common tasks – entity recognition, sentiment analysis, categorization, clustering
Why Python?
• Short, concise text processing
• NLTK
• SciPy, NumPy, scikit-learn
• Integration with other languages!
Because when you start your company, YOU get to decide!
Pre-processing
• Lower-casing, stripping extra characters – e.g. "Realyyyyyyyyyy!!!!!" (sketch at the end of this slide)
• Tokenisation (sentences, words)
>>> pktst = nltk.data.load('tokenizers/punkt/english.pickle')
>>> sentences = pktst.tokenize(tweet)
>>> words = nltk.word_tokenize(sentences[0])
• Handling Entities
>>> re.sub(r'(^| )@[^ ]+','',tweet).strip()
• Removing stopwords
>>> stopwords = set(["a", "an", "the", "by"])
>>> ' '.join([w for w in words if w not in stopwords])
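The first bullet above has no snippet in the deck; a minimal sketch of the character-squeezing step (the regex and the two-character cap are my own choice, not from the slides):
>>> import re
>>> re.sub(r'(.)\1{2,}', r'\1\1', "Realyyyyyyyyyy!!!!!".lower())
'realyy!!'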
Leveraging a Corpus
• Simple techniques to analyze a domain
• Term frequency to find important entities
  • "low light photography", "travel photography"
• tf-idf to find representative terms across domains (sketch at the end of this slide)
  • "gaming" for TVs
• GND (Normalized Google Distance) to find aliases
  • e.g., "e700" and "Samsung e700" vs "e310" and "Samsung e310"
• Yahoo BOSS is great!
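A toy tf-idf sketch in plain Python (the domain names and the token lists camera_tokens / tv_tokens are hypothetical placeholders):
>>> import math
>>> from collections import Counter
>>> docs = {'cameras': camera_tokens, 'tvs': tv_tokens}   # one token list per domain
>>> tf = {d: Counter(toks) for d, toks in docs.items()}
>>> def tfidf(term, domain):
...     df = sum(1 for d in docs if tf[d][term] > 0)      # domains containing the term
...     return tf[domain][term] * math.log(len(docs) / df) if df else 0.0
...
>>> tfidf('gaming', 'tvs')   # high when 'gaming' is frequent in TV text but rare elsewhere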
Supervised Classification
Bayes Theorem
 Conditional Probability
 Bayesian classifiers - given features, find
Probability of Class
PC∣Fi ,...,Fn=PC∏i
PFi∣C
Features
• Characteristics of the object to be classified
• For text, typically unigrams, n-grams, POS, CHUNK, presence in a gazetteer
• Can be numerical or boolean
• Crucial to the performance of the classifier!
• In NLTK, a dict of feature name to value (sketch below)
>>> {"word1": 2, "word2": 1, "word1-word2": 1,
...  "word2-word3": 3, "word1-in-monuments": False}
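A sketch of a function that builds such a dict (the name tweet_features and the exact feature set are my own; the deck's generate_features on the next slide presumably does something similar):
>>> import nltk
>>> from collections import Counter
>>> def tweet_features(words, gazetteer=frozenset()):
...     feats = dict(Counter(words))                                     # unigram counts
...     feats.update(Counter('-'.join(b) for b in nltk.bigrams(words)))  # bigram counts
...     feats.update({w + '-in-gazetteer': w in gazetteer for w in words})
...     return feats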
Training
• "Teach" the classifier how to classify, using training data
>>> import random
>>> from nltk import NaiveBayesClassifier as nbc
>>> trg = generate_features(training_samples)
>>> random.shuffle(trg)
>>> cut = int(0.9 * len(trg))
>>> train, test = trg[:cut], trg[cut:]
>>> clf = nbc.train(train)
>>> nltk.classify.accuracy(clf, test)
Training, part 2
• Measure, tune, iterate
• Cross-validation (sketch below)
• Aim is to find a balance between precision and recall
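A minimal k-fold sketch over the trg list from the previous slide (k and the striped fold split are my own choices):
>>> k = 5
>>> folds = [trg[i::k] for i in range(k)]
>>> accs = []
>>> for i in range(k):
...     heldout = folds[i]
...     train = [s for j, f in enumerate(folds) if j != i for s in f]
...     accs.append(nltk.classify.accuracy(nbc.train(train), heldout))
...
>>> sum(accs) / k   # mean cross-validated accuracy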
A Gender Classifier
import random
import nltk
from nltk.corpus import names

def gender_features(word):
    return {'last_letter': word[-1]}

labeled = ([(name, 'male') for name in names.words('male.txt')] +
           [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled)   # the corpus lists all male names first
featuresets = [(gender_features(n), g) for (n, g) in labeled]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)
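Once trained, the classifier can label unseen names and explain itself (exact output varies with the shuffle):
>>> classifier.classify(gender_features('Sheena'))   # most likely 'female'
>>> classifier.show_most_informative_features(5)     # e.g. last_letter = 'a' leans female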
Advice on Training
• It's TEDIOUS! Cut the cognitive load
e.g.: "I love this camera!"
BAD:
0 I      PRP B-NP -
1 love   VBP B-VP -
2 this   DT  B-NP -
3 camera NN  I-NP 1
GOOD: (I, love) – NO, (love, this camera) – YES
• Dealing with ambiguity
• Mechanical Turk, jsonwidget
Other Classifiers
• Maximum Entropy
• Support Vector Machines
• Conditional Random Fields
All follow the same workflow!
More Examples
• Find questions in tweets
  • unigrams, bigrams, trigrams, parse distance (sketch at the end of this slide)
• Recognize "contextual questions" in discussions
  • "reco" words, "thanks" words
• "Use-type" recognizer
  • POS, CHUNK, special words, verbs within 3 words of phrase
  • e.g., "I want to compose the perfect landscape shot"
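A sketch of the n-gram features for the tweet-question task (boolean features via NLTK's everygrams are my choice; parse distance is omitted):
>>> import nltk
>>> from nltk.util import everygrams
>>> def question_features(words):
...     # every 1-, 2- and 3-gram of the tweet as a boolean feature
...     return {' '.join(ng): True for ng in everygrams(words, max_len=3)}
...
>>> question_features(nltk.word_tokenize("which lens should I buy?"))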
Eats, shoots and leaves
• Common basic step!
• POS, chunk, parse tree
• Hairy theory
  • FSA, Morphology, Phonology
  • n-grams, probabilistic models, CFGs
Don't Worry, Be Happy!
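In practice NLTK covers the basic step in a couple of calls (ne_chunk yields named-entity chunks, a stand-in here for full NP chunking):
>>> import nltk
>>> words = nltk.word_tokenize("I love this camera")
>>> tagged = nltk.pos_tag(words)   # [('I', 'PRP'), ('love', 'VBP'), ...]
>>> tree = nltk.ne_chunk(tagged)   # an nltk.Tree of named-entity chunks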
No License Required!
• What to use? NLTK, or something else?
• Docking with the Evil MotherShip
  • Jepp, JPype?
  • ► use RPC
>>> from stanford_corenlp import jsonrpc
>>> from json import loads
>>> server = jsonrpc.ServerProxy(jsonrpc.JsonRpc20(),
...                              jsonrpc.TransportTcpIp(...))
>>> result = loads(server.parse(paragraph))
• POS, parse tree, and more (but no chunks?)
Wordnet®
• "a large lexical database of English"
• synsets, hypernyms, hyponyms, gloss
• synonyms, antonyms
• SentiWordNet
  • Start with candidate "good" and "bad" words
  • Expand by recursively following edges
  • Classify using the definition
    e.g., "good" → "better", "good" → "impressive"
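The corresponding NLTK calls, as a quick sketch:
>>> from nltk.corpus import wordnet as wn
>>> good = wn.synsets('good')[0]
>>> good.definition()                       # the gloss
>>> good.hypernyms(), good.hyponyms()
>>> [l.antonyms() for l in good.lemmas()]   # the antonym edges the expansion follows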
Love or Hate?
"I love the screen, but the battery life is poor"
• Shallow?
  • 3-class classifier (sketch at the end of this slide)
• Or deep?
  • Relationship classifier
  • Extracting candidate subjects
  • Lots of unsolved problems – co-reference, multiple subjects, negation, etc.
• Summary ratings for "Executive Summaries"
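A very rough shallow pass (splitting on 'but' and reusing a clause-level classifier clf are my simplifications; assume clf was trained on positive/negative/neutral samples):
>>> import re, nltk
>>> clauses = re.split(r'\bbut\b', "I love the screen, but the battery life is poor")
>>> [clf.classify({w: True for w in nltk.word_tokenize(c)})
...  for c in clauses]   # ideally ['positive', 'negative']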
Gender, revisited
• Solr – search semi-structured text
• Out-of-the-box text processing utilities – stemming, tokenising
• Highly configurable relevancy
  • fields, weights
• Sorting!
  • "strdist(term, field, edit) desc" for Levenshtein edit distance
Gender, revisited
• Schema
<field name="name" type="string" />
<field name="name_phoneme" type="phonetic" />
• Search: add "sort=strdist(unknown_name, name, edit) desc"
• Python:
male_score = female_score = 0.0
for namerec in results:
    if namerec.gender == 'Male':
        male_score += namerec.match_score
    else:
        female_score += namerec.match_score
• Correctly guesses "Sheena" and "Ashish"!!
• 80% precision / 80% recall (query sketch below)
Miles to go before I sleep!
• Machine Learning on Coursera
Thank You!!
Vijay Ramachandran: vijay750@gmail.com
