Language Sleuthing HOWTO
                 or
   Discovering Interesting Things
           with Python's
     Natural Language Tool Kit


                           Brianna Laugher
                                modernthings.org
                         brianna[@.]laugher.id.au
Corpus linguistics on web texts




          why?
Because the web is full of language data

Because linguistic techniques can reveal unexpected insights

Because I don't want to have to read everything
Like... mailing lists
luv-main as a corpus



√ Big collection of text
x Messy data
x Not annotated
what's interesting?

   conversations

      topics

 change over time

     (authors)
Step 1:




get the data
wget vs Python script


√ wget is purpose-built

√ convenient options like
   --convert-links
Meaningful URLs FTW


              Sympa/MhonArc:


lists.luv.asn.au/wws/arc/luv-main/
    2009-04/
        msg00057.html
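Something like this would mirror the archive (a sketch: only --convert-links is from the slide; --recursive, --no-parent and --accept are my assumptions):

wget --recursive --no-parent --convert-links \
     --accept 'msg*.html' \
     http://lists.luv.asn.au/wws/arc/luv-main/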
Step 2:




clean the data
Cleaning for what?

Remove archive boilerplate

      Remove HTML

   Remove quoted text?

   Remove signatures?
[screenshots: archived messages as rendered, from J.W. and W.E.]

Behind the scenes

[screenshot: the same messages' underlying archive HTML]
what are we aiming for?




what do NLTK corpora look like?
Getting NLTK


sudo apt-get install python-nltk
         in Ubuntu 10.04
                 or
sudo apt-get install python-pip
         pip install nltk
                 or
  from source at nltk.org/download
Getting NLTK data...




    an “NLTKism”
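The "NLTKism" is NLTK's built-in data downloader: corpora and models are fetched separately from the library itself. A minimal sketch (the downloader is real NLTK API; which packages you need is up to you):

>>> import nltk
>>> nltk.download()              # opens the interactive downloader
>>> nltk.download('stopwords')   # or grab one package directly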
NLTK corpora types
Brown corpus
A CategorizedTagged corpus:

   Dear/jj Sirs/nns :/: Let/vb me/ppo begin/vb by/in
clearing/vbg up/in any/dti possible/jj
misconception/nn in/in your/pp$ minds/nns ,/,
wherever/wrb you/ppss are/ber ./.
The/at collective/nn by/in which/wdt I/ppss
address/vb you/ppo in/in the/at title/nn above/rb
is/bez neither/cc patronizing/vbg nor/cc jocose/jj
but/cc an/at exact/jj industrial/jj term/nn in/in
use/nn among/in professional/jj thieves/nns ./.
Inaugural corpus
A Plaintext corpus:

My fellow citizens:

I stand here today humbled by the task before us,
grateful for the trust you have bestowed, mindful
of the sacrifices borne by our ancestors. I thank
President Bush for his service to our nation, as
well as the generosity and cooperation he has
shown throughout this transition.

Forty-four Americans have now taken the
presidential oath. ...............
But we still have lots of HTML...
BeautifulSoup to the rescue



>>> from BeautifulSoup import BeautifulSoup as BS
>>> data = open(filename, 'r').read()
>>> soup = BS(data)
>>> print '\n'.join(soup.findAll(text=True))
notice the blockquote!
What about blockquotes?

>>> bqs = soup.findAll('blockquote')
>>> [bq.extract() for bq in bqs]
>>> print '\n'.join(soup.findAll(text=True))

On 05/08/2007, at 12:05 PM, [...] wrote:
If u want it USB bootable, just burn the DSL boot disk to CD and fire it
up.  Then from the desktop after boot, right click and create the
bootable USB key yourself.  I havent actually done this myself (only
seen the option from the menu), but I am assuming it will be a fairly painless
process if you are happy with the stock image.  Would be interested in
how you go as I have to build 50 USB bootable DSL's in the next couple weeks.
Regards,
[...]
Step 3:




analyse the data
Getting it into NLTK



import nltk
path = 'path/to/files'
corpus = nltk.corpus.PlaintextCorpusReader(path, '.*\.html')
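A quick sanity check that the reader picked up the files (illustrative, not from the talk):

>>> len(corpus.fileids())   # one entry per archived message
>>> corpus.words()[:10]     # the corpus as a (lazily loaded) list of tokens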
What about our metadata?
Create a Python dictionary that maps filenames to
categories
e.g.
categories = {}
categories['2008-12/msg00226.html'] = ['year-2008',
                                       'month-12',
                                       'author-BM<bm@xxxxx>']
# ...etc
then...
import nltk
path = 'path/to/files/'
corpus = nltk.corpus.CategorizedPlaintextCorpusReader(path,
                    '.*\.html', cat_map=categories)
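The year- and month- categories fall straight out of the archive layout, so cat_map can be built mechanically from the fileids of the plain reader shown earlier; authors would have to be parsed out of each message. A sketch of the filename half (the author step is only gestured at, since the slides don't show it):

categories = {}
for fileid in corpus.fileids():        # e.g. '2008-12/msg00226.html'
    year, month = fileid.split('/')[0].split('-')
    categories[fileid] = ['year-' + year, 'month-' + month]
    # plus an 'author-...' category parsed from the message's From: header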
Simple categories


cats = corpus.categories()
authorcats=[c for c in cats if c.startswith('author')]
#>>> len(authorcats)
#608
yearcats=[c for c in cats if c.startswith('year')]
monthcats=[c for c in cats if c.startswith('month')]
...who are the top posters?
posts = [(len(corpus.fileids(author)), author)
         for author in authorcats]
posts.sort(reverse=True)

for count, author in posts[:10]:
    print "%5d\t%s" % (count, author)

→

 1304    author-JW
 1294    author-RC
 1243    author-CS
 1030    author-JH
  868    author-DP
  752    author-TWB
  608    author-CS#2
  556    author-TL
  452    author-BM
  412    author-RM
(email me if you're curious to know if you're on it...)
Frequency distributions
popular = ['ubuntu', 'debian', 'fedora', 'arch']
niche = ['gentoo', 'suse', 'centos', 'redhat']

def getcfd(distros, limit):
    # condition = distro name; sample = fileid prefix,
    # so fileid[:4] gives the year, fileid[:7] the year-month
    cfd = nltk.ConditionalFreqDist(
        (distro, fileid[:limit])
        for fileid in corpus.fileids()
        for w in corpus.words(fileid)
        for distro in distros
        if w.lower().startswith(distro))
    return cfd

popularcfd = getcfd(popular, 4)  # or 7 for months
popularcfd.plot()                # plotting needs matplotlib
nichecfd = getcfd(niche, 4)
nichecfd.plot()
                       another “NLTKism”
'Popular' distros by month [plot]
'Popular' distros by year [plot]
'Niche' distros by year [plot]
Random text generation
import random

words = [w.lower() for w in corpus.words()]
bigrams = nltk.bigrams(words)
cfd = nltk.ConditionalFreqDist(bigrams)

def generate_model(cfdist, word, num=15):
    for i in range(num):
        print word,
        # choose uniformly among observed successors (not frequency-weighted)
        word = random.choice(list(cfdist[word]))

generate_model(cfd, 'hi', num=20)
hi...
hi allan : ages since apparently yum erased . attempts
now venturing into config run ip 10 431 ms 57

hi serg it illegal address entries must *, t close relative info
many families continue fi into modem and reinstalled

hi wen and amended :) imageshack does for grade service
please blame . warning issued an overall environment
consists in

hi folks i accidentally due cause excitingly stupid idiots ,
deletion flag on adding option ? branded ) mounting them

hi guys do composite required </ emulator in for
unattended has info to catalyse a dbus will see atz init3
hi from Peter...
text = [w.lower() for w in corpus.words(categories=
          [c for c in authorcats if 'PeterL' in c])]


hi everyone , hence the database schema and that run on memberdb on mail
store is 12 . yep ,

hi anita , your favourite piece of cpu cycles , he was thinking i hear the middle
of failure .

hi anita , same vhost b internal ip / nine seem odd occasion i hazard . 25ghz
g4 ibook here

hi everyone , same ) on removes a "-- nicelevel nn " as intended . 00 . main
host basis

hi cameron , no biggie . candidates in to upgrade . ubuntu dom0 install if there
! now ). txt

hi cameron , attribution for 30 seconds , and runs out on linux to on www .
luv , these
interesting collocations
                              ...or not
text = [w.lower() for w in corpus.words() if w.isalpha()]
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(text)

finder.apply_freq_filter(3)
finder.nbest(bigram_measures.pmi, 10)
→
bufnewfile bufread
busmaster speccycle
cellx celly
cheswick bellovin
cread clocal
curtail atl
dmcrs rscem
dmmrbc dmost
dmost dmcrs
...
oblig tag cloud


stopwords = nltk.corpus.stopwords.words('english')
words = [w.lower() for w in corpus.words() if w.isalpha()]
words = [w for w in words if w not in stopwords]
word_fd = nltk.FreqDist(words)
wordmax = word_fd[word_fd.max()]   # count of the most frequent word
wordmin = 1000  # YMMV
taglist = word_fd.items()          # (word, count) pairs
ranges = getRanges(wordmin, wordmax)      # helper, not shown on the slide
writeCloud(taglist, ranges, 'tags.html')  # helper, not shown on the slide
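The two helpers aren't shown in the slides; hypothetical stand-ins might look like this (bucketing raw counts into font-size bands is my assumption):

def getRanges(wordmin, wordmax, nbands=6):
    # split [wordmin, wordmax] into equal-width count bands, one per font size
    step = max(1, (wordmax - wordmin) // nbands)
    ranges = [(wordmin + i * step, wordmin + (i + 1) * step - 1)
              for i in range(nbands)]
    ranges[-1] = (ranges[-1][0], wordmax)   # top band absorbs the remainder
    return ranges

def writeCloud(taglist, ranges, filename):
    # emit minimal HTML: bigger counts get bigger fonts
    sizes = ['80%', '100%', '130%', '170%', '220%', '280%']
    out = open(filename, 'w')
    out.write('<html><body>\n')
    for word, count in sorted(taglist):
        # counts below wordmin match no band and are simply skipped
        for (lo, hi), size in zip(ranges, sizes):
            if lo <= count <= hi:
                out.write('<span style="font-size:%s">%s</span>\n'
                          % (size, word))
                break
    out.write('</body></html>\n')
    out.close()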
another one for Peter :)
cats = [c for c in corpus.categories() if 'PeterL' in c]
words = [w.lower() for w in corpus.words(categories=cats)
         if w.isalpha()]
wordmin = 10
  →
thanks!
for more corpus fun:
http://www.nltk.org/
The Book:
'Natural Language Processing with Python',
2nd ed. pub. Jan 2010


These slides are © Brianna Laugher and are released under
the Creative Commons Attribution ShareAlike license,
v3.0 unported. The data set is not free, sadly...
