Language Sleuthing HOWTO
                 or
   Discovering Interesting Things
           with Python's
     Natural Language Tool Kit


                           Brianna Laugher
                                modernthings.org
                         brianna[@.]laugher.id.au
Corpus linguistics on web
          texts




          why?
Because the web is full of
       language data

 Because linguistic techniques
can reveal unexpected insights

Because I don't want to have to
       read everything
Like... mailing lists
luv-main as a corpus



√ Big collection of text
x Messy data
x Not annotated
what's interesting?

   conversations

      topics

 change over time

     (authors)
Step 1:




get the data
wget vs Python script


√ wget is purpose-built

√ convenient options like
   --convert-links
Meaningful URLs FTW


              Sympa/MhonArc:


lists.luv.asn.au/wws/arc/luv-main/
                                 2009-04/
                                         msg00057.html
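Because the year, month and message number sit right in the path, they can be read straight off the URL. A minimal sketch in Python 3 (the slides themselves use Python 2); the pattern and the name parse_archive_url are illustrative assumptions based only on the example path above:

```python
import re

# Assumed layout, taken from the example path on the slide:
#   .../arc/<list>/<YYYY-MM>/msgNNNNN.html
ARCHIVE_PATH = re.compile(
    r"arc/(?P<list>[^/]+)/(?P<year>\d{4})-(?P<month>\d{2})/msg(?P<num>\d+)\.html$"
)

def parse_archive_url(url):
    """Return list name, year, month and message number, or None if no match."""
    match = ARCHIVE_PATH.search(url)
    if match is None:
        return None
    return {
        "list": match.group("list"),
        "year": int(match.group("year")),
        "month": int(match.group("month")),
        "message": int(match.group("num")),
    }

info = parse_archive_url("lists.luv.asn.au/wws/arc/luv-main/2009-04/msg00057.html")
print(info)
```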
Step 2:




clean the data
Cleaning for what?

Remove archive boilerplate

      Remove HTML

   Remove quoted text?

   Remove signatures?
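For plain-text message bodies, the last two questions could be handled roughly like this. It is only a sketch: it assumes quoted lines start with ">" and that signatures follow a "--" delimiter line, which are common email conventions but not something every post follows. The function name is made up for illustration.

```python
def strip_quotes_and_signature(body):
    """Drop '>'-quoted lines and everything after a '--' signature marker."""
    kept = []
    for line in body.splitlines():
        if line.rstrip() == "--":          # assumed signature delimiter ("-- ")
            break
        if line.lstrip().startswith(">"):  # assumed marker for quoted reply text
            continue
        kept.append(line)
    return "\n".join(kept).strip()

message = (
    "Thanks, that worked.\n"
    "> Did you try rebooting?\n"
    "> It often helps.\n"
    "Cheers\n"
    "-- \n"
    "Sam\n"
    "sam@example.org"
)
cleaned = strip_quotes_and_signature(message)
print(cleaned)
```

Whether to drop quotes at all depends on the question: for "who said what" they are noise, for following a conversation they may be useful.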
[Image; labels: J.W., W.E.]
Behind the scenes
[Image; labels: J.W., W.E.]
what are we aiming for?




what do NLTK corpora look like?
Getting NLTK


sudo apt-get install python-nltk
         in Ubuntu 10.04
                 or
sudo apt-get install python-pip
         pip install nltk
                 or
  from source at nltk.org/download
Getting NLTK data...




    an “NLTKism”
NLTK corpora types
Brown corpus
A CategorizedTagged corpus:

   Dear/jj Sirs/nns :/: Let/vb me/ppo begin/vb by/in
clearing/vbg up/in any/dti possible/jj
misconception/nn in/in your/pp$ minds/nns ,/,
wherever/wrb you/ppss are/ber ./.
The/at collective/nn by/in which/wdt I/ppss
address/vb you/ppo in/in the/at title/nn above/rb
is/bez neither/cc patronizing/vbg nor/cc jocose/jj
but/cc an/at exact/jj industrial/jj term/nn in/in
use/nn among/in professional/jj thieves/nns ./.
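To see what the word/tag format amounts to, here is a small Python 3 sketch that splits such text into (word, tag) pairs. parse_tagged is a hypothetical helper written for this illustration; NLTK's own corpus readers do this for you.

```python
def parse_tagged(text):
    """Split 'word/tag' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the last slash only, so a token like ':/:' still works.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

sample = "Dear/jj Sirs/nns :/: Let/vb me/ppo begin/vb"
pairs = parse_tagged(sample)
print(pairs[:3])
```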
Inaugural corpus
A Plaintext corpus:

My fellow citizens:

I stand here today humbled by the task before us,
grateful for the trust you have bestowed, mindful
of the sacrifices borne by our ancestors. I thank
President Bush for his service to our nation, as
well as the generosity and cooperation he has
shown throughout this transition.

Forty-four Americans have now taken the
presidential oath. ...............
But we still have lots of HTML...
BeautifulSoup to the rescue



>>>   from BeautifulSoup import BeautifulSoup as BS
>>>   data = open(filename,'r').read()
>>>   soup = BS(data)
>>>   print '\n'.join(soup.findAll(text=True))
notice the blockquote!
What about blockquotes?

>>> bqs = soup.findAll('blockquote')
>>> [bq.extract() for bq in bqs]
>>> print '\n'.join(soup.findAll(text=True))

On 05/08/2007, at 12:05 PM, [...] wrote:
If u want it USB bootable, just burn the DSL boot disk to CD and fire it
up.  Then from the desktop after boot, right click and create the
bootable USB key yourself.  I havent actually done this myself (only
seen the option from the menu), but I am assuming it will be a fairly painless
process if you are happy with the stock image.  Would be interested in
how you go as I have to build 50 USB bootable DSL's in the next couple weeks.
Regards,
[...]
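The same idea (keep the text, skip anything inside a blockquote) can be sketched with only the Python 3 standard library. This swaps BeautifulSoup for html.parser and is not the code used for the talk; the class and function names are invented for the example.

```python
from html.parser import HTMLParser

class TextWithoutQuotes(HTMLParser):
    """Collect text nodes, ignoring anything nested inside <blockquote>."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <blockquote> elements we are inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "blockquote":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "blockquote" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when we are outside every blockquote.
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def text_without_quotes(html):
    parser = TextWithoutQuotes()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)

page = ("<p>On Monday, X wrote:</p>"
        "<blockquote><p>old text</p></blockquote>"
        "<p>My reply.</p>")
reply = text_without_quotes(page)
print(reply)
```

Note that the "On ..., X wrote:" attribution line survives, just as it does in the output above.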
Step 3:




analyse the data
Getting it into NLTK



import nltk
path = 'path/to/files'
corpus = nltk.corpus.PlaintextCorpusReader(path,
                                     r'.*\.html')
What about our metadata?
Create a Python dictionary that maps filenames to
categories
e.g.
categories = {}
categories['2008-12/msg00226.html'] = [
    'year-2008',
    'month-12',
    'author-BM<bm@xxxxx>',
]
....etc
then...
import nltk
path = 'path/to/files/'
corpus = nltk.corpus.CategorizedPlaintextCorpusReader(
    path, r'.*\.html', cat_map=categories)
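The filename-to-categories dictionary could be built along these lines. A Python 3 sketch: it assumes you already have (fileid, author) pairs from the cleaning step, and both the messages list and build_categories are hypothetical names for this example.

```python
def build_categories(messages):
    """Map 'YYYY-MM/msgNNNNN.html' -> ['year-YYYY', 'month-MM', 'author-...']."""
    categories = {}
    for fileid, author in messages:
        # The directory part of the fileid carries the year and month.
        year, month = fileid.split("/")[0].split("-")
        categories[fileid] = ["year-" + year, "month-" + month, "author-" + author]
    return categories

# Hypothetical input: one (fileid, author) pair per archived message.
messages = [("2008-12/msg00226.html", "BM"), ("2009-04/msg00057.html", "JW")]
categories = build_categories(messages)
print(categories["2008-12/msg00226.html"])
```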
Simple categories


cats = corpus.categories()
authorcats=[c for c in cats if c.startswith('author')]
#>>> len(authorcats)
#608
yearcats=[c for c in cats if c.startswith('year')]
monthcats=[c for c in cats if c.startswith('month')]
...who are the top posters?
posts = [(len(corpus.fileids(author)), author) for author in
authorcats]
posts.sort(reverse=True)

for count, author in posts[:10]:
   print "%5d\t%s" % (count, author)

→

 1304    author-JW
 1294    author-RC
 1243    author-CS
 1030    author-JH
  868    author-DP
  752    author-TWB
  608    author-CS#2
  556    author-TL
  452    author-BM
  412    author-RM
(email me if you're curious to know if you're on it...)
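The count itself does not need NLTK: it is just a tally of the author-* labels in the category map. A Python 3 sketch over a toy dictionary (the data below is made up, and top_posters is an illustrative name):

```python
from collections import Counter

def top_posters(categories, n=10):
    """Count how many files carry each 'author-' category; return the top n."""
    counts = Counter(
        cat
        for cats in categories.values()
        for cat in cats
        if cat.startswith("author-")
    )
    return counts.most_common(n)

# Toy category map in the same shape as the one passed to the corpus reader.
toy = {
    "2008-12/msg00001.html": ["year-2008", "month-12", "author-JW"],
    "2008-12/msg00002.html": ["year-2008", "month-12", "author-BM"],
    "2009-01/msg00003.html": ["year-2009", "month-01", "author-JW"],
    "2009-01/msg00004.html": ["year-2009", "month-01", "author-JW"],
    "2009-02/msg00005.html": ["year-2009", "month-02", "author-BM"],
    "2009-02/msg00006.html": ["year-2009", "month-02", "author-RC"],
}
for author, count in top_posters(toy, 2):
    print("%5d\t%s" % (count, author))
```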
Frequency distributions
popular =['ubuntu','debian','fedora','arch']
niche = ['gentoo','suse','centos','redhat']

def getcfd(distros,limit):
  cfd = nltk.ConditionalFreqDist(
     (distro, fileid[:limit])
     for fileid in corpus.fileids()
     for w in corpus.words(fileid)
     for distro in distros
     if w.lower().startswith(distro))
  return cfd

popularcfd = getcfd(popular,4) # or 7 for months
popularcfd.plot()
nichecfd = getcfd(niche,4)
nichecfd.plot()
                       another “NLTKism”
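What getcfd counts can be spelled out in plain Python 3: for each distro name (the condition), how many matching words fall in each year or month prefix of the fileid. This is a sketch of the idea over a toy dictionary of word lists, not a replacement for ConditionalFreqDist, and count_mentions is an invented name.

```python
from collections import Counter, defaultdict

def count_mentions(docs, distros, limit):
    """Count words starting with each distro name, grouped by fileid[:limit]."""
    counts = defaultdict(Counter)
    for fileid, words in docs.items():
        period = fileid[:limit]   # '2008' for limit=4, '2008-12' for limit=7
        for w in words:
            for distro in distros:
                if w.lower().startswith(distro):
                    counts[distro][period] += 1
    return counts

# Toy stand-in for corpus.fileids() / corpus.words(fileid).
docs = {
    "2008-12/msg00001.html": ["Ubuntu", "and", "Debian", "users"],
    "2009-01/msg00002.html": ["ubuntu", "ubuntu-server", "on", "fedora"],
}
by_year = count_mentions(docs, ["ubuntu", "debian", "fedora"], 4)
by_month = count_mentions(docs, ["ubuntu", "debian", "fedora"], 7)
print(dict(by_year["ubuntu"]))
```

Because the match is startswith, a word like "ubuntu-server" counts for "ubuntu" as well, which is usually what you want here but can also pick up unrelated words for short names such as "arch".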
'Popular' distros by month
'Popular' distros by year
'Niche' distros by year
Random text generation
import random
words = [w.lower() for w in corpus.words()]
bigrams = nltk.bigrams(words)
cfd = nltk.ConditionalFreqDist(bigrams)

def generate_model(cfdist, word, num=15):
    for i in range(num):
       print word,
       words = list(cfdist[word])
       word = random.choice(words)

text = [w.lower() for w in corpus.words()]
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)
generate_model(cfd, 'hi', num=20)
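For comparison, the same bigram walk in Python 3 with only the standard library. Like the code above, it picks uniformly among the distinct words seen after the current one; the function names and the toy word list are assumptions for the example.

```python
import random
from collections import defaultdict

def build_successors(words):
    """Map each word to the list of distinct words seen directly after it."""
    successors = defaultdict(list)
    for first, second in zip(words, words[1:]):
        if second not in successors[first]:
            successors[first].append(second)
    return successors

def generate(successors, word, num=15, rng=random):
    """Walk the successor table from `word`, emitting at most `num` words."""
    out = []
    for _ in range(num):
        out.append(word)
        choices = successors.get(word)
        if not choices:   # dead end: this word was only ever seen last
            break
        word = rng.choice(choices)
    return " ".join(out)

words = "hi all , hi folks , i have a problem . hi all".split()
model = build_successors(words)
sentence = generate(model, "hi", num=8, rng=random.Random(0))
print(sentence)
```

Passing a seeded random.Random makes a run repeatable, which helps when you want to show the same nonsense twice.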
hi...
hi allan : ages since apparently yum erased . attempts
now venturing into config run ip 10 431 ms 57

hi serg it illegal address entries must *, t close relative info
many families continue fi into modem and reinstalled

hi wen and amended :) imageshack does for grade service
please blame . warning issued an overall environment
consists in

hi folks i accidentally due cause excitingly stupid idiots ,
deletion flag on adding option ? branded ) mounting them

hi guys do composite required </ emulator in for
unattended has info to catalyse a dbus will see atz init3
hi from Peter...
text = [w.lower() for w in corpus.words(categories=
          [c for c in authorcats if 'PeterL' in c])]


hi everyone , hence the database schema and that run on memberdb on mail
store is 12 . yep ,

hi anita , your favourite piece of cpu cycles , he was thinking i hear the middle
of failure .

hi anita , same vhost b internal ip / nine seem odd occasion i hazard . 25ghz
g4 ibook here

hi everyone , same ) on removes a "-- nicelevel nn " as intended . 00 . main
host basis

hi cameron , no biggie . candidates in to upgrade . ubuntu dom0 install if there
! now ). txt

hi cameron , attribution for 30 seconds , and runs out on linux to on www .
luv , these
interesting collocations
                              ...or not
text = [w.lower() for w in corpus.words() if w.isalpha()]
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(text)

finder.apply_freq_filter(3)
finder.nbest(bigram_measures.pmi, 10)
→
bufnewfile bufread
busmaster speccycle
cellx celly
cheswick bellovin
cread clocal
curtail atl
dmcrs rscem
dmmrbc dmost
dmost dmcrs
...
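Why the "...or not"? PMI rewards pairs whose words almost never appear apart, so rare jargon that always travels together floats to the top. A rough Python 3 sketch of the score, estimating probabilities from raw counts; it may differ in detail from NLTK's implementation, and pmi_scores is an illustrative name.

```python
import math
from collections import Counter

def pmi_scores(words, min_freq=1):
    """Score adjacent word pairs by pointwise mutual information."""
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    total = len(words)
    scores = {}
    for (a, b), n_ab in bigrams.items():
        if n_ab < min_freq:   # same role as apply_freq_filter
            continue
        # log2 of P(a, b) / (P(a) * P(b)), using counts as estimates
        scores[(a, b)] = math.log2(n_ab * total / (unigrams[a] * unigrams[b]))
    return scores

words = "new york is big and new york is old and the city is big".split()
scores = pmi_scores(words, min_freq=2)
best = max(scores, key=scores.get)
print(best)
```

Raising the frequency filter is the usual first remedy when the top pairs are all one-off oddities.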
oblig tag cloud


stopwords = nltk.corpus.stopwords.words('english')
words = [w.lower() for w in corpus.words()
                                if w.isalpha()]
words = [w for w in words if w not in stopwords]
word_fd = nltk.FreqDist(words)
wordmax = word_fd[word_fd.max()]
wordmin = 1000 #YMMV
taglist = word_fd.items()
# getRanges and writeCloud are helpers that are not shown on these slides
ranges = getRanges(wordmin, wordmax)
writeCloud(taglist, ranges, 'tags.html')
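The cloud-writing helpers are not shown, so here is one way the last step might look in Python 3. This is a guess at the general idea (scale each count to a font size, emit HTML), not the helpers used above; font_size, cloud_html and the pixel range are all assumptions.

```python
from html import escape

def font_size(count, lo, hi, min_px=10, max_px=40):
    """Scale a count linearly between min_px and max_px."""
    if hi == lo:
        return max_px
    return round(min_px + (count - lo) * (max_px - min_px) / (hi - lo))

def cloud_html(counts, wordmin):
    """Return HTML spans for words seen at least wordmin times."""
    kept = {w: c for w, c in counts.items() if c >= wordmin}
    if not kept:
        return ""
    lo, hi = min(kept.values()), max(kept.values())
    spans = [
        '<span style="font-size:%dpx">%s</span>' % (font_size(c, lo, hi), escape(w))
        for w, c in sorted(kept.items())
    ]
    return "\n".join(spans)

# Toy counts standing in for the FreqDist.
counts = {"linux": 50, "kernel": 20, "ubuntu": 35, "typo": 1}
markup = cloud_html(counts, wordmin=10)
print(markup)
```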
another one for Peter :)
cats =  [c for c in corpus.categories()
               if 'PeterL' in c]
words=[w.lower() for w in corpus.words(categories=cats)
                         if w.isalpha()]
wordmin = 10
  →
thanks!
for more corpus fun:
http://www.nltk.org/
                             The Book:
       'Natural Language Processing
                         with Python',
                 2nd ed. pub. Jan 2010



      These slides are © Brianna Laugher and are released under
           the Creative Commons Attribution ShareAlike license,
                    v3.0 unported. The data set is not free, sadly...

More Related Content

What's hot

Mastering Python 3 I/O (Version 2)
Mastering Python 3 I/O (Version 2)Mastering Python 3 I/O (Version 2)
Mastering Python 3 I/O (Version 2)
David Beazley (Dabeaz LLC)
 
Generators: The Final Frontier
Generators: The Final FrontierGenerators: The Final Frontier
Generators: The Final Frontier
David Beazley (Dabeaz LLC)
 
Python utan-stodhjul-motorsag
Python utan-stodhjul-motorsagPython utan-stodhjul-motorsag
Python utan-stodhjul-motorsagniklal
 
Python for-unix-and-linux-system-administration
Python for-unix-and-linux-system-administrationPython for-unix-and-linux-system-administration
Python for-unix-and-linux-system-administration
Victor Marcelino
 
Sphinx autodoc - automated api documentation - PyCon.KR 2015
Sphinx autodoc - automated api documentation - PyCon.KR 2015Sphinx autodoc - automated api documentation - PyCon.KR 2015
Sphinx autodoc - automated api documentation - PyCon.KR 2015
Takayuki Shimizukawa
 
Mastering Python 3 I/O
Mastering Python 3 I/OMastering Python 3 I/O
Mastering Python 3 I/O
David Beazley (Dabeaz LLC)
 
PuppetDB: New Adventures in Higher-Order Automation - PuppetConf 2013
PuppetDB: New Adventures in Higher-Order Automation - PuppetConf 2013PuppetDB: New Adventures in Higher-Order Automation - PuppetConf 2013
PuppetDB: New Adventures in Higher-Order Automation - PuppetConf 2013
Puppet
 
2015 bioinformatics python_strings_wim_vancriekinge
2015 bioinformatics python_strings_wim_vancriekinge2015 bioinformatics python_strings_wim_vancriekinge
2015 bioinformatics python_strings_wim_vancriekinge
Prof. Wim Van Criekinge
 
Python in Action (Part 2)
Python in Action (Part 2)Python in Action (Part 2)
Python in Action (Part 2)
David Beazley (Dabeaz LLC)
 
Understanding the Python GIL
Understanding the Python GILUnderstanding the Python GIL
Understanding the Python GIL
David Beazley (Dabeaz LLC)
 
Class 1: Welcome to programming
Class 1: Welcome to programmingClass 1: Welcome to programming
Class 1: Welcome to programming
Marc Gouw
 
Puppet at Opera Sofware - PuppetCamp Oslo 2013
Puppet at Opera Sofware - PuppetCamp Oslo 2013Puppet at Opera Sofware - PuppetCamp Oslo 2013
Puppet at Opera Sofware - PuppetCamp Oslo 2013
Cosimo Streppone
 
Python Tricks That You Can't Live Without
Python Tricks That You Can't Live WithoutPython Tricks That You Can't Live Without
Python Tricks That You Can't Live Without
Audrey Roy
 
Simplest-Ownage-Human-Observed… - Routers
 Simplest-Ownage-Human-Observed… - Routers Simplest-Ownage-Human-Observed… - Routers
Simplest-Ownage-Human-Observed… - RoutersLogicaltrust pl
 
D3 in Jupyter : PyData NYC 2015
D3 in Jupyter : PyData NYC 2015D3 in Jupyter : PyData NYC 2015
D3 in Jupyter : PyData NYC 2015
Brian Coffey
 
SD, a P2P bug tracking system
SD, a P2P bug tracking systemSD, a P2P bug tracking system
SD, a P2P bug tracking systemJesse Vincent
 
오픈소스로 시작하는 인공지능 실습
오픈소스로 시작하는 인공지능 실습오픈소스로 시작하는 인공지능 실습
오픈소스로 시작하는 인공지능 실습
Mario Cho
 
2015 bioinformatics python_io_wim_vancriekinge
2015 bioinformatics python_io_wim_vancriekinge2015 bioinformatics python_io_wim_vancriekinge
2015 bioinformatics python_io_wim_vancriekinge
Prof. Wim Van Criekinge
 
Generator Tricks for Systems Programmers, v2.0
Generator Tricks for Systems Programmers, v2.0Generator Tricks for Systems Programmers, v2.0
Generator Tricks for Systems Programmers, v2.0
David Beazley (Dabeaz LLC)
 
한국어와 NLTK, Gensim의 만남
한국어와 NLTK, Gensim의 만남한국어와 NLTK, Gensim의 만남
한국어와 NLTK, Gensim의 만남
Eunjeong (Lucy) Park
 

What's hot (20)

Mastering Python 3 I/O (Version 2)
Mastering Python 3 I/O (Version 2)Mastering Python 3 I/O (Version 2)
Mastering Python 3 I/O (Version 2)
 
Generators: The Final Frontier
Generators: The Final FrontierGenerators: The Final Frontier
Generators: The Final Frontier
 
Python utan-stodhjul-motorsag
Python utan-stodhjul-motorsagPython utan-stodhjul-motorsag
Python utan-stodhjul-motorsag
 
Python for-unix-and-linux-system-administration
Python for-unix-and-linux-system-administrationPython for-unix-and-linux-system-administration
Python for-unix-and-linux-system-administration
 
Sphinx autodoc - automated api documentation - PyCon.KR 2015
Sphinx autodoc - automated api documentation - PyCon.KR 2015Sphinx autodoc - automated api documentation - PyCon.KR 2015
Sphinx autodoc - automated api documentation - PyCon.KR 2015
 
Mastering Python 3 I/O
Mastering Python 3 I/OMastering Python 3 I/O
Mastering Python 3 I/O
 
PuppetDB: New Adventures in Higher-Order Automation - PuppetConf 2013
PuppetDB: New Adventures in Higher-Order Automation - PuppetConf 2013PuppetDB: New Adventures in Higher-Order Automation - PuppetConf 2013
PuppetDB: New Adventures in Higher-Order Automation - PuppetConf 2013
 
2015 bioinformatics python_strings_wim_vancriekinge
2015 bioinformatics python_strings_wim_vancriekinge2015 bioinformatics python_strings_wim_vancriekinge
2015 bioinformatics python_strings_wim_vancriekinge
 
Python in Action (Part 2)
Python in Action (Part 2)Python in Action (Part 2)
Python in Action (Part 2)
 
Understanding the Python GIL
Understanding the Python GILUnderstanding the Python GIL
Understanding the Python GIL
 
Class 1: Welcome to programming
Class 1: Welcome to programmingClass 1: Welcome to programming
Class 1: Welcome to programming
 
Puppet at Opera Sofware - PuppetCamp Oslo 2013
Puppet at Opera Sofware - PuppetCamp Oslo 2013Puppet at Opera Sofware - PuppetCamp Oslo 2013
Puppet at Opera Sofware - PuppetCamp Oslo 2013
 
Python Tricks That You Can't Live Without
Python Tricks That You Can't Live WithoutPython Tricks That You Can't Live Without
Python Tricks That You Can't Live Without
 
Simplest-Ownage-Human-Observed… - Routers
 Simplest-Ownage-Human-Observed… - Routers Simplest-Ownage-Human-Observed… - Routers
Simplest-Ownage-Human-Observed… - Routers
 
D3 in Jupyter : PyData NYC 2015
D3 in Jupyter : PyData NYC 2015D3 in Jupyter : PyData NYC 2015
D3 in Jupyter : PyData NYC 2015
 
SD, a P2P bug tracking system
SD, a P2P bug tracking systemSD, a P2P bug tracking system
SD, a P2P bug tracking system
 
오픈소스로 시작하는 인공지능 실습
오픈소스로 시작하는 인공지능 실습오픈소스로 시작하는 인공지능 실습
오픈소스로 시작하는 인공지능 실습
 
2015 bioinformatics python_io_wim_vancriekinge
2015 bioinformatics python_io_wim_vancriekinge2015 bioinformatics python_io_wim_vancriekinge
2015 bioinformatics python_io_wim_vancriekinge
 
Generator Tricks for Systems Programmers, v2.0
Generator Tricks for Systems Programmers, v2.0Generator Tricks for Systems Programmers, v2.0
Generator Tricks for Systems Programmers, v2.0
 
한국어와 NLTK, Gensim의 만남
한국어와 NLTK, Gensim의 만남한국어와 NLTK, Gensim의 만남
한국어와 NLTK, Gensim의 만남
 

Viewers also liked

Beyond Open Source - Arthur Sale
Beyond Open Source - Arthur SaleBeyond Open Source - Arthur Sale
Beyond Open Source - Arthur Sale
Brianna Laugher
 
Free and open geodata: From shadows to reality - Simon Greener
Free and open geodata: From shadows to reality - Simon GreenerFree and open geodata: From shadows to reality - Simon Greener
Free and open geodata: From shadows to reality - Simon Greener
Brianna Laugher
 
Wikimedia Commons for Fun & Profit
Wikimedia Commons for Fun & ProfitWikimedia Commons for Fun & Profit
Wikimedia Commons for Fun & Profit
Brianna Laugher
 
Future directions for copyright law - Laura Simes
Future directions for copyright law - Laura SimesFuture directions for copyright law - Laura Simes
Future directions for copyright law - Laura Simes
Brianna Laugher
 
Hacking MediaWiki (For Users)
Hacking MediaWiki (For Users)Hacking MediaWiki (For Users)
Hacking MediaWiki (For Users)
Brianna Laugher
 
CFFSW - Crowdfunded free software
CFFSW - Crowdfunded free softwareCFFSW - Crowdfunded free software
CFFSW - Crowdfunded free software
Brianna Laugher
 
Special:Contributions/newbies
Special:Contributions/newbiesSpecial:Contributions/newbies
Special:Contributions/newbies
Brianna Laugher
 

Viewers also liked (7)

Beyond Open Source - Arthur Sale
Beyond Open Source - Arthur SaleBeyond Open Source - Arthur Sale
Beyond Open Source - Arthur Sale
 
Free and open geodata: From shadows to reality - Simon Greener
Free and open geodata: From shadows to reality - Simon GreenerFree and open geodata: From shadows to reality - Simon Greener
Free and open geodata: From shadows to reality - Simon Greener
 
Wikimedia Commons for Fun & Profit
Wikimedia Commons for Fun & ProfitWikimedia Commons for Fun & Profit
Wikimedia Commons for Fun & Profit
 
Future directions for copyright law - Laura Simes
Future directions for copyright law - Laura SimesFuture directions for copyright law - Laura Simes
Future directions for copyright law - Laura Simes
 
Hacking MediaWiki (For Users)
Hacking MediaWiki (For Users)Hacking MediaWiki (For Users)
Hacking MediaWiki (For Users)
 
CFFSW - Crowdfunded free software
CFFSW - Crowdfunded free softwareCFFSW - Crowdfunded free software
CFFSW - Crowdfunded free software
 
Special:Contributions/newbies
Special:Contributions/newbiesSpecial:Contributions/newbies
Special:Contributions/newbies
 

Similar to Language Sleuthing HOWTO with NLTK

Text classification in scikit-learn
Text classification in scikit-learnText classification in scikit-learn
Text classification in scikit-learn
Jimmy Lai
 
What is Python?
What is Python?What is Python?
What is Python?
wesley chun
 
Filip palian mateuszkocielski. simplest ownage human observed… routers
Filip palian mateuszkocielski. simplest ownage human observed… routersFilip palian mateuszkocielski. simplest ownage human observed… routers
Filip palian mateuszkocielski. simplest ownage human observed… routersYury Chemerkin
 
A Gentle Introduction to Coding ... with Python
A Gentle Introduction to Coding ... with PythonA Gentle Introduction to Coding ... with Python
A Gentle Introduction to Coding ... with Python
Tariq Rashid
 
node.js, javascript and the future
node.js, javascript and the futurenode.js, javascript and the future
node.js, javascript and the futureJeff Miccolis
 
2015 555 kharchenko_ppt
2015 555 kharchenko_ppt2015 555 kharchenko_ppt
2015 555 kharchenko_ppt
Maxym Kharchenko
 
Apidays Paris 2023 - Forget TypeScript, Choose Rust to build Robust, Fast and...
Apidays Paris 2023 - Forget TypeScript, Choose Rust to build Robust, Fast and...Apidays Paris 2023 - Forget TypeScript, Choose Rust to build Robust, Fast and...
Apidays Paris 2023 - Forget TypeScript, Choose Rust to build Robust, Fast and...
apidays
 
Your Library Sucks, and why you should use it.
Your Library Sucks, and why you should use it.Your Library Sucks, and why you should use it.
Your Library Sucks, and why you should use it.Peter Higgins
 
Scale11x lxc talk
Scale11x lxc talkScale11x lxc talk
Scale11x lxc talkdotCloud
 
Integration Testing With Cucumber How To Test Anything J A O O 2009
Integration Testing With  Cucumber    How To Test Anything    J A O O 2009Integration Testing With  Cucumber    How To Test Anything    J A O O 2009
Integration Testing With Cucumber How To Test Anything J A O O 2009Dr Nic Williams
 
Kyo - Functional Scala 2023.pdf
Kyo - Functional Scala 2023.pdfKyo - Functional Scala 2023.pdf
Kyo - Functional Scala 2023.pdf
Flavio W. Brasil
 
Os Vanrossum
Os VanrossumOs Vanrossum
Os Vanrossumoscon2007
 
Trust boundaries - Confidence 2015
Trust boundaries - Confidence 2015Trust boundaries - Confidence 2015
Trust boundaries - Confidence 2015
Logicaltrust pl
 
Rust: Reach Further
Rust: Reach FurtherRust: Reach Further
Rust: Reach Further
nikomatsakis
 
Designing A Project Using Java Programming
Designing A Project Using Java ProgrammingDesigning A Project Using Java Programming
Designing A Project Using Java Programming
Katy Allen
 
Introduction to python
Introduction to pythonIntroduction to python
Introduction to python
Rajesh Rajamani
 
Data analysis in R
Data analysis in RData analysis in R
Data analysis in R
Andrew Lowe
 
Slackware Demystified [SELF 2011]
Slackware Demystified [SELF 2011]Slackware Demystified [SELF 2011]
Slackware Demystified [SELF 2011]
Vincent Batts
 
Write better python code with these 10 tricks | by yong cui, ph.d. | aug, 202...
Write better python code with these 10 tricks | by yong cui, ph.d. | aug, 202...Write better python code with these 10 tricks | by yong cui, ph.d. | aug, 202...
Write better python code with these 10 tricks | by yong cui, ph.d. | aug, 202...
amit kuraria
 

Similar to Language Sleuthing HOWTO with NLTK (20)

Text classification in scikit-learn
Text classification in scikit-learnText classification in scikit-learn
Text classification in scikit-learn
 
What is Python?
What is Python?What is Python?
What is Python?
 
Filip palian mateuszkocielski. simplest ownage human observed… routers
Filip palian mateuszkocielski. simplest ownage human observed… routersFilip palian mateuszkocielski. simplest ownage human observed… routers
Filip palian mateuszkocielski. simplest ownage human observed… routers
 
Intro
IntroIntro
Intro
 
A Gentle Introduction to Coding ... with Python
A Gentle Introduction to Coding ... with PythonA Gentle Introduction to Coding ... with Python
A Gentle Introduction to Coding ... with Python
 
node.js, javascript and the future
node.js, javascript and the futurenode.js, javascript and the future
node.js, javascript and the future
 
2015 555 kharchenko_ppt
2015 555 kharchenko_ppt2015 555 kharchenko_ppt
2015 555 kharchenko_ppt
 
Apidays Paris 2023 - Forget TypeScript, Choose Rust to build Robust, Fast and...
Apidays Paris 2023 - Forget TypeScript, Choose Rust to build Robust, Fast and...Apidays Paris 2023 - Forget TypeScript, Choose Rust to build Robust, Fast and...
Apidays Paris 2023 - Forget TypeScript, Choose Rust to build Robust, Fast and...
 
Your Library Sucks, and why you should use it.
Your Library Sucks, and why you should use it.Your Library Sucks, and why you should use it.
Your Library Sucks, and why you should use it.
 
Scale11x lxc talk
Scale11x lxc talkScale11x lxc talk
Scale11x lxc talk
 
Integration Testing With Cucumber How To Test Anything J A O O 2009
Integration Testing With  Cucumber    How To Test Anything    J A O O 2009Integration Testing With  Cucumber    How To Test Anything    J A O O 2009
Integration Testing With Cucumber How To Test Anything J A O O 2009
 
Kyo - Functional Scala 2023.pdf
Kyo - Functional Scala 2023.pdfKyo - Functional Scala 2023.pdf
Kyo - Functional Scala 2023.pdf
 
Os Vanrossum
Os VanrossumOs Vanrossum
Os Vanrossum
 
Trust boundaries - Confidence 2015
Trust boundaries - Confidence 2015Trust boundaries - Confidence 2015
Trust boundaries - Confidence 2015
 
Rust: Reach Further
Rust: Reach FurtherRust: Reach Further
Rust: Reach Further
 
Designing A Project Using Java Programming
Designing A Project Using Java ProgrammingDesigning A Project Using Java Programming
Designing A Project Using Java Programming
 
Introduction to python
Introduction to pythonIntroduction to python
Introduction to python
 
Data analysis in R
Data analysis in RData analysis in R
Data analysis in R
 
Slackware Demystified [SELF 2011]
Slackware Demystified [SELF 2011]Slackware Demystified [SELF 2011]
Slackware Demystified [SELF 2011]
 
Write better python code with these 10 tricks | by yong cui, ph.d. | aug, 202...
Write better python code with these 10 tricks | by yong cui, ph.d. | aug, 202...Write better python code with these 10 tricks | by yong cui, ph.d. | aug, 202...
Write better python code with these 10 tricks | by yong cui, ph.d. | aug, 202...
 

More from Brianna Laugher

So You're A Software Developer, Now What? Exploring Career Growth
So You're A Software Developer, Now What? Exploring Career GrowthSo You're A Software Developer, Now What? Exploring Career Growth
So You're A Software Developer, Now What? Exploring Career Growth
Brianna Laugher
 
Dynamic viz in the IPython Notebook
Dynamic viz in the IPython NotebookDynamic viz in the IPython Notebook
Dynamic viz in the IPython Notebook
Brianna Laugher
 
Funcargs & other fun with pytest
Funcargs & other fun with pytestFuncargs & other fun with pytest
Funcargs & other fun with pytest
Brianna Laugher
 
Zookeepr: Home-grown conference management software
Zookeepr: Home-grown conference management softwareZookeepr: Home-grown conference management software
Zookeepr: Home-grown conference management software
Brianna Laugher
 
BarCamp Geelong - Why gender should be a text field
BarCamp Geelong - Why gender should be a text fieldBarCamp Geelong - Why gender should be a text field
BarCamp Geelong - Why gender should be a text field
Brianna Laugher
 
Distributed wikis
Distributed wikisDistributed wikis
Distributed wikis
Brianna Laugher
 
Neurosexism
NeurosexismNeurosexism
Neurosexism
Brianna Laugher
 
Clash of the encyclopedias - is competition good for sharing?
Clash of the encyclopedias - is competition good for sharing?Clash of the encyclopedias - is competition good for sharing?
Clash of the encyclopedias - is competition good for sharing?
Brianna Laugher
 
Visualising geo-data
Visualising geo-dataVisualising geo-data
Visualising geo-data
Brianna Laugher
 
Wiki[mp]edia data sources & the MediaWiki API
Wiki[mp]edia data sources & the MediaWiki APIWiki[mp]edia data sources & the MediaWiki API
Wiki[mp]edia data sources & the MediaWiki API
Brianna Laugher
 
GLAM-WIKI - Wikimedia tech infrastructure
GLAM-WIKI - Wikimedia tech infrastructureGLAM-WIKI - Wikimedia tech infrastructure
GLAM-WIKI - Wikimedia tech infrastructure
Brianna Laugher
 
The right level of detail (MediaWiki, APIs)
The right level of detail (MediaWiki, APIs)The right level of detail (MediaWiki, APIs)
The right level of detail (MediaWiki, APIs)
Brianna Laugher
 
Free as in Market: Liberty and Property - Rusty Russell
Free as in Market: Liberty and Property - Rusty RussellFree as in Market: Liberty and Property - Rusty Russell
Free as in Market: Liberty and Property - Rusty Russell
Brianna Laugher
 
Public history in the digital age - Claudine Chionh
Public history in the digital age - Claudine ChionhPublic history in the digital age - Claudine Chionh
Public history in the digital age - Claudine Chionh
Brianna Laugher
 
It's all fun and games until someone wants to sue you: Reporting in the age o...
It's all fun and games until someone wants to sue you: Reporting in the age o...It's all fun and games until someone wants to sue you: Reporting in the age o...
It's all fun and games until someone wants to sue you: Reporting in the age o...
Brianna Laugher
 
Gratis & libre - Liam Wyatt
Gratis & libre - Liam WyattGratis & libre - Liam Wyatt
Gratis & libre - Liam Wyatt
Brianna Laugher
 
OpenAustralia - Everyday democracy for everybody in Australia - Matthew Landauer
OpenAustralia - Everyday democracy for everybody in Australia - Matthew LandauerOpenAustralia - Everyday democracy for everybody in Australia - Matthew Landauer
OpenAustralia - Everyday democracy for everybody in Australia - Matthew Landauer
Brianna Laugher
 
Freedom Fighting: How do we convince the powers that be to relax their grip? ...
Freedom Fighting: How do we convince the powers that be to relax their grip? ...Freedom Fighting: How do we convince the powers that be to relax their grip? ...
Freedom Fighting: How do we convince the powers that be to relax their grip? ...
Brianna Laugher
 
Who's behind Wikipedia?
Who's behind Wikipedia?Who's behind Wikipedia?
Who's behind Wikipedia?
Brianna Laugher
 
How Free Software makes Wikipedia possible
How Free Software makes Wikipedia possibleHow Free Software makes Wikipedia possible
How Free Software makes Wikipedia possible
Brianna Laugher
 

More from Brianna Laugher (20)

So You're A Software Developer, Now What? Exploring Career Growth
So You're A Software Developer, Now What? Exploring Career GrowthSo You're A Software Developer, Now What? Exploring Career Growth
So You're A Software Developer, Now What? Exploring Career Growth
 
Dynamic viz in the IPython Notebook
Dynamic viz in the IPython NotebookDynamic viz in the IPython Notebook
Dynamic viz in the IPython Notebook
 
Funcargs & other fun with pytest
Funcargs & other fun with pytestFuncargs & other fun with pytest
Funcargs & other fun with pytest
 
Zookeepr: Home-grown conference management software
Zookeepr: Home-grown conference management softwareZookeepr: Home-grown conference management software
Zookeepr: Home-grown conference management software
 
BarCamp Geelong - Why gender should be a text field
BarCamp Geelong - Why gender should be a text fieldBarCamp Geelong - Why gender should be a text field
BarCamp Geelong - Why gender should be a text field
 
Distributed wikis
Distributed wikisDistributed wikis
Distributed wikis
 
Neurosexism
NeurosexismNeurosexism
Neurosexism
 
Clash of the encyclopedias - is competition good for sharing?
Clash of the encyclopedias - is competition good for sharing?Clash of the encyclopedias - is competition good for sharing?
Clash of the encyclopedias - is competition good for sharing?
 
Visualising geo-data
Visualising geo-dataVisualising geo-data
Visualising geo-data
 
Wiki[mp]edia data sources & the MediaWiki API
Wiki[mp]edia data sources & the MediaWiki APIWiki[mp]edia data sources & the MediaWiki API
Wiki[mp]edia data sources & the MediaWiki API
 
GLAM-WIKI - Wikimedia tech infrastructure
GLAM-WIKI - Wikimedia tech infrastructureGLAM-WIKI - Wikimedia tech infrastructure
GLAM-WIKI - Wikimedia tech infrastructure
 
The right level of detail (MediaWiki, APIs)
The right level of detail (MediaWiki, APIs)The right level of detail (MediaWiki, APIs)
The right level of detail (MediaWiki, APIs)
 
Free as in Market: Liberty and Property - Rusty Russell
Free as in Market: Liberty and Property - Rusty RussellFree as in Market: Liberty and Property - Rusty Russell
Free as in Market: Liberty and Property - Rusty Russell
 
Public history in the digital age - Claudine Chionh
Public history in the digital age - Claudine ChionhPublic history in the digital age - Claudine Chionh
Public history in the digital age - Claudine Chionh
 
It's all fun and games until someone wants to sue you: Reporting in the age o...
It's all fun and games until someone wants to sue you: Reporting in the age o...It's all fun and games until someone wants to sue you: Reporting in the age o...
It's all fun and games until someone wants to sue you: Reporting in the age o...
 
Gratis & libre - Liam Wyatt
Gratis & libre - Liam WyattGratis & libre - Liam Wyatt
Gratis & libre - Liam Wyatt
 
OpenAustralia - Everyday democracy for everybody in Australia - Matthew Landauer
OpenAustralia - Everyday democracy for everybody in Australia - Matthew LandauerOpenAustralia - Everyday democracy for everybody in Australia - Matthew Landauer
OpenAustralia - Everyday democracy for everybody in Australia - Matthew Landauer
 
Freedom Fighting: How do we convince the powers that be to relax their grip? ...
Freedom Fighting: How do we convince the powers that be to relax their grip? ...Freedom Fighting: How do we convince the powers that be to relax their grip? ...
Freedom Fighting: How do we convince the powers that be to relax their grip? ...
 
Who's behind Wikipedia?
Who's behind Wikipedia?Who's behind Wikipedia?
Who's behind Wikipedia?
 
How Free Software makes Wikipedia possible
How Free Software makes Wikipedia possibleHow Free Software makes Wikipedia possible
How Free Software makes Wikipedia possible
 

Language Sleuthing HOWTO with NLTK

  • 1. Language Sleuthing HOWTO or Discovering Interesting Things with Python's Natural Language Tool Kit Brianna Laugher modernthings.org brianna[@.]laugher.id.au
  • 2. Corpus linguistics on web texts why?
  • 3. Because the web is full of language data Because linguistic techniques can reveal unexpected insights Because I don't want to have to read everything
  • 5. luv-main as a corpus √ Big collection of text x Messy data x Not annotated
  • 6. what's interesting? conversations topics change over time (authors)
  • 8. wget vs Python script √ wget is purpose-built √ convenient options like --convert-links
  • 9. Meaningful URLs FTW Sympa/MhonArc: lists.luv.asn.au/wws/arc/luv-main/ 2009-04/ msg00057.html
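The URL layout on slide 9 already encodes the year, month and message number, so that metadata can be read straight off each path. The following is a minimal Python 3 sketch of that idea; the regular expression and the `parse_archive_url` helper are illustrative assumptions, not something shown in the slides.

```python
import re

# Hypothetical pattern for the layout shown on the slide:
# <list>/<YYYY-MM>/msg<NNNNN>.html
ARCHIVE_RE = re.compile(
    r"(?P<list>[^/]+)/(?P<year>\d{4})-(?P<month>\d{2})/msg(?P<num>\d+)\.html$"
)

def parse_archive_url(url):
    """Return list name, year, month and message number, or None if no match."""
    match = ARCHIVE_RE.search(url)
    if match is None:
        return None
    return {
        "list": match.group("list"),
        "year": int(match.group("year")),
        "month": int(match.group("month")),
        "msg": int(match.group("num")),
    }

info = parse_archive_url("lists.luv.asn.au/wws/arc/luv-main/2009-04/msg00057.html")
print(info)
```

Anything that does not end in the `YYYY-MM/msgNNNNN.html` shape returns `None`, which makes it easy to skip index pages while walking a mirrored archive.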
  • 10.
  • 12. Cleaning for what? Remove archive boilerplate Remove HTML Remove quoted text? Remove signatures?
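Two of the cleaning questions on slide 12 (quoted text and signatures) can be illustrated on a plain-text message body. This Python 3 sketch assumes the common email conventions of `>`-prefixed quotes and a `-- ` signature separator; the luv-main archive is HTML, so this is a stand-in for the idea rather than the method used in the talk, and `strip_quotes_and_signature` is a hypothetical helper.

```python
def strip_quotes_and_signature(body):
    """Drop '>'-quoted lines and everything after a signature separator."""
    kept = []
    for line in body.splitlines():
        if line in ("-- ", "--"):          # conventional signature separator
            break
        if line.lstrip().startswith(">"):  # quoted text from an earlier message
            continue
        kept.append(line)
    return "\n".join(kept).strip()

# Invented sample message, for illustration only.
message = (
    "On Monday, X wrote:\n"
    "> old text\n"
    "> more old text\n"
    "New reply here.\n"
    "\n"
    "-- \n"
    "Jane\n"
    "http://example.org\n"
)
cleaned = strip_quotes_and_signature(message)
print(cleaned)
```

Whether to strip these at all depends on the question being asked: quoted text inflates word counts, but it also carries the conversational structure.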
  • 13. J.W. J.W. W.E.
  • 14. Behind the scenes J.W. W.E.
  • 15. what are we aiming for? what do NLTK corpora look like?
  • 16. Getting NLTK
       sudo apt-get install python-nltk    (in Ubuntu 10.04)
       or
       sudo apt-get install python-pip
       pip install nltk
       or
       from source at nltk.org/download
  • 17. Getting NLTK data... an “NLTKism”
  • 18.
  • 20. Brown corpus A CategorizedTagged corpus: Dear/jj Sirs/nns :/: Let/vb me/ppo begin/vb by/in clearing/vbg up/in any/dti possible/jj misconception/nn in/in your/pp$ minds/nns ,/, wherever/wrb you/ppss are/ber ./. The/at collective/nn by/in which/wdt I/ppss address/vb you/ppo in/in the/at title/nn above/rb is/bez neither/cc patronizing/vbg nor/cc jocose/jj but/cc an/at exact/jj industrial/jj term/nn in/in use/nn among/in professional/jj thieves/nns ./.
  • 21. Inaugural corpus A Plaintext corpus: My fellow citizens: I stand here today humbled by the task before us, grateful for the trust you have bestowed, mindful of the sacrifices borne by our ancestors. I thank President Bush for his service to our nation, as well as the generosity and cooperation he has shown throughout this transition. Forty-four Americans have now taken the presidential oath. ...............
  • 22. But we still have lots of HTML...
  • 23.
  • 24. BeautifulSoup to the rescue
       >>> from BeautifulSoup import BeautifulSoup as BS
       >>> data = open(filename, 'r').read()
       >>> soup = BS(data)
       >>> print '\n'.join(soup.findAll(text=True))
  • 25.
  • 27. What about blockquotes?
       >>> bqs = s.findAll('blockquote')
       >>> [bq.extract() for bq in bqs]
       >>> print '\n'.join(s.findAll(text=True))
       On 05/08/2007, at 12:05 PM, [...] wrote:
       If u want it USB bootable, just burn the DSL boot disk to CD and fire it up. Then from the desktop after boot, right click and create the bootable USB key yourself. I havent actually done this myself (only seen the option from the menu), but I am assuming it will be a fairly painless process if you are happy with the stock image. Would be interested in how you go as I have to build 50 USB bootable DSL's in the next couple weeks.
       Regards, [...]
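The same two steps from slides 24 and 27 (keep only text nodes, drop anything inside a blockquote) can be sketched with Python 3's standard library when BeautifulSoup is not available. This uses `html.parser.HTMLParser` as a stand-in, not the BeautifulSoup API from the slides; the `TextWithoutQuotes` class and the sample HTML are invented for illustration.

```python
from html.parser import HTMLParser

class TextWithoutQuotes(HTMLParser):
    """Collect text nodes, ignoring anything nested inside <blockquote>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decodes entities such as &#xA0;
        self.depth = 0      # current <blockquote> nesting level
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "blockquote":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "blockquote" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is outside every blockquote.
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def visible_text(html):
    parser = TextWithoutQuotes()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)

# Invented sample markup, loosely modelled on an archived reply.
sample = (
    "<p>On 05/08/2007, someone wrote:</p>"
    "<blockquote><p>How do I make it USB bootable?</p></blockquote>"
    "<p>Just burn the boot disk to CD.&#xA0; Then boot it.</p>"
)
result = visible_text(sample)
print(result)
```

Tracking nesting depth rather than a single flag matters on mailing lists, where replies to replies produce blockquotes inside blockquotes.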
  • 29. Getting it into NLTK
       import nltk
       path = 'path/to/files'
       corpus = nltk.corpus.PlaintextCorpusReader(path, '.*\.html')
  • 30. What about our metadata?
       Create a Python dictionary that maps filenames to categories, e.g.
       categories = {}
       categories['2008-12/msg00226.html'] = ['year-2008', 'month-12', 'author-BM<bm@xxxxx>']
       ....etc
       then...
       import nltk
       path = 'path/to/files/'
       corpus = nltk.corpus.CategorizedPlaintextCorpusReader(path, '.*\.html', cat_map=categories)
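Slide 30 fills the category dictionary by hand for one message. Since the year and month are already in each file id, the dictionary can be built in a loop; this Python 3 sketch assumes the `YYYY-MM/msgNNNNN.html` layout from the slides and takes the author as a separate input. The `categories_for` helper and the sample pairs are hypothetical.

```python
import re

FILEID_RE = re.compile(r"^(\d{4})-(\d{2})/msg\d+\.html$")

def categories_for(fileid, author):
    """Build the category list for one message, in the slide's naming style."""
    match = FILEID_RE.match(fileid)
    if match is None:
        raise ValueError("unexpected file id: %r" % fileid)
    year, month = match.groups()
    return ["year-" + year, "month-" + month, "author-" + author]

# Hypothetical (file id, author) pairs; in practice the author would be
# pulled from each message during the cleaning step.
messages = [
    ("2008-12/msg00226.html", "BM"),
    ("2009-04/msg00057.html", "JW"),
]
categories = {fileid: categories_for(fileid, author) for fileid, author in messages}
print(categories["2008-12/msg00226.html"])
```

The resulting dictionary has the same shape as the one passed as `cat_map` on the slide.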
  • 31. Simple categories
       cats = corpus.categories()
       authorcats = [c for c in cats if c.startswith('author')]
       #>>> len(authorcats)
       #608
       yearcats = [c for c in cats if c.startswith('year')]
       monthcats = [c for c in cats if c.startswith('month')]
  • 32. ...who are the top posters?
       posts = [(len(corpus.fileids(author)), author) for author in authorcats]
       posts.sort(reverse=True)
       for count, author in posts[:10]:
           print "%5d\t%s" % (count, author)
       →
        1304  author-JW
        1294  author-RC
        1243  author-CS
        1030  author-JH
         868  author-DP
         752  author-TWB
         608  author-CS#2
         556  author-TL
         452  author-BM
         412  author-RM
       (email me if you're curious to know if you're on it...)
  • 33. Frequency distributions
       popular = ['ubuntu', 'debian', 'fedora', 'arch']
       niche = ['gentoo', 'suse', 'centos', 'redhat']
       def getcfd(distros, limit):
           cfd = nltk.ConditionalFreqDist(
               (distro, fileid[:limit])
               for fileid in corpus.fileids()
               for w in corpus.words(fileid)
               for distro in distros
               if w.lower().startswith(distro))
           return cfd
       popularcfd = getcfd(popular, 4)  # or 7 for months
       popularcfd.plot()
       nichecfd = getcfd(niche, 4)
       nichecfd.plot()
       another “NLTKism”
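What the conditional frequency distribution on slide 33 is counting can be made explicit with plain dictionaries. This Python 3 sketch uses `collections.Counter` as a stand-in for `nltk.ConditionalFreqDist` (no plotting); the `distro_counts` helper and the toy `posts` mapping are assumptions for illustration.

```python
from collections import Counter, defaultdict

def distro_counts(posts, distros, limit=4):
    """Count distro mentions per period.

    posts maps a file id such as '2008-12/msg00226.html' to its tokens;
    fileid[:limit] is the period (4 gives the year, 7 the year and month).
    """
    cfd = defaultdict(Counter)
    for fileid, words in posts.items():
        period = fileid[:limit]
        for w in words:
            for distro in distros:
                if w.lower().startswith(distro):
                    cfd[distro][period] += 1
    return cfd

# Toy data standing in for corpus.fileids() and corpus.words(fileid).
posts = {
    "2008-12/msg00001.html": ["Ubuntu", "and", "Debian", "on", "ubuntu-server"],
    "2009-04/msg00002.html": ["Fedora", "vs", "Ubuntu"],
}
cfd = distro_counts(posts, ["ubuntu", "debian", "fedora", "arch"])
for distro, counts in sorted(cfd.items()):
    print(distro, dict(counts))
```

One thing the prefix match implies: a short name like 'arch' will also count words such as "archive" or "architecture", so its curve is likely to be inflated relative to the others.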
  • 37. Random text generation
       import random
       def generate_model(cfdist, word, num=15):
           for i in range(num):
               print word,
               words = list(cfdist[word])
               word = random.choice(words)
       text = [w.lower() for w in corpus.words()]
       bigrams = nltk.bigrams(text)
       cfd = nltk.ConditionalFreqDist(bigrams)
       generate_model(cfd, 'hi', num=20)
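The generator on slide 37 is a random walk over bigrams: start at a word, pick one of the distinct words that ever followed it, repeat. A minimal Python 3 sketch of the same walk without NLTK follows; `build_followers`, `generate` and the toy token list are illustrative assumptions, and a dead-end check is added that the slide version does not have.

```python
import random
from collections import defaultdict

def build_followers(tokens):
    """Map each word to the list of distinct words seen right after it."""
    followers = defaultdict(list)
    for first, second in zip(tokens, tokens[1:]):
        if second not in followers[first]:
            followers[first].append(second)
    return followers

def generate(followers, word, num=15, rng=random):
    """Walk the bigram table, picking a random follower at each step."""
    out = []
    for _ in range(num):
        out.append(word)
        choices = followers.get(word)
        if not choices:          # dead end: nothing ever followed this word
            break
        word = rng.choice(choices)
    return " ".join(out)

# Toy token stream standing in for corpus.words().
tokens = "hi all , hi folks , the list is down . hi all".split()
followers = build_followers(tokens)
generated = generate(followers, "hi", num=8, rng=random.Random(0))
print(generated)
```

Because each follower is chosen uniformly from the distinct candidates, rare continuations are as likely as common ones, which goes some way to explaining how odd the generated posts read.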
  • 38. hi...
       hi allan : ages since apparently yum erased . attempts now venturing into config run ip 10 431 ms 57
       hi serg it illegal address entries must *, t close relative info many families continue fi into modem and reinstalled
       hi wen and amended :) imageshack does for grade service please blame . warning issued an overall environment consists in
       hi folks i accidentally due cause excitingly stupid idiots , deletion flag on adding option ? branded ) mounting them
       hi guys do composite required </ emulator in for unattended has info to catalyse a dbus will see atz init3
  • 39. hi from Peter...
       text = [w.lower() for w in corpus.words(categories=[c for c in authorcats if 'PeterL' in c])]
       hi everyone , hence the database schema and that run on memberdb on mail store is 12 . yep ,
       hi anita , your favourite piece of cpu cycles , he was thinking i hear the middle of failure .
       hi anita , same vhost b internal ip / nine seem odd occasion i hazard . 25ghz g4 ibook here
       hi everyone , same ) on removes a "-- nicelevel nn " as intended . 00 . main host basis
       hi cameron , no biggie . candidates in to upgrade . ubuntu dom0 install if there ! now ). txt
       hi cameron , attribution for 30 seconds , and runs out on linux to on www . luv , these
  • 40. interesting collocations ...or not
       text = [w.lower() for w in corpus.words() if w.isalpha()]
       from nltk.collocations import *
       bigram_measures = nltk.collocations.BigramAssocMeasures()
       finder = BigramCollocationFinder.from_words(text)
       finder.apply_freq_filter(3)
       finder.nbest(bigram_measures.pmi, 10)
       →
       bufnewfile bufread
       busmaster speccycle
       cellx celly
       cheswick bellovin
       cread clocal
       curtail atl
       dmcrs rscem
       dmmrbc dmost
       dmost dmcrs
       ...
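The collocation finder on slide 40 ranks word pairs by pointwise mutual information (PMI): log2 of how much more often a pair occurs together than its words' individual frequencies would predict. This Python 3 sketch computes that score by hand on a toy sentence; `pmi_scores` is a hypothetical helper, and its probability estimates (counts divided by the token total) are a simplification that may differ in detail from NLTK's implementation.

```python
import math
from collections import Counter

def pmi_scores(tokens, min_freq=1):
    """Score each adjacent word pair by pointwise mutual information."""
    n = len(tokens)
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (a, b), n_ab in pair_counts.items():
        if n_ab < min_freq:      # same role as apply_freq_filter on the slide
            continue
        # log2( P(a, b) / (P(a) * P(b)) ), probabilities estimated as count / n
        scores[(a, b)] = math.log2(n_ab * n / (word_counts[a] * word_counts[b]))
    return scores

tokens = "the cat sat on the mat near the cat".split()
scores = pmi_scores(tokens)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(pair, round(score, 2))
```

In the toy data the pair made of two one-off words outranks the pair that actually repeats. PMI rewards words that appear almost only together, which is consistent with the slide's top results being config keywords and surname pairs rather than everyday phrases.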
  • 41. oblig tag cloud
       stopwords = nltk.corpus.stopwords.words('english')
       words = [w.lower() for w in corpus.words() if w.isalpha()]
       words = [w for w in words if w not in stopwords]
       word_fd = nltk.FreqDist(words)
       wordmax = word_fd[word_fd.max()]
       wordmin = 1000  #YMMV
       taglist = word_fd.items()
       ranges = getRanges(wordmin, wordmax)
       writeCloud(taglist, ranges, 'tags.html')
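Slide 41 calls two helpers, `getRanges` and `writeCloud`, whose bodies are not shown. The Python 3 sketch below is one guess at the general idea (scale each word's font size by its count, dropping words below a minimum); `size_for` and `render_cloud` are hypothetical names and are not a reconstruction of the original helpers.

```python
from html import escape

def size_for(count, lo, hi, min_px=10, max_px=40):
    """Linearly map a count in [lo, hi] to a font size in pixels."""
    if hi <= lo:
        return min_px
    count = max(lo, min(count, hi))
    return round(min_px + (count - lo) * (max_px - min_px) / (hi - lo))

def render_cloud(freqs, min_count):
    """Return an HTML fragment with one sized <span> per frequent word."""
    shown = {w: c for w, c in freqs.items() if c >= min_count}
    if not shown:
        return ""
    hi = max(shown.values())
    spans = [
        '<span style="font-size:%dpx">%s</span>'
        % (size_for(c, min_count, hi), escape(w))
        for w, c in sorted(shown.items())
    ]
    return "\n".join(spans)

# Invented counts, for illustration only.
cloud = render_cloud({"linux": 50, "ubuntu": 30, "kernel": 10, "foo": 2}, min_count=10)
print(cloud)
```

The `min_count` argument plays the role of `wordmin` on the slide: it needs tuning to the corpus size, hence the slide's "YMMV".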
  • 42.
  • 43. another one for Peter :)
       cats = [c for c in corpus.categories() if 'PeterL' in c]
       words = [w.lower() for w in corpus.words(categories=cats) if w.isalpha()]
       wordmin = 10
       →
  • 44. thanks! for more corpus fun: http://www.nltk.org/ The Book: 'Natural Language Processing with Python', 2nd ed. pub. Jan 2010 These slides are © Brianna Laugher and are released under the Creative Commons Attribution ShareAlike license, v3.0 unported. The data set is not free, sadly...