Knowledge extraction from
the Encyclopedia of Life
Using Python NLTK
Anne Thessen
annethessen@gmail.com
This presentation demonstrates the potential for NLTK to extract information about ecological species interactions from text in EOL. It was presented Nov 12, 2013 at the Startup Institute in Cambridge, MA for the Boston PyLadies monthly meeting.


Transcript of "Knowledge extraction from the Encyclopedia of Life using Python NLTK"

  1. Knowledge extraction from the Encyclopedia of Life Using Python NLTK. Anne Thessen, annethessen@gmail.com
  2. Finding Taxonomic Names
  3. Challenges: Eastern Lowland Gorilla, Gorilla berengei, Gorilla beringei mikenensis, Gorilla gorilla, Gorilla beringei Matschie, King kong, ゴリラ, Gorille, 大猩猩, Virunga, Горилла, Gorilla graueri, Koko, Mountain gorilla, Guerilla, Gorila
  4. Challenges: the same genus name can refer to different organisms.
      Aotus nancymaae, contextual data: Primate, Monkey, Eyes, Food, Panama
      Aotus mollis, contextual data: Legume, Plant, Flower, Mirbelieae, Australia
      Disambiguate by authority, species, contextual data
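One way to read "disambiguate by contextual data" is as keyword overlap: score each candidate name by how many of its context words appear in the surrounding text. The sketch below is a minimal illustration under that assumption. `CONTEXT_PROFILES`, `disambiguate` and the sample passage are hypothetical and are not part of GNRD or NLTK; the keyword sets are simply the words listed on the slide.

```python
import re

# Hypothetical context profiles, built from the words shown on the slide.
CONTEXT_PROFILES = {
    "Aotus nancymaae": {"primate", "monkey", "eyes", "food", "panama"},
    "Aotus mollis": {"legume", "plant", "flower", "mirbelieae", "australia"},
}

def disambiguate(text, profiles=CONTEXT_PROFILES):
    """Return (best_name, scores), scoring each name by profile words found in text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {name: len(words & keywords) for name, keywords in profiles.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Invented sample sentence, for illustration only.
passage = "Aotus is a nocturnal monkey with large eyes, found from Panama southward."
best, scores = disambiguate(passage)
print(best, scores)
```

A real system would probably also use the authority string and the species epithet, as the slide suggests; overlap counting alone breaks down when a passage mentions neither set of context words.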
  5. Beautiful Soup, GNRD, Resolver
  6. • Common names
      • Interaction type
  7. • Common names
      • Interaction type
  8. Python NLTK
      • http://nltk.org/book/
      • http://nltk.org/
      • Install NLTK and NLTK Data
  9. Python NLTK
      • http://nltk.org/book/
      • http://nltk.org/
      • Install NLTK and NLTK Data
  10. • Natural Language Processing (NLP)
  11. • Natural Language Processing (NLP)
      • Semantic Statistics
      Robin is the name of several fictional characters appearing in comic books published by DC Comics, originally created by Bob Kane, Bill Finger and Jerry Robinson, as a junior counterpart to DC Comics superhero Batman. The team of Batman and Robin is commonly referred to as the Dynamic Duo or the Caped Crusaders.
      The American Robin is active mostly during the day and assembles in large flocks at night. It is one of the earliest bird species to lay eggs, beginning to breed shortly after returning to its summer range from its winter range. Its nest consists of long coarse grass, twigs, paper, and feathers, and is smeared with mud and often cushioned with grass or other soft materials. It is among the first birds to sing at dawn.
  12. Robin is the name of several fictional characters appearing in comic books published by DC Comics, originally created by Bob Kane, Bill Finger and Jerry Robinson, as a junior counterpart to DC Comics superhero Batman. The team of Batman and Robin is commonly referred to as the Dynamic Duo or the Caped Crusaders.
      • fictional • comic books • Bob Kane • superhero • Batman • Dynamic Duo • Caped Crusaders
      The American Robin is active mostly during the day and assembles in large flocks at night. It is one of the earliest bird species to lay eggs, beginning to breed shortly after returning to its summer range from its winter range. Its nest consists of long coarse grass, twigs, paper, and feathers, and is smeared with mud and often cushioned with grass or other soft materials. It is among the first birds to sing at dawn.
      • flocks • bird • eggs • nest • sing • species
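The two "Robin" passages suggest how simple word statistics can separate senses of the same word. The sketch below is a minimal, stdlib-only illustration: it counts content words in abridged versions of the two passages and compares the vocabularies. The `STOPWORDS` set is a small hand-picked assumption (NLTK ships a fuller stopword corpus), and `content_words` is a hypothetical helper, not the method used in the talk.

```python
import re
from collections import Counter

# Small hand-picked stopword list (an assumption, not NLTK's list).
STOPWORDS = {"the", "of", "and", "a", "is", "to", "in", "by", "as", "or",
             "its", "it", "at", "with", "from", "after", "one"}

def content_words(text):
    """Lowercase the text, split it into alphabetic words, and drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

# Abridged from the two passages on the slide.
comic = ("Robin is the name of several fictional characters appearing in comic "
         "books published by DC Comics, as a junior counterpart to DC Comics "
         "superhero Batman.")
bird = ("The American Robin is active mostly during the day and assembles in "
        "large flocks at night. It is one of the earliest bird species to lay eggs.")

comic_counts = Counter(content_words(comic))
bird_counts = Counter(content_words(bird))

# Words seen in only one passage hint at which "Robin" is meant.
comic_only = set(comic_counts) - set(bird_counts)
bird_only = set(bird_counts) - set(comic_counts)
shared = set(comic_counts) & set(bird_counts)
print(sorted(shared))
print(sorted(comic_only))
print(sorted(bird_only))
```

The only content word the two abridged passages share is the ambiguous one itself, which is why the surrounding vocabulary carries the signal.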
  13. Beautiful Soup, GNRD, Resolver
  14. Beautiful Soup, GNRD, Resolver
  15. From GNRD
      name_list = ["Pandarus sinuatus", "Pandarus smithii"]
      genera = []
      for name in name_list:
          row = name.split(' ')
          genera.append(row[0])  # keep only the genus
      # genera is now ["Pandarus", "Pandarus"]
  16. genera = ["Pandarus", "Pandarus"]
      i = -1  # index of the previous match; -1 means "none yet"
      genus_index_list = []
      for genus in genera:
          genus_text = tokens[i+1:]  # search only after the previous match
          genus_index = genus_text.index(genus)
          if i == -1:
              genus_index_list.append(genus_index)
          else:
              genus_index = genus_index + i + 1  # convert back to an index into tokens
              genus_index_list.append(genus_index)
          i = genus_index
      # genus_index_list is now [36, 39]
  17. genus_index_list = [36, 39]
      merged = 0  # each merge shortens tokens by one, so later indexes shift left
      for counter, index in enumerate(genus_index_list):
          index = index - merged
          species = ' '.join(tokens[index:index+2])  # Join the genus to the word immediately following.
          if species == name_list[counter]:  # Does this match the name_list?
              tokens[index:index+2] = [species]  # If yes, combine the two into one element
              merged += 1
  18. tokens = ['Great', 'white', 'sharks', 'are', 'apex', 'predators', ',', 'meaning', 'they', 'have', 'a',
                'large', 'effect', 'on', 'the', 'populations', 'of', 'their', 'prey', 'including', 'elephant',
                'seals', 'and', 'sea', 'lions.', 'Great', 'white', 'sharks', 'are', 'hosts', 'to', 'parasites',
                'such', 'as', 'copepods', '(', 'Pandarus sinuatus', 'and', 'Pandarus smithii', ')', '.']
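The fragments on slides 15 to 18 can be combined into one runnable example. The sketch below uses a different technique from the slides: instead of first locating genus positions, it walks the token list once and merges any adjacent pair that matches a known name. `merge_names` is a hypothetical helper, the sentence is shortened from the slide, and the whitespace tokenization is an assumption made to keep the example dependency-free.

```python
# Shortened from the slide's example sentence; tokens are pre-spaced so that
# str.split() yields punctuation as separate tokens.
text = ("Great white sharks are hosts to parasites such as copepods "
        "( Pandarus sinuatus and Pandarus smithii ) .")
tokens = text.split()
name_list = ["Pandarus sinuatus", "Pandarus smithii"]  # e.g. names found by GNRD

def merge_names(tokens, names):
    """Return a new token list in which each two-word name is a single token."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = " ".join(tokens[i:i + 2])
        if pair in names:
            merged.append(pair)  # genus + species become one element
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

merged = merge_names(tokens, set(name_list))
name_index_list = [i for i, tok in enumerate(merged) if tok in name_list]
print(merged)
print(name_index_list)
```

Building a new list avoids the index shifting that happens when a list is edited in place while stored positions are still being used.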
  19. name_index_list = [36, 38]
      Looking at the first relationship: Carcharodon carcharias, Pandarus sinuatus
      term_lists = []
      for name_index in name_index_list:
          # keep up to 10 tokens on either side of each name
          term_lists.append(tokens[max(0, name_index-10):name_index+10])
      term_list = term_lists[0]
      # term_list is now ['white', 'sharks', 'are', 'hosts', 'to', 'parasites', 'such', 'as', 'copepods', '(', 'Pandarus sinuatus', 'and', 'Pandarus smithii', ')', '.']
  20. Looking at the first relationship: Parasite/host (Carcharodon carcharias, Pandarus sinuatus)
      term_list = ['white', 'sharks', 'are', 'hosts', 'to', 'parasites', 'such', 'as', 'copepods', '(', 'Pandarus sinuatus', 'and', 'Pandarus smithii', ')', '.']
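To make the step from a window of terms to a label such as "Parasite/host" concrete, here is a minimal sketch that takes the window around a name and counts cue words for each interaction type. This keyword lookup is a stand-in chosen for brevity, not the classifier the talk goes on to describe. The `CUES` table and both helper functions are hypothetical.

```python
# Hypothetical cue words per interaction type; a real list would be curated.
CUES = {
    "parasite/host": {"host", "hosts", "parasite", "parasites", "infects"},
    "predator/prey": {"prey", "predator", "predators", "eats", "hunts"},
}

def context_window(tokens, index, size=10):
    """Return up to `size` tokens before tokens[index] and `size` from it onward."""
    start = max(0, index - size)  # a negative slice start would wrap around
    return tokens[start:index + size]

def guess_interaction(window, cues=CUES):
    """Count cue words per interaction type; return the best type, or None."""
    words = {w.lower() for w in window}
    scores = {label: len(words & vocab) for label, vocab in cues.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

tokens = ["Great", "white", "sharks", "are", "hosts", "to", "parasites", "such",
          "as", "copepods", "(", "Pandarus sinuatus", "and", "Pandarus smithii",
          ")", "."]
window = context_window(tokens, tokens.index("Pandarus sinuatus"))
label = guess_interaction(window)
print(window)
print(label)
```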
  21. Training Data
      • Show the algorithm what "parasite/host" words look like
      • Compare to an unknown
      • We want "Document Classification"
      • Brown, Reuters and Movie Review
      • We need to make our own corpus
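The idea of showing the algorithm labeled examples and then comparing an unknown text to them can be sketched with a small naive Bayes classifier. The slides do not say which classifier is used, so this is an assumption: a hand-rolled multinomial naive Bayes with add-one smoothing, written with the standard library only so it runs without NLTK. The training sentences are invented toy data, and `train` and `classify` are hypothetical names.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Collect per-label word counts and document counts from (tokens, label) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for tokens, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(t.lower() for t in tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab

def classify(tokens, model):
    """Return the label with the highest log-probability under naive Bayes."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for t in tokens:
            t = t.lower()
            if t in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][t] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy training data, for illustration only.
training = [
    ("sharks are hosts to parasites such as copepods".split(), "parasitism"),
    ("ticks feed on the blood of their host".split(), "parasitism"),
    ("lions hunt and eat zebra and other prey".split(), "predation"),
    ("sharks are apex predators that eat seals".split(), "predation"),
]
model = train(training)
unknown = "copepods are parasites of their host fish".split()
print(classify(unknown, model))
```

With only four training sentences the result is illustrative at best, which is the slide's point: a usable classifier needs a purpose-built corpus.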
  22. Creating a Categorized Text Corpus
      • http://www.packtpub.com/article/python-text-processing-nltk-20-creating-custom-corpora
      • Inside the "corpus" folder, create a new folder for your corpus. Mine is "eco".
      • Build your corpus (start with EOL text)
      • Make a category specification
      • Let's start with parasitism and predation
  23. Creating a Categorized Text Corpus
      • eco
          – lion1
          – lion2
          – lion3
          – shark1
          – shark2
          – shark3
          – …
          – cats.txt
      • in cats.txt
          lion1.txt predation
          lion2.txt parasitism
          …
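The folder layout and category file described above can be generated programmatically. The sketch below is a minimal illustration: it writes text files and a `cats.txt` into a temporary "eco" directory and reads the category file back. It assumes the one-line-per-file format shown on the slide (file name, a space, then the category). `build_corpus`, `read_categories` and the placeholder texts are hypothetical.

```python
import os
import tempfile

def build_corpus(root, documents):
    """Write one text file per document plus a cats.txt category file."""
    os.makedirs(root, exist_ok=True)
    lines = []
    for fileid, (category, text) in sorted(documents.items()):
        with open(os.path.join(root, fileid), "w", encoding="utf-8") as handle:
            handle.write(text)
        lines.append(f"{fileid} {category}")
    with open(os.path.join(root, "cats.txt"), "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")

def read_categories(root):
    """Parse cats.txt into a {fileid: [categories]} mapping."""
    mapping = {}
    with open(os.path.join(root, "cats.txt"), encoding="utf-8") as handle:
        for line in handle:
            parts = line.split()
            if parts:  # skip blank lines
                mapping[parts[0]] = parts[1:]
    return mapping

# Placeholder texts; real files would hold text taken from EOL.
documents = {
    "lion1.txt": ("predation", "placeholder text about lions"),
    "lion2.txt": ("parasitism", "placeholder text about lion parasites"),
}
root = os.path.join(tempfile.mkdtemp(), "eco")
build_corpus(root, documents)
categories = read_categories(root)
print(sorted(os.listdir(root)))
print(categories)
```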
  24. from nltk.corpus.reader import CategorizedPlaintextCorpusReader
      corpus_root = '/Users/athessen/nltk_data/corpora/eco'
      reader = CategorizedPlaintextCorpusReader(corpus_root, r'(lion|shark)\d*\.txt', cat_file='cats.txt')
  25. from nltk.corpus.reader import CategorizedPlaintextCorpusReader
      corpus_root = '/Users/athessen/nltk_data/corpora/eco'
      reader = CategorizedPlaintextCorpusReader(corpus_root, r'(lion|shark)\d*\.txt', cat_file='cats.txt')
      Choose a Corpus Reader
  26. from nltk.corpus.reader import CategorizedPlaintextCorpusReader
      corpus_root = '/Users/athessen/nltk_data/corpora/eco'
      reader = CategorizedPlaintextCorpusReader(corpus_root, r'(lion|shark)\d*\.txt', cat_file='cats.txt')
      Choose a Corpus Reader
      You have to tell this Corpus Reader:
      • Corpus root directory
      • File names (aka fileids)
      • Category specification
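The file-name argument is a regular expression, so it is worth checking which names a pattern selects before handing it to the reader. The sketch below tests a grouped pattern against some hypothetical file names using `re.fullmatch`. How NLTK applies the pattern internally is not shown here, and the file names are invented for illustration.

```python
import re

# Hypothetical directory listing for the "eco" corpus folder.
fileids = ["lion1.txt", "lion22.txt", "shark3.txt", "cats.txt", "sharks.txt"]

# The group limits the alternation to the prefix, \d* allows any number of
# digits, and \. matches a literal dot rather than any character.
pattern = re.compile(r"(lion|shark)\d*\.txt")

selected = [f for f in fileids if pattern.fullmatch(f)]
print(selected)
```

Without the parentheses, the alternation would split the whole pattern into "lion" or "shark followed by digits and .txt", which is rarely what is intended.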
  27. Next Steps
      • Build corpus
      • Build Feature Extractor
      • Train Classifier
  28. Build Feature Extractor
  29. Train Classifier
  30. Error Checking