Knowledge extraction from
the Encyclopedia of Life
Using Python NLTK
Anne Thessen
annethessen@gmail.com
Finding Taxonomic Names
Challenges
Eastern Lowland Gorilla
Gorilla berengei
Gorilla beringei mikenensis
Gorilla gorilla
Gorilla beringei Matschie
King kong
ゴリラ
Gorille
大猩猩
Virunga
Горилла
Gorilla graueri
Koko
Mountain gorilla
Guerilla
Gorila
Challenges
Disambiguate by authority, species, contextual data

Aotus nancymaae
Contextual data: Primate, Monkey, Eyes, Food, Panama

Aotus mollis
Contextual data: Legume, Plant, Flower, Mirbelieae, Australia
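The homonym problem above (the genus name Aotus is used for both a monkey and a legume) can be sketched as a word-overlap score against a context profile for each candidate. This is only an illustration: the `PROFILES` table and the `disambiguate` function are hypothetical names, and the profile words are simply the contextual terms listed on the slide.

```python
# Hypothetical context profiles, built from the contextual terms on the slide.
PROFILES = {
    "Aotus nancymaae": {"primate", "monkey", "eyes", "food", "panama"},
    "Aotus mollis": {"legume", "plant", "flower", "mirbelieae", "australia"},
}

def disambiguate(text, profiles=PROFILES):
    """Return the candidate whose profile shares the most words with text,
    or None when no profile word occurs in the text at all."""
    words = {w.strip(".,;:()").lower() for w in text.split()}
    scores = {name: len(words & cues) for name, cues in profiles.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

monkey = disambiguate("Aotus is a monkey found in Panama with large eyes.")
legume = disambiguate("Aotus is a plant genus native to Australia.")
print(monkey, "|", legume)
```

A plain overlap count is the simplest possible score; authority strings and species epithets, when present, are stronger evidence and would be checked first.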
Beautiful Soup

GNRD

Resolver
• Common names
• Interaction type
Python NLTK
• http://nltk.org/book/
• http://nltk.org/
• Install NLTK and NLTK Data
• Natural Language Processing (NLP)
• Semantic Statistics
Robin is the name of several fictional
characters appearing in comic
books published by DC Comics,
originally created by Bob Kane, Bill
Finger and Jerry Robinson, as a junior
counterpart to DC
Comics superhero Batman. The team of
Batman and Robin is commonly
referred to as the Dynamic Duo or
the Caped Crusaders.

The American Robin is active mostly
during the day and assembles in large
flocks at night. It is one of the earliest
bird species to lay eggs, beginning to
breed shortly after returning to its
summer range from its winter range. Its
nest consists of long coarse grass,
twigs, paper, and feathers, and is
smeared with mud and often
cushioned with grass or other soft
materials. It is among the first birds to
sing at dawn.
Robin (DC Comics):
• fictional
• comic books
• Bob Kane
• superhero
• Batman
• Dynamic Duo
• Caped Crusaders

American Robin:
• flocks
• bird
• eggs
• nest
• sing
• species
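One rough way to arrive at term lists like these is to count the content words of each passage and keep the ones that never occur in the other passage. The sketch below assumes that approach; the stopword list is a small hand-picked one, the two passages are shortened from the slide text, and `content_words` and `distinguishing_terms` are hypothetical helper names.

```python
import re
from collections import Counter

# A tiny hand-picked stopword list; a real run would use a fuller one.
STOPWORDS = {"the", "is", "of", "to", "and", "in", "a", "as", "by", "or",
             "its", "it", "at", "with", "from", "one", "on"}

def content_words(text):
    """Count lowercase words, leaving out stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def distinguishing_terms(text, other):
    """Words that occur in text but never in other, most frequent first."""
    mine, theirs = content_words(text), content_words(other)
    return [w for w, _ in mine.most_common() if w not in theirs]

comic = ("Robin is the name of several fictional characters appearing in "
         "comic books published by DC Comics, as a junior counterpart to "
         "superhero Batman.")
bird = ("The American Robin is active mostly during the day and assembles "
        "in large flocks at night. It is one of the earliest bird species "
        "to lay eggs. Its nest consists of long coarse grass.")

comic_terms = distinguishing_terms(comic, bird)
bird_terms = distinguishing_terms(bird, comic)
print(comic_terms)
print(bird_terms)
```

The shared word "robin" drops out of both lists, which is the point: the ambiguous name itself carries no signal, and only the surrounding vocabulary does.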
Beautiful Soup

GNRD

Resolver
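The first stage of this pipeline turns an HTML page into plain text that can be tokenized. The sketch below stands in for that stage using only the standard library's `html.parser`, so that it runs without extra packages; it is not Beautiful Soup's API. The `TextExtractor` class, the `html_to_text` function and the sample page are all made up for illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth of enclosing script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    """Return the text chunks of an HTML string joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

page = ("<html><head><style>p {color: red}</style></head><body>"
        "<p>Great white sharks are hosts to <i>Pandarus sinuatus</i>.</p>"
        "</body></html>")
text = html_to_text(page)
tokens = text.split()
print(tokens)
```

The plain text would then go to the name finder (GNRD in these slides), and the names it returns feed the token-matching code that follows.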
From GNRD
name_list = ["Pandarus sinuatus", "Pandarus smithii"]
genera = []
for name in name_list:
    row = name.split(" ")
    genera.append(row[0])
# genera is now ["Pandarus", "Pandarus"]
# Locate each genus in the token list, searching onward from the previous hit.
i = -1
genus_index_list = []
for genus in genera:
    genus_text = tokens[i+1:]
    # index() is relative to the slice, so shift it back to a position in tokens
    genus_index = genus_text.index(genus) + i + 1
    genus_index_list.append(genus_index)
    i = genus_index
# genus_index_list is now [36, 39]
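name_index_list = []
merged = 0  # each merge shortens tokens by one, shifting later indices left
for counter, index in enumerate(genus_index_list):
    index -= merged
    # Join the genus to the word immediately following.
    species = " ".join(tokens[index:index+2])
    # Does this match the name_list?
    if species == name_list[counter]:
        # If yes, combine the two into one element
        tokens[index:index+2] = [species]
        name_index_list.append(index)
        merged += 1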
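The two steps above (find each genus, then fuse genus and epithet into one token) can also be folded into a single function. The version below is a sketch under one assumption: the names arrive in the same order in which they occur in the text. `merge_names` is a hypothetical helper name, and the token list is a shortened form of the shark sentence.

```python
def merge_names(tokens, name_list):
    """Return (new_tokens, name_indices) with each multi-word name from
    name_list fused into a single token, searching left to right."""
    tokens = list(tokens)  # work on a copy so the caller's list is untouched
    name_indices = []
    start = 0
    for name in name_list:
        parts = name.split(" ")
        for i in range(start, len(tokens) - len(parts) + 1):
            if tokens[i:i + len(parts)] == parts:
                tokens[i:i + len(parts)] = [name]  # fuse the parts in place
                name_indices.append(i)
                start = i + 1  # resume the search after this name
                break
    return tokens, name_indices

raw = ("Great white sharks are hosts to parasites such as copepods "
       "( Pandarus sinuatus and Pandarus smithii ) .").split()
merged_tokens, name_indices = merge_names(
    raw, ["Pandarus sinuatus", "Pandarus smithii"])
print(name_indices)
print(merged_tokens[11], "|", merged_tokens[13])
```

Because the indices are recorded after each merge, they already point into the shortened token list and need no later adjustment. Comparing whole word sequences also lets the same code handle three-part names.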
tokens = ["Great", "white", "sharks", "are", "apex", "predators", ",",
          "meaning", "they", "have", "a", "large", "effect", "on", "the",
          "populations", "of", "their", "prey", "including", "elephant",
          "seals", "and", "sea", "lions.", "Great", "white", "sharks",
          "are", "hosts", "to", "parasites", "such", "as", "copepods", "(",
          "Pandarus sinuatus", "and", "Pandarus smithii", ")", "."]
name_index_list = [36, 38]
Looking at the first relationship:

Parasite/host
Carcharodon carcharias

Pandarus sinuatus

term_lists = []
for name_index in name_index_list:
    # Keep up to 10 tokens on either side of each name
    term_lists.append(tokens[max(0, name_index-10):name_index+10])
term_list = term_lists[0]
# term_list for the first name:
# ["white", "sharks", "are", "hosts", "to", "parasites", "such", "as",
#  "copepods", "(", "Pandarus sinuatus", "and", "Pandarus smithii", ")", "."]
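One detail of the window slice is easy to miss: when a name sits fewer than ten tokens from the start of the text, `name_index-10` goes negative, Python counts that position from the end of the list, and the slice can come back empty. The sketch below shows the effect on a made-up token list and a clamped alternative; `context_window` is a hypothetical helper name.

```python
def context_window(tokens, index, width=10):
    """Return up to `width` tokens before `index`, the token at `index`,
    and up to `width - 1` tokens after it."""
    start = max(0, index - width)  # clamp so the start never goes negative
    return tokens[start:index + width]

tokens = [f"w{i}" for i in range(30)]  # placeholder tokens w0 .. w29
naive = tokens[3 - 10:3 + 10]          # a start of -7 counts from the end
safe = context_window(tokens, 3)
print(len(naive), len(safe))
```

The upper bound needs no such guard, because a slice that runs past the end of a list simply stops at the last element.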
Training Data
• Show the algorithm what “parasite/host”
words look like
• Compare to an unknown
• We want “Document Classification”
• NLTK ships categorized corpora such as Brown, Reuters and Movie Reviews
• We need to make our own corpus
Creating a Categorized Text Corpus
• http://www.packtpub.com/article/python-text-processing-nltk-20-creating-custom-corpora
• Inside the “corpora” folder, create a new folder for your corpus. Mine is “eco”.
• Build your corpus (start with EOL text)
• Make a category specification
• Let's start with parasitism and predation
Creating a Categorized Text Corpus
• eco
  – lion1
  – lion2
  – lion3
  – shark1
  – shark2
  – shark3
  – …
  – cats.txt
• in cats.txt
lion1.txt predation
lion2.txt parasitism
…
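The layout above can be produced with a few lines of standard-library code. The sketch below writes one text file per document plus a `cats.txt` in which each line holds a file name followed by its category, then reads that file back to check it. The helper names, the temporary directory and the two sample sentences are made up for illustration; a real corpus would start from EOL text.

```python
import tempfile
from pathlib import Path

def build_corpus(root, documents):
    """Write one .txt file per document plus a cats.txt category file.
    `documents` maps a file stem to a (category, text) pair."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    lines = []
    for stem, (category, text) in sorted(documents.items()):
        (root / f"{stem}.txt").write_text(text, encoding="utf-8")
        lines.append(f"{stem}.txt {category}")
    (root / "cats.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return root

def read_categories(cat_file):
    """Parse 'fileid category [category ...]' lines into a dict."""
    mapping = {}
    for line in Path(cat_file).read_text(encoding="utf-8").splitlines():
        if line.strip():
            fileid, *categories = line.split()
            mapping[fileid] = categories
    return mapping

# Placeholder documents, invented for this example.
docs = {
    "lion1": ("predation", "Lions hunt zebra and wildebeest."),
    "lion2": ("parasitism", "Lions are hosts to ticks and tapeworms."),
}
with tempfile.TemporaryDirectory() as tmp:
    root = build_corpus(Path(tmp) / "eco", docs)
    categories = read_categories(root / "cats.txt")
    files = sorted(p.name for p in root.iterdir())
print(categories)
```

Generating `cats.txt` from the same table that drives the file writes keeps the category file and the documents from drifting apart as the corpus grows.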
from nltk.corpus.reader import CategorizedPlaintextCorpusReader
corpus_root = '/Users/athessen/nltk_data/corpora/eco'
reader = CategorizedPlaintextCorpusReader(corpus_root, r'(lion|shark)\d*\.txt', cat_file='cats.txt')

Choose a Corpus Reader
You have to tell this Corpus Reader:
• Corpus root directory
• File names (aka fileids)
• Category specification
Next Steps
• Build corpus
• Build Feature Extractor
• Train Classifier
Build Feature Extractor
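A common feature extractor for document classification is a bag of words: for each word in a fixed vocabulary, record whether the document contains it. The sketch below assumes that approach, which is not necessarily the one used in the talk; `build_vocabulary`, `document_features` and the sample word lists are hypothetical.

```python
from collections import Counter

def build_vocabulary(documents, size=50):
    """Return the `size` most frequent lowercase words across all documents.
    Each document is a list of word tokens."""
    counts = Counter(w.lower() for words in documents for w in words)
    return [w for w, _ in counts.most_common(size)]

def document_features(words, vocabulary):
    """Map each vocabulary word to True or False for one document."""
    present = {w.lower() for w in words}
    return {f"contains({w})": (w in present) for w in vocabulary}

docs = [["Sharks", "eat", "seals"], ["Ticks", "infest", "seals"]]
vocabulary = build_vocabulary(docs)
feats = document_features(["Sharks", "eat", "fish"], vocabulary)
print(feats)
```

Every document yields a dictionary with the same keys, which is the input shape NLTK classifiers expect. Words outside the vocabulary, such as "fish" here, are ignored.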
Train Classifier
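With feature dictionaries in hand, training can be sketched with NLTK's `NaiveBayesClassifier`. The four training sentences below are invented placeholders rather than EOL text, the `features` function is a hypothetical stand-in for a real feature extractor, and Naive Bayes is only one of the classifiers NLTK provides; the slides do not say which one was used.

```python
import nltk

# Invented training sentences, one label each; real data would be EOL text.
train_docs = [
    ("lions hunt and kill zebra prey", "predation"),
    ("sharks are predators that eat seals", "predation"),
    ("copepods are parasites that infect the host", "parasitism"),
    ("ticks are parasites feeding on host blood", "parasitism"),
]
vocabulary = sorted({w for text, _ in train_docs for w in text.split()})

def features(text):
    """Bag-of-words features: is each vocabulary word in the text?"""
    words = set(text.lower().split())
    return {f"contains({w})": (w in words) for w in vocabulary}

train_set = [(features(text), label) for text, label in train_docs]
classifier = nltk.NaiveBayesClassifier.train(train_set)

unknown = "sharks are hosts to parasites such as copepods"
label = classifier.classify(features(unknown))
print(label)
```

With this toy training set the unknown sentence is labeled parasitism. For error checking, `nltk.classify.accuracy(classifier, test_set)` scores the classifier against a held-out list of labeled feature sets.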
Error Checking
