A workshop on bioinformatics programming using Biopython and the Python programming language, held at the University of Georgia in Spring 2010 and 2012. These workshops are part of a series for the Institute of Bioinformatics (IoB) and Bioinformatics Grad Student Association (BIGSA) at UGA.
1. Sequences and alignments
NCBI EUtils and BLAST
Phylogenetics
Protein structures
IOB Workshop: Biopython
A programming toolkit for bioinformatics
Eric Talevich
Institute of Bioinformatics, University of Georgia
Mar. 29, 2012
Eric Talevich IOB Workshop: Biopython
Getting started
with
Installing Python
Biopython is a library for the Python programming language.
First, you’ll need these installed:
Python 2.7 from http://python.org. It may already be
installed on your computer. (Version 2.6 is OK, too.)
IDLE, a simple Integrated DeveLopment Environment.
Usually bundled with the Python distribution.
Now, start an interactive session in IDLE. 1
1 On your own, check out IPython (http://ipython.scipy.org/). It’s an
enhanced Python interpreter that feels somewhat like R.
Installing Python packages
Biopython is a Python package. There are a few standard ways to
install Python packages:
From source: Download from PyPI 2, unpack and install with the
included setup.py script.
easy_install: Install setuptools from source 3, then use the easy_install
command to fetch and install all other packages by name:
$ easy_install <package name>
pip: Like easy_install, use pip 4 to manage packages:
$ pip install <package name>
2 http://pypi.python.org/pypi/
3 http://pypi.python.org/pypi/setuptools
4 http://pypi.python.org/pypi/pip
Installing NumPy, matplotlib and Biopython
Biopython relies on a few other Python packages for extra
functionality. We’ll use these:
numpy — efficient numerical functions and data structures
(for Bio.PDB)
matplotlib — plotting (for Bio.Phylo)
Then finally:
biopython — the reason we’re here today
(Biopython, NumPy, matplotlib, setuptools and pip are also packaged for
many Linux distributions.)
Testing
Check your Biopython installation:
>>> import Bio
>>> print Bio.__version__
Import a NumPy-based component:
>>> from Bio import PDB
Show a simple plot:
>>> from matplotlib import pyplot
>>> pyplot.plot(range(5), range(5))
>>> pyplot.show()
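The three checks above can also be scripted in one go. This is only a sketch, not part of the workshop material: `check_packages` is a hypothetical helper name, and it assumes each package exposes a `__version__` attribute (it reports "unknown" when one is missing). It is written to run on Python 3 as well as the Python 2.7 used in these slides.

```python
import importlib


def check_packages(names):
    """Map each import name to its version string, or None if it cannot be imported."""
    status = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            status[name] = None
        else:
            status[name] = getattr(module, "__version__", "unknown")
    return status


if __name__ == "__main__":
    # "Bio" is the import name of the Biopython package
    report = check_packages(["Bio", "numpy", "matplotlib"])
    for name in sorted(report):
        print("%-12s %s" % (name, report[name] or "NOT INSTALLED"))
```

A package reported as NOT INSTALLED can then be added with one of the install methods described earlier.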
Let’s start using
Biopython
1 Sequences and alignments
The Seq object
SeqIO and the SeqRecord object
2 NCBI EUtils and BLAST
EUtils: Entrez Programming Utilities
NCBI Blast
External programs
3 Phylogenetics
4 Protein structures
Sequences and Alignments
The Seq object
>>> from Bio.Seq import Seq
>>> myseq = Seq('AGTACACTGGT')
>>> myseq
Seq('AGTACACTGGT', Alphabet())
>>> print myseq
AGTACACTGGT
>>> myseq.transcribe()
Seq('AGUACACUGGU', RNAAlphabet())
>>> myseq.translate()
Seq('STL', ExtendedIUPACProtein())
A Seq object consists of:
data — the underlying Python character string
alphabet — DNA, RNA, protein, etc.
It supports most Python string methods:
>>> myseq.count('GT')
2
And some biology-specific methods, too:
>>> myseq.reverse_complement()
Seq('ACCAGTGTACT', Alphabet())
Intrigued? Read on:
>>> help(Seq)
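To see what the biology-specific methods compute, here is a plain-Python sketch that works on ordinary strings. It is not how Biopython implements them; the function names simply mirror the Seq methods, and it assumes an upper-case sequence containing only A, C, G and T.

```python
# Watson-Crick pairing for unambiguous DNA letters only (an assumption of this sketch)
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(dna):
    """Complement every base and read the result in the opposite direction."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def transcribe(dna):
    """Rewrite a coding-strand DNA string as RNA by replacing T with U."""
    return dna.replace("T", "U")


print(reverse_complement("AGTACACTGGT"))  # ACCAGTGTACT, as in the session above
print(transcribe("AGTACACTGGT"))          # AGUACACUGGU
```

The Seq object adds the alphabet bookkeeping and support for ambiguous letters on top of this basic idea.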
SeqIO: Sequence Input/Output
Sequence data is stored in many different file formats.
Bio.SeqIO supports:
abi fastq phylip swiss
ace genbank pir tab
clustal ig qual uniprot-xml
embl imgt seqxml
emboss nexus sff
fasta phd stockholm
Manually fetch some data from the PDB website: 5
1ATP.fasta — two protein sequences, FASTA format
1ATP.pdb — the 3D structure, for later
5 http://www.rcsb.org/pdb/explore/explore.do?structureId=1ATP
The SeqIO API
SeqIO provides four functions:
parse: Iteratively parse all elements in the file
read: Parse a one-element file and return the element
write: Write elements to a file
convert: Parse one format and immediately write another
Biopython uses the same I/O conventions for alignments
(AlignIO), BLAST results (Blast), and phylogenetic trees
(Phylo).
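The difference between parse and read is easiest to see in miniature. Below is a toy FASTA reader in plain Python. It is not how Bio.SeqIO is implemented, and `parse_fasta` and `read_fasta` are hypothetical names. It assumes, as I understand SeqIO.read to behave, that a file without exactly one record is treated as an error.

```python
import io


def parse_fasta(handle):
    """Yield (header, sequence) pairs lazily, one record at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_fasta(handle):
    """Return the single record in handle; complain if there is not exactly one."""
    records = list(parse_fasta(handle))
    if len(records) != 1:
        raise ValueError("expected exactly one record, found %d" % len(records))
    return records[0]


# Made-up two-record input, for illustration only
example = io.StringIO(u">seq1 first\nACGT\nAC\n>seq2\nGG\n")
for header, sequence in parse_fasta(example):
    print(header, len(sequence))
```

Because the parser is a generator, a large file is never held in memory all at once, which is the same reason SeqIO.parse hands back an iterator rather than a list.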
The SeqRecord object
SeqIO.parse returns SeqRecords.
SeqRecord wraps a Seq object and attaches metadata.
1 Pass the file name to the SeqIO parser; specify FASTA format:
from Bio import SeqIO
seqrecs = SeqIO.parse("1ATP.fasta", "fasta")
print seqrecs
2 To see all records at once, convert the iterator to a list:
allrecs = list(seqrecs)
print allrecs[0]
print allrecs[0].seq
Example: Shuffled sequences
Given a real DNA sequence, create a “background” set of
randomized sequences with the same composition.
Procedure:
1 Read the source sequence from a file
– Use Bio.SeqIO
2 In a loop:
Shuffle the sequence
– Use random.shuffle from Python’s standard library
Create a new SeqRecord from the shuffled sequence
– Because SeqIO.write works with SeqRecords
3 Write the shuffled SeqRecords to another file
import random
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
orig_rec = SeqIO.read("gi2.gb", "genbank")
alphabet = orig_rec.seq.alphabet
out_recs = []
for i in xrange(1, 31):
    nucleotides = list(orig_rec.seq)
    random.shuffle(nucleotides)
    new_seq = Seq("".join(nucleotides), alphabet)
    new_rec = SeqRecord(new_seq,
                        id="shuffle" + str(i))
    out_recs.append(new_rec)
SeqIO.write(out_recs, "gi2_shuffled.fasta", "fasta")
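A quick sanity check on this approach is that shuffling only reorders letters, so every background sequence keeps the composition of the original. The sketch below shows this with plain strings and the standard library. `shuffled_copies` is a hypothetical helper, the input string is just the toy sequence from the Seq slides, and the fixed seed is there only to make the run repeatable.

```python
import random
from collections import Counter


def shuffled_copies(sequence, n_copies, seed=None):
    """Return n_copies shuffled versions of sequence (a plain string)."""
    rng = random.Random(seed)
    copies = []
    for _ in range(n_copies):
        letters = list(sequence)  # fresh copy, so the original is never modified
        rng.shuffle(letters)
        copies.append("".join(letters))
    return copies


original = "AGTACACTGGT"
background = shuffled_copies(original, 30, seed=0)

# Every shuffled copy has exactly the same letter counts as the original
assert all(Counter(copy) == Counter(original) for copy in background)
print(Counter(original))
```

Note that this preserves single-letter composition only; dinucleotide frequencies are not preserved by a simple shuffle.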
Example: ORF translation
Split a set of unannotated DNA sequences into unique
ORFs, translating in all 6 frames.
Biopython can help with each piece of this problem:
1 Parse the given unannotated DNA sequences (SeqIO.parse)
2 Get the template strand’s sequence (Seq.reverse_complement)
3 Translate both strands into protein sequences (Seq.translate)
4 Shift each strand by +1 and +2 for alternate reading frames
(string-like Seq slicing)
5 Split sequences at stop codons (Seq.split('*'))
6 Write translated sequences to a new file (SeqIO.write)
def translate_six_frames(seq, table=1):
    """Translate a nucleotide sequence in 6 frames.

    Returns an iterable of 6 translated protein
    sequences.
    """
    rev = seq.reverse_complement()
    for i in range(3):
        # Coding (Crick) strand
        yield seq[i:].translate(table)
        # Template (Watson) strand
        yield rev[i:].translate(table)
def translate_orfs(sequences, min_prot_len=60):
    """Find and translate all ORFs in sequences.

    Translates each sequence in all 6 reading frames,
    splits sequences on stop codons, and produces an
    iterable of all protein sequences of length at
    least min_prot_len.
    """
    for seq in sequences:
        for frame in translate_six_frames(seq):
            for prot in frame.split("*"):
                if len(prot) >= min_prot_len:
                    yield prot
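To follow the slicing and splitting on a toy input without Biopython installed, here is a plain-string sketch of the same idea. Everything in it is illustrative: the function names are hypothetical, the codon table is the standard genetic code written in the compact T, C, A, G ordering, trailing partial codons are silently dropped (an assumption of this sketch), and input must be upper-case A, C, G, T only.

```python
BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order (TTT, TTC, TTA, ...)
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(dna):
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def translate(dna):
    """Translate complete codons only; '*' marks a stop codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def six_frames(dna):
    """Yield the 3 forward and 3 reverse-strand translations, interleaved."""
    rev = reverse_complement(dna)
    for offset in range(3):
        yield translate(dna[offset:])
        yield translate(rev[offset:])


def find_orfs(dna, min_len):
    """Yield stop-free protein pieces of at least min_len residues."""
    for frame in six_frames(dna):
        for piece in frame.split("*"):
            if len(piece) >= min_len:
                yield piece


toy = "ATGAAATAGATGCCC"
print(next(six_frames(toy)))  # first forward frame: MK*MP
print(list(find_orfs(toy, 2)))
```

As in the functions above, an "ORF" here is simply any stretch between stop codons; no start codon is required.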
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

if __name__ == "__main__":
    import sys
    infile = sys.stdin
    outfile = sys.stdout
    records = SeqIO.parse(infile, "fasta")
    seqs = (rec.seq for rec in records)
    proteins = translate_orfs(seqs)
    seqrecs = (SeqRecord(seq, id="orf" + str(i))
               for i, seq in enumerate(proteins))
    SeqIO.write(seqrecs, outfile, "fasta")
AlignIO and the Alignment object
Alignment: a set of sequences with the same length and alphabet.
Use AlignIO just like SeqIO:
>>> from Bio import AlignIO
>>> aln = AlignIO.read("PF01601.sto", "stockholm")
>>> print aln
SingleLetterAlphabet() alignment with 22 rows and 730 columns
NCTDAV-----LTYSSFGVCADGSIIA-VQPRNV-----SYDSV...HIQ Q1HVL3 CVH22/539-1170
NCTTAV-----MTYSNFGICADGSLIP-VRPRNS-----SDNGI...HVQ SPIKE CVHNL/723-1356
NCTEPV-----LVYSNIGVCKSGSIGY-VPSQS------GQVKI...HVQ Q692M1 9CORO/740-1383
NCTEPA-----LVYSNIGVCKNGAIGL-VGIRN------TQPKI...HIQ Q0Q4F4 9CORO/729-1360
NCTSPR-----LVYSNIGVCTSGAIGL-LSPKX------AQPQI...HVQ Q0Q4F6 9CORO/743-1371
NCTNPV-----LTYSSYGVCPDGSITR-LGLTD------VQPHF...--T A4ULL0 9CORO/726-1328
NCTKPV-----LSYGPISVCSDGAIAG-TSTLQN-----TRPSI...KEW A6N263 9CORO/406-1035
ECDIPIGAGICASYHTVSLLRSTSQKSIVAYTMS------LGAD...HYT Q6T7X8 CVHSA/647-1255
...
DCE-PV-----ITYSNIGVCKNGAFVF-INVTH------SDGDV...HVH Q0PKZ5 CVPPU/797-1449
Snack Time
EUtils and BLAST
EUtils: Entrez Programming Utilities
Access NCBI’s online services:
from Bio import Entrez, SeqIO
Entrez.email = "you@uga.edu"
Request a GenBank record:
handle = Entrez.efetch(db="protein", id="69316",
rettype="gb", retmode="text")
record = SeqIO.read(handle, "gb")
Specify multiple IDs in one query:
handle = Entrez.efetch(db="protein",
id="349839,349840",
rettype="fasta", retmode="text")
records = SeqIO.parse(handle, "fasta")
Interlude: SeqRecord attributes
seq: the sequence (Seq) itself
id: primary ID for the sequence, e.g. accession number
(string)
name: “common” name/id for the sequence, like GenBank
LOCUS id
description: human-readable description of the sequence
letter_annotations: restricted dictionary of additional info about
individual letters in the sequence, e.g. quality scores
annotations: dictionary of additional unstructured info
features: list of SeqFeature objects with more structured
information — e.g. position of genes on a genome,
domains on a protein sequence.
dbxrefs: list of database cross-references (strings)
from Bio import Entrez, SeqIO
Entrez.email = "me@uga.edu"
handle = Entrez.efetch(db="nucleotide", id="M95169",
                       rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()
print record
print record.features[10]
sliced = record[20000:]  # Last ~25% of the genome
print sliced
from Bio.Seq import Seq
from Bio.Alphabet import generic_protein
translations = [f.qualifiers["translation"]
                for f in record.features[1:]]
proteins = [Seq(t[0], generic_protein)
            for t in translations]
NCBI Blast
BLAST can be used either standalone or through NCBI’s server.
Online:
>>> from Bio.Blast import NCBIWWW
>>> result_handle = NCBIWWW.qblast(
...     'blastp', 'nr', query_string)
Standalone: “Legacy” (blastall):
>>> from Bio.Blast.Applications import BlastallCommandline
>>> help(BlastallCommandline)
New hotness (Blast+):
>>> from Bio.Blast.Applications import NcbiblastpCommandline
>>> help(NcbiblastpCommandline)
Parsing BLAST output
BLAST produces reports in plain-text and XML format.
Biopython requests XML by default.
>>> from Bio.Blast import NCBIWWW, NCBIXML
>>> result_handle = NCBIWWW.qblast('blastp',
...     'nr', query_string)
>>> blast_record = NCBIXML.read(result_handle)
>>> print blast_record
# Search for homologs of a protein sequence
from Bio import SeqIO
from Bio.Blast import NCBIWWW, NCBIXML
# Read and reformat the query sequence
seqrec = SeqIO.read('gi2.gb', 'gb')
query = seqrec.format('fasta')
# Submit an online BLAST query
# (This takes some time to run)
result_handle = NCBIWWW.qblast('blastx', 'nr', query)
# 1. Save the BLAST results as an XML file
with open('aprotinin_blast.xml', 'w') as save_file:
    save_file.write(result_handle.read())
result_handle.close()
# NB: The BLAST result handle can only be read once
# Reload it from the file
with open('aprotinin_blast.xml') as result_handle:
    blast_record = NCBIXML.read(result_handle)
# 2. Display a histogram of BLAST hit scores
def get_scores(alignments):
    for aln in alignments:
        for hsp in aln.hsps:
            yield hsp.score
scores = list(get_scores(blast_record.alignments))
# Draw the histogram
import pylab
pylab.hist(scores, bins=20)
pylab.title("Scores of %d BLAST hits" % len(scores))
pylab.xlabel("BLAST score")
pylab.ylabel("# hits")
# Save a copy for later; do this before show(), because the
# figure may be discarded once the plot window is closed
pylab.savefig('aprotinin_scores.png')
pylab.show()
Figure: Histogram of BLAST scores generated by pylab
# 3. Extract the sequences of high-scoring BLAST hits
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
def get_hsps(alignments, threshold):
    for aln in alignments:
        for hsp in aln.hsps:
            if hsp.score >= threshold:
                yield SeqRecord(Seq(hsp.sbjct),
                                id=aln.accession)
                # Keep only the first qualifying HSP per hit
                break
best_seqs = get_hsps(blast_record.alignments, 321)
SeqIO.write(best_seqs, 'aprotinin.fasta', 'fasta')
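The role of the break statement in this filter is easy to miss: it stops after the first HSP that passes the threshold, so each hit contributes at most one sequence. The sketch below reproduces only that control flow, using namedtuples as hypothetical stand-ins for the parsed BLAST objects. The accessions, scores and sequences are made up for illustration.

```python
from collections import namedtuple

# Hypothetical stand-ins; only the attributes used by the filter are modelled
Hsp = namedtuple("Hsp", ["score", "sbjct"])
Alignment = namedtuple("Alignment", ["accession", "hsps"])


def first_hsp_above(alignments, threshold):
    """Yield (accession, sequence) for the first HSP per hit scoring >= threshold."""
    for aln in alignments:
        for hsp in aln.hsps:
            if hsp.score >= threshold:
                yield aln.accession, hsp.sbjct
                break  # at most one sequence per hit


# Toy data, not real BLAST output
hits = [
    Alignment("HIT1", [Hsp(400.0, "MKAAL"), Hsp(350.0, "MKGGL")]),
    Alignment("HIT2", [Hsp(100.0, "MRTTL")]),
    Alignment("HIT3", [Hsp(300.0, "MKSSL"), Hsp(330.0, "MKPPL")]),
]
selected = list(first_hsp_above(hits, 321))
print(selected)
```

HIT1 has two HSPs above the threshold but appears once, HIT2 is dropped entirely, and HIT3 is represented by its second HSP because the first one scores too low. Writing one record per accession also keeps the FASTA identifiers unique.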
Calling other external programs
Biopython has wrappers for other command-line programs in:
Bio.Blast.Applications — the Blast+ suite
Bio.Align.Applications — Muscle, ClustalW, . . .
Bio.Emboss.Applications — needle, water, . . .
Let’s re-align our BLAST results using Muscle, and format the
alignment for use with stand-alone Phylip.
from Bio import AlignIO
from Bio.Align.Applications import MuscleCommandline
from StringIO import StringIO
# Construct the shell command
muscle_cmd = MuscleCommandline(input="aprotinin.fasta")
# Execute the command
# Get output (the alignment) and any error messages
muscle_out, muscle_err = muscle_cmd()
# Read the alignment back in
align = AlignIO.read(StringIO(muscle_out), "fasta")
# Format the alignment for Phylip
AlignIO.write([align], 'aprotinin.phy', 'phylip')
Phylogenetics
Phylogenetic tree I/O
Start with:
>>> from Bio import Phylo
Input and output of trees is just like SeqIO:
read, parse single or multiple trees in Newick, Nexus and
PhyloXML formats
write to any of the formats supported by read/parse
convert between two formats in one step
Use StringIO to load strings directly:
>>> from cStringIO import StringIO
>>> handle = StringIO("((A,B),(C,(D,E)));")
>>> tree = Phylo.read(handle, "newick")
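If the nesting in a Newick string is unfamiliar, a toy parser makes it concrete. The sketch below is not Bio.Phylo's parser and `parse_newick` is a hypothetical name. It handles only parentheses, commas and leaf names (no branch lengths, internal labels or quoting), and it needs Python 3 because it uses `nonlocal`.

```python
def parse_newick(text):
    """Parse a bare-bones Newick string into nested lists of leaf names."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_clade():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # skip "("
            children = [parse_clade()]
            while text[pos] == ",":
                pos += 1                  # skip ","
                children.append(parse_clade())
            pos += 1                      # skip ")"
            return children
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]            # a leaf name

    return parse_clade()


def terminals(clade):
    """Collect leaf names from the nested lists, left to right."""
    if isinstance(clade, str):
        return [clade]
    names = []
    for child in clade:
        names.extend(terminals(child))
    return names


nested = parse_newick("((A,B),(C,(D,E)));")
print(nested)             # [['A', 'B'], ['C', ['D', 'E']]]
print(terminals(nested))  # ['A', 'B', 'C', 'D', 'E']
```

Each pair of parentheses becomes one internal node, which is the structure a tree object exposes as clades.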
What’s in a tree?
Make a tree with branch lengths:
>>> tree = Phylo.read(StringIO("((A:1,B:1):2,"
...     "(C:2,(D:1,E:1):1):1);"), "newick")
View the object structure of the entire tree:
>>> print tree
Draw an “ASCII-art” (plain text) representation:
>>> Phylo.draw_ascii(tree)
. . . OK, let’s do it properly now:
>>> Phylo.draw(tree)
Modify the tree
Check the tree object for its methods:
>>> help(tree)
Try a few:
>>> tree.get_terminals()
>>> clade = tree.common_ancestor("A", "B")
>>> clade.color = "red"
>>> tree.root_with_outgroup("D", "E")
>>> tree.ladderize()
>>> Phylo.draw(tree)
External applications
Biopython wraps a number of external programs for phylogenetics.
We’re not going to use them now, but here’s where to find them:
Bio.Phylo.PAML — PAML wrappers & helpers
Bio.Phylo.Applications — command-line wrapper for PhyML
(PhymlCommandline); RAxML and others on the
way. (Anything you’d like to see sooner?)
Bio.Emboss.Applications — other tools ported via Embassy,
including Phylip
Protein structures
Going 3D: The PDB module
Load a structure:
>>> from Bio import PDB
>>> parser = PDB.PDBParser()
>>> struct = parser.get_structure('1ATP',
...     '1ATP.pdb')
Inspect the object hierarchy:
>>> list(struct)
>>> model = struct[0]
>>> list(model)
>>> chain = model['E']
>>> list(chain)
>>> residue = chain[15]
>>> list(residue)
Figure: The “SMCRA” object hierarchy
Extracting a peptide sequence
Get the amino acid sequence through a Polypeptide object:
>>> from Bio import PDB
>>> parser = PDB.PDBParser()
>>> struct = parser.get_structure('1ATP',
...     '1ATP.pdb')
>>> ppb = PDB.PPBuilder()
>>> peptides = ppb.build_peptides(struct)
>>> for pep in peptides:
...     print pep.get_sequence()
Calculating RMSD
Given two aligned structures, filter a list of target
residues for high RMS deviation.
Input: list of residue positions (integers)
two equivalent chains from aligned protein
models — residue numbers must match
Minimum RMSD value (float)
Output: list of residue positions, filtered
Procedure: 1 Extract coordinates of Cα atoms
2 If available (not glycine), extract Cβ
coordinates, too
3 Use Bio.SVDSuperimposer to calculate the
RMSD between coordinates
4 Compare to the given RMSD threshold
from Bio.SVDSuperimposer import SVDSuperimposer
from numpy import array
def filt_rms(resids, refchain, cmpchain, thresh=0.5):
    superimposer = SVDSuperimposer()
    for res in resids:
        refres = refchain[res]
        cmpres = cmpchain[res]
        coord1 = [refres['CA'].get_coord()]
        coord2 = [cmpres['CA'].get_coord()]
        if refres.has_id('CB') and cmpres.has_id('CB'):
            # Not glycine
            coord1.append(refres['CB'].get_coord())
            coord2.append(cmpres['CB'].get_coord())
        superimposer.set(array(coord1), array(coord2))
        # RMSD of the coordinates as given (the structures are already aligned)
        rmsd = superimposer.get_init_rms()
        if rmsd >= thresh:
            yield res
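The number compared against the threshold is, as far as I understand get_init_rms, the plain RMSD of the paired coordinates before any superposition. The sketch below computes that quantity with the standard library only, so the arithmetic can be checked by hand. `rmsd` and `deviating` are hypothetical names, and the coordinates are made-up toy values rather than data from 1ATP.

```python
import math


def rmsd(coords1, coords2):
    """Root-mean-square deviation between paired 3D points, without superposition."""
    if len(coords1) != len(coords2) or not coords1:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum((a - b) ** 2
                  for p, q in zip(coords1, coords2)
                  for a, b in zip(p, q))
    return math.sqrt(squared / float(len(coords1)))


def deviating(positions, ref_coords, cmp_coords, thresh=0.5):
    """Yield positions whose atom pairs differ by at least thresh."""
    for pos in positions:
        if rmsd(ref_coords[pos], cmp_coords[pos]) >= thresh:
            yield pos


# Toy coordinates keyed by residue position: two atoms (CA, CB) or one (CA only)
ref = {15: [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], 16: [(0.0, 0.0, 0.0)]}
cmp_ = {15: [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)], 16: [(0.1, 0.0, 0.0)]}

print(rmsd(ref[15], cmp_[15]))                 # both atoms moved by 1 unit: 1.0
print(list(deviating([15, 16], ref, cmp_)))    # [15]
```

With only one or two atoms per residue, this is a per-residue displacement measure, so it is only meaningful when the two models were superimposed beforehand, as the input requirements above state.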
Figure: Superimposed structures, with selected deviating residues
Further reading
Biopython tutorial:
http://biopython.org/DIST/docs/tutorial/Tutorial.html
Biopython wiki:
http://biopython.org/
This presentation:
http://www.slideshare.net/etalevich/biopython-programming-workshop-at-uga
Thanks
’Preciate it.
Gracias