Biopython programming workshop at UGA

Sequences and alignments
NCBI EUtils and BLAST
Phylogenetics
Protein structures

IOB Workshop: Biopython
A programming toolkit for bioinformatics

Eric Talevich

Institute of Bioinformatics, University of Georgia

Mar. 29, 2012


Getting started

Installing Python

Biopython is a library for the Python programming language.

First, you’ll need these installed:
Python 2.7 from http://python.org. It may already be installed on your computer. (Version 2.6 is OK, too.)
IDLE, a simple Integrated DeveLopment Environment, usually bundled with the Python distribution.

Now, start an interactive session in IDLE. [1]

[1] On your own, check out IPython (http://ipython.scipy.org/). It’s an enhanced Python interpreter that feels somewhat like R.


Installing Python packages

Biopython is a Python package. There are a few standard ways to install Python packages:
From source: Download from PyPI [2], unpack, and install with the included setup.py script.
easy_install: Install setuptools [3] from source, then use the easy_install command to fetch and install all other packages by name:
    $ easy_install <package name>
pip: Like easy_install, use pip [4] to manage packages:
    $ pip install <package name>

[2] http://pypi.python.org/pypi/
[3] http://pypi.python.org/pypi/setuptools
[4] http://pypi.python.org/pypi/pip


Installing NumPy, matplotlib and Biopython

Biopython relies on a few other Python packages for extra
functionality. We’ll use these:
numpy: efficient numerical functions and data structures (for Bio.PDB)
matplotlib: plotting (for Bio.Phylo)

Then finally:
biopython: the reason we’re here today

(Biopython, NumPy, matplotlib, setuptools and pip are also packaged for
many Linux distributions.)



Testing

Check your Biopython installation:
>>> import Bio
>>> print Bio.__version__

Import a NumPy-based component:
>>> from Bio import PDB

Show a simple plot:
>>> from matplotlib import pyplot
>>> pyplot.plot(range(5), range(5))
>>> pyplot.show()


Let’s start using Biopython

1 Sequences and alignments
The Seq object
SeqIO and the SeqRecord object

2 NCBI EUtils and BLAST
EUtils: Entrez Programming Utilities
NCBI Blast
External programs

3 Phylogenetics

4 Protein structures


Sequences and Alignments


The Seq object

>>> from Bio.Seq import Seq
>>> myseq = Seq('AGTACACTGGT')
>>> myseq
Seq('AGTACACTGGT', Alphabet())
>>> print myseq
AGTACACTGGT
>>> myseq.transcribe()
Seq('AGUACACUGGU', RNAAlphabet())
>>> myseq.translate()
Seq('STL', ExtendedIUPACProtein())



A Seq object consists of:
data — the underlying Python character string
alphabet — DNA, RNA, protein, etc.

It supports most Python string methods:
>>> myseq.count('GT')
2

And some biology-specific methods, too:
>>> myseq.reverse_complement()
Seq('ACCAGTGTACT', Alphabet())

Intrigued? Read on:
>>> help(Seq)
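
For intuition about the string-like behaviour described above, here is a minimal sketch that applies the same operations to a plain Python string. It does not use Biopython; the variable names are illustrative, and replacing T with U is only a stand-in for what transcription does to the letters.

```python
# Plain-string sketch of string-like sequence operations (no Biopython needed).
dna = "AGTACACTGGT"

# Count non-overlapping occurrences of a motif, as with a normal string
gt_count = dna.count("GT")

# Slicing and len() work on sequences the same way they do on strings
first_codon = dna[:3]
length = len(dna)

# Transcription of the coding strand amounts to replacing T with U
rna = dna.replace("T", "U")

print(gt_count)
print(first_codon)
print(length)
print(rna)
```

A Seq object adds the alphabet and the biology-specific methods on top of this string-like core.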



SeqIO: Sequence Input/Output

Sequence data is stored in many different file formats.
Bio.SeqIO supports:

abi      fastq    phylip     swiss
ace      genbank  pir        tab
clustal  ig       qual       uniprot-xml
embl     imgt     seqxml
emboss   nexus    sff
fasta    phd      stockholm

Manually fetch some data from the PDB website [5]:

1ATP.fasta: two protein sequences, FASTA format
1ATP.pdb: the 3D structure, for later

[5] http://www.rcsb.org/pdb/explore/explore.do?structureId=1ATP


The SeqIO API

SeqIO provides four functions:
parse: Iteratively parse all elements in the file
read: Parse a one-element file and return the element
write: Write elements to a file
convert: Parse one format and immediately write another

Biopython uses the same I/O conventions for alignments
(AlignIO), BLAST results (Blast), and phylogenetic trees
(Phylo).



The SeqRecord object

SeqIO.parse returns SeqRecords.
SeqRecord wraps a Seq object and attaches metadata.

1 Pass the file name to the SeqIO parser; specify FASTA format:
from Bio import SeqIO
seqrecs = SeqIO.parse("1ATP.fasta", "fasta")
print seqrecs


2 To see all records at once, convert the iterator to a list:
allrecs = list(seqrecs)
print allrecs[0]
print allrecs[0].seq
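
The reason for the list() call is that a parser of this kind hands back a one-pass iterator. The sketch below illustrates that behaviour with a tiny hand-written FASTA reader; it is not Biopython's implementation, the function name parse_fasta is hypothetical, and the two-record FASTA text is made up for the example.

```python
import io

def parse_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Made-up two-record FASTA text, for illustration only
fasta_text = u">seq1 first example\nACGT\nACGT\n>seq2 second example\nGGCC\n"

records = parse_fasta(io.StringIO(fasta_text))
all_records = list(records)   # consumes the iterator
leftover = list(records)      # a second pass finds nothing left

print(all_records[0])
print(len(leftover))
```

Converting to a list trades memory for the ability to index and re-read the records, which is fine for small files like this one.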


Example: Shuffled sequences

Given a real DNA sequence, create a “background” set of randomized sequences with the same composition.

Procedure:
1 Read the source sequence from a file
  - Use Bio.SeqIO
2 In a loop:
  Shuffle the sequence
  - Use random.shuffle from Python’s standard library
  Create a new SeqRecord from the shuffled sequence
  - Because SeqIO.write works with SeqRecords
3 Write the shuffled SeqRecords to another file


import random
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Read the single source record (GenBank format)
orig_rec = SeqIO.read("gi2.gb", "genbank")
alphabet = orig_rec.seq.alphabet
out_recs = []
for i in xrange(1, 31):
    # Copy the sequence into a mutable list and shuffle it in place
    nucleotides = list(orig_rec.seq)
    random.shuffle(nucleotides)
    new_seq = Seq("".join(nucleotides), alphabet)
    new_rec = SeqRecord(new_seq,
                        id="shuffle" + str(i))
    out_recs.append(new_rec)

SeqIO.write(out_recs, "gi2_shuffled.fasta", "fasta")
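
To see why shuffling gives a background set "with the same composition", the following sketch checks letter counts on plain strings. It uses only the standard library; the helper name shuffled_copy, the short example sequence, and the fixed seed are assumptions made for the illustration.

```python
import random
from collections import Counter

def shuffled_copy(sequence, rng=random):
    """Return a shuffled copy of a string; letter counts are unchanged."""
    letters = list(sequence)
    rng.shuffle(letters)          # shuffles the list in place
    return "".join(letters)

original = "AGTACACTGGT"
rng = random.Random(42)           # fixed seed so the run is repeatable
background = [shuffled_copy(original, rng) for _ in range(30)]

# Every shuffled sequence has exactly the same letter counts as the original
same_composition = all(Counter(s) == Counter(original) for s in background)
print(same_composition)
```

Shuffling single letters preserves mononucleotide composition only; dinucleotide or codon frequencies are not preserved by this approach.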



Example: ORF translation

Split a set of unannotated DNA sequences into unique
ORFs, translating in all 6 frames.
Biopython can help with each piece of this problem:
1 Parse the given unannotated DNA sequences (SeqIO.parse)
2 Get the template strand’s sequence (Seq.reverse_complement)
3 Translate both strands into protein sequences (Seq.translate)
4 Shift each strand by +1 and +2 for alternate reading frames (string-like Seq slicing)
5 Split sequences at stop codons (Seq.split('*'))
6 Write translated sequences to a new file (SeqIO.write)


def translate_six_frames(seq, table=1):
    """Translate a nucleotide sequence in 6 frames.

    Returns an iterable of 6 translated protein sequences.
    """
    rev = seq.reverse_complement()
    for i in range(3):
        # Coding (Crick) strand
        yield seq[i:].translate(table)
        # Template (Watson) strand
        yield rev[i:].translate(table)


def translate_orfs(sequences, min_prot_len=60):
    """Find and translate all ORFs in sequences.

    Translates each sequence in all 6 reading frames,
    splits sequences on stop codons, and produces an
    iterable of all protein sequences of length at
    least min_prot_len.
    """
    for seq in sequences:
        for frame in translate_six_frames(seq):
            for prot in frame.split("*"):
                if len(prot) >= min_prot_len:
                    yield prot


if __name__ == "__main__":
    import sys
    from Bio import SeqIO
    from Bio.SeqRecord import SeqRecord

    infile = sys.stdin
    outfile = sys.stdout
    records = SeqIO.parse(infile, "fasta")
    seqs = (rec.seq for rec in records)
    proteins = translate_orfs(seqs)
    seqrecs = (SeqRecord(seq, id="orf" + str(i))
               for i, seq in enumerate(proteins))
    SeqIO.write(seqrecs, outfile, "fasta")
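
The frame-shifting and stop-codon splitting steps can be tried without Biopython. The sketch below works on plain strings and leaves out translation itself (that is what Seq.translate is for); the function names and the toy protein string are hypothetical and chosen only for this illustration.

```python
def reverse_complement(dna):
    """Reverse-complement an unambiguous DNA string (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(dna))

def six_frames(dna):
    """Yield the 6 reading-frame substrings: offsets 0, 1, 2 on each strand."""
    rev = reverse_complement(dna)
    for i in range(3):
        yield dna[i:]
        yield rev[i:]

def split_orfs(protein, min_len):
    """Split a translated frame at stop symbols and keep the long pieces."""
    return [piece for piece in protein.split("*") if len(piece) >= min_len]

frames = list(six_frames("AGTAC"))
pieces = split_orfs("MKV*A*MGGGL*", 3)   # toy translated frame

print(frames)
print(pieces)
```

Slicing from offsets 0, 1 and 2 on both strands is what produces the six frames; the minimum-length filter then discards the short fragments between neighbouring stop codons.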



AlignIO and the Alignment object

Alignment: a set of sequences with the same length and alphabet.
Use AlignIO just like SeqIO:
>>> from Bio import AlignIO
>>> aln = AlignIO.read("PF01601.sto", "stockholm")
>>> print aln
SingleLetterAlphabet() alignment with 22 rows and 730 columns
NCTDAV-----LTYSSFGVCADGSIIA-VQPRNV-----SYDSV...HIQ Q1HVL3 CVH22/539-1170
NCTTAV-----MTYSNFGICADGSLIP-VRPRNS-----SDNGI...HVQ SPIKE CVHNL/723-1356
NCTEPV-----LVYSNIGVCKSGSIGY-VPSQS------GQVKI...HVQ Q692M1 9CORO/740-1383
NCTEPA-----LVYSNIGVCKNGAIGL-VGIRN------TQPKI...HIQ Q0Q4F4 9CORO/729-1360
NCTSPR-----LVYSNIGVCTSGAIGL-LSPKX------AQPQI...HVQ Q0Q4F6 9CORO/743-1371
NCTNPV-----LTYSSYGVCPDGSITR-LGLTD------VQPHF...--T A4ULL0 9CORO/726-1328
NCTKPV-----LSYGPISVCSDGAIAG-TSTLQN-----TRPSI...KEW A6N263 9CORO/406-1035
ECDIPIGAGICASYHTVSLLRSTSQKSIVAYTMS------LGAD...HYT Q6T7X8 CVHSA/647-1255
...

DCE-PV-----ITYSNIGVCKNGAFVF-INVTH------SDGDV...HVH Q0PKZ5 CVPPU/797-1449



Snack Time



EUtils and BLAST




Access NCBI’s online services:
from Bio import Entrez
Entrez.email = "you@uga.edu"


Request a GenBank record:
handle = Entrez.efetch(db="protein", id="69316",
rettype="gb", retmode="text")
record = SeqIO.read(handle, "gb")

Specify multiple IDs in one query:
handle = Entrez.efetch(db="protein",
id="349839,349840",
rettype="fasta", retmode="text")
records = SeqIO.parse(handle, "fasta")

Interlude: SeqRecord attributes

seq: the sequence (Seq) itself
id: primary ID for the sequence, e.g. accession number (string)
name: “common” name/id for the sequence, like GenBank LOCUS id
description: human-readable description of the sequence
letter_annotations: restricted dictionary of additional info about individual letters in the sequence, e.g. quality scores
annotations: dictionary of additional unstructured info
features: list of SeqFeature objects with more structured information, e.g. position of genes on a genome, domains on a protein sequence
dbxrefs: list of database cross-references (strings)

from Bio import Entrez, SeqIO
Entrez.email = "me@uga.edu"

handle = Entrez.efetch(db="nucleotide", id="M95169",
                       rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()
print record
print record.features[10]
sliced = record[20000:]  # Last ~25% of the genome
print sliced

from Bio.Seq import Seq
from Bio.Alphabet import generic_protein
# Assumes every feature after the first has a "translation" qualifier
translations = [f.qualifiers["translation"]
                for f in record.features[1:]]
proteins = [Seq(t[0], generic_protein)
            for t in translations]



NCBI Blast

BLAST can be used either standalone or through NCBI’s server.

Online:
>>> from Bio.Blast import NCBIWWW
>>> result_handle = NCBIWWW.qblast(
...     'blastp', 'nr', query_string)

Standalone, “Legacy” (blastall):
>>> from Bio.Blast.Applications import BlastallCommandline
>>> help(BlastallCommandline)

Standalone, new hotness (Blast+):
>>> from Bio.Blast.Applications import NcbiblastpCommandline
>>> help(NcbiblastpCommandline)



Parsing BLAST output

BLAST produces reports in plain-text and XML format.

Biopython requests XML by default.
>>> from Bio.Blast import NCBIWWW, NCBIXML
>>> result_handle = NCBIWWW.qblast('blastp',
...     'nr', query_string)
>>> blast_record = NCBIXML.read(result_handle)
>>> print blast_record


# Search for homologs of a protein sequence

from Bio import SeqIO
from Bio.Blast import NCBIWWW, NCBIXML

# Read and reformat the query sequence
seqrec = SeqIO.read('gi2.gb', 'gb')
query = seqrec.format('fasta')

# Submit an online BLAST query
# (This takes some time to run)
result_handle = NCBIWWW.qblast('blastx', 'nr', query)


# 1. Save the BLAST results as an XML file

with open('aprotinin_blast.xml', 'w') as save_file:
    save_file.write(result_handle.read())
result_handle.close()

# NB: The BLAST result handle can only be read once
# Reload it from the file
with open('aprotinin_blast.xml') as result_handle:
    blast_record = NCBIXML.read(result_handle)


# 2. Display a histogram of BLAST hit scores

def get_scores(alignments):
    for aln in alignments:
        for hsp in aln.hsps:
            yield hsp.score

scores = list(get_scores(blast_record.alignments))

# Draw the histogram
import pylab
pylab.hist(scores, bins=20)
pylab.title("Scores of %d BLAST hits" % len(scores))
pylab.xlabel("BLAST score")
pylab.ylabel("# hits")
# Save a copy for later; do this before show(), which can
# leave an empty figure behind once the window is closed
pylab.savefig('aprotinin_scores.png')
pylab.show()



Figure: Histogram of BLAST scores generated by pylab

# 3. Extract the sequences of high-scoring BLAST hits

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

def get_hsps(alignments, threshold):
    for aln in alignments:
        for hsp in aln.hsps:
            if hsp.score >= threshold:
                yield SeqRecord(Seq(hsp.sbjct),
                                id=aln.accession)
                # Keep only the first qualifying HSP per hit
                break

best_seqs = get_hsps(blast_record.alignments, 321)
SeqIO.write(best_seqs, 'aprotinin.fasta', 'fasta')



Calling other external programs

Biopython has wrappers for other command-line programs in:
Bio.Blast.Applications: the Blast+ suite
Bio.Align.Applications: Muscle, ClustalW, ...
Bio.Emboss.Applications: needle, water, ...

Let’s re-align our BLAST results using Muscle, and format the
alignment for use with stand-alone Phylip.


from Bio import AlignIO
from Bio.Align.Applications import MuscleCommandline
from StringIO import StringIO

# Construct the shell command
muscle_cmd = MuscleCommandline(input="aprotinin.fasta")
# Execute the command
# Get output (the alignment) and any error messages
muscle_out, muscle_err = muscle_cmd()

# Read the alignment back in
align = AlignIO.read(StringIO(muscle_out), "fasta")
# Format the alignment for Phylip
AlignIO.write([align], 'aprotinin.phy', 'phylip')



Phylogenetics



Phylogenetic tree I/O

Start with:
>>> from Bio import Phylo

Input and output of trees is just like SeqIO:
read, parse: single or multiple trees in Newick, Nexus and PhyloXML formats
write: to any of the formats supported by read/parse
convert: between two formats in one step

Use StringIO to load strings directly:
>>> from cStringIO import StringIO
>>> handle = StringIO("((A,B),(C,(D,E)));")
>>> tree = Phylo.read(handle, "newick")



What’s in a tree?

Make a tree with branch lengths:
>>> tree = Phylo.read(StringIO("((A:1,B:1):2,"
...     "(C:2,(D:1,E:1):1):1);"), "newick")

View the object structure of the entire tree:
>>> print tree

Draw an “ASCII-art” (plain text) representation:
>>> Phylo.draw_ascii(tree)

... OK, let’s do it properly now:
>>> Phylo.draw(tree)



Modify the tree

Check the tree object for its methods:
>>> help(tree)

Try a few:
>>> tree.get_terminals()
>>> clade = tree.common_ancestor("A", "B")
>>> clade.color = "red"
>>> tree.root_with_outgroup("D", "E")
>>> tree.ladderize()
>>> Phylo.draw(tree)



External applications

Biopython wraps a number of external programs for phylogenetics.
We’re not going to use them now, but here’s where to find them:
Bio.Phylo.PAML — PAML wrappers & helpers
Bio.Phylo.Applications — command-line wrapper for PhyML
(PhymlCommandline); RAxML and others on the
way. (Anything you’d like to see sooner?)
Bio.Emboss.Applications — other tools ported via Embassy,
including Phylip


Protein structures


Going 3D: The PDB module

Load a structure:
>>> parser = PDB.PDBParser()
>>> struct = parser.get_structure('1ATP',
...     '1ATP.pdb')

Inspect the object hierarchy:
>>> list(struct)
>>> model = struct[0]
>>> list(model)
>>> chain = model['E']
>>> list(chain)
>>> residue = chain[15]
>>> list(residue)


Figure: The “SMCRA” object hierarchy



Extracting a peptide sequence

Get the amino acid sequence through a Polypeptide object:
>>> ppb = PDB.PPBuilder()
>>> peptides = ppb.build_peptides(struct)
>>> for pep in peptides:
...     print pep.get_sequence()


Calculating RMSD

Given two aligned structures, filter a list of target residues for high RMS deviation.

Input:
  list of residue positions (integers)
  two equivalent chains from aligned protein models (residue numbers must match)
  minimum RMSD value (float)
Output:
  list of residue positions, filtered
Procedure:
  1 Extract coordinates of Cα atoms
  2 If available (not glycine), extract Cβ coordinates, too
  3 Use Bio.SVDSuperimposer to calculate the RMSD between coordinates
  4 Compare to the given RMSD threshold

from Bio.SVDSuperimposer import SVDSuperimposer
from numpy import array

def filt_rms(resids, refchain, cmpchain, threshold=0.5):
    sup = SVDSuperimposer()
    for res in resids:
        refres = refchain[res]
        cmpres = cmpchain[res]
        coord1 = [refres['CA'].get_coord()]
        coord2 = [cmpres['CA'].get_coord()]
        if refres.has_id('CB') and cmpres.has_id('CB'):
            # Not glycine
            coord1.append(refres['CB'].get_coord())
            coord2.append(cmpres['CB'].get_coord())
        sup.set(array(coord1), array(coord2))
        # RMSD of the coordinates as given, without re-superimposing
        rmsd = sup.get_init_rms()
        if rmsd >= threshold:
            yield res
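
For reference, the quantity being thresholded is the root-mean-square distance between paired atoms, computed on the coordinates as they stand (the structures are assumed to be aligned already). A minimal pure-Python sketch of that formula follows; the function name rmsd_no_fit and the toy coordinates are made up for illustration and are not part of Biopython.

```python
import math

def rmsd_no_fit(coords1, coords2):
    """RMSD between two equal-length lists of (x, y, z) points, no fitting."""
    if not coords1 or len(coords1) != len(coords2):
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for point_a, point_b in zip(coords1, coords2):
        # Squared Euclidean distance between the paired atoms
        total += sum((a - b) ** 2 for a, b in zip(point_a, point_b))
    return math.sqrt(total / len(coords1))

# Toy example: two atoms (think CA and CB), each displaced by 1 unit along z
ref_atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
cmp_atoms = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd_no_fit(ref_atoms, cmp_atoms))
```

With only one or two atoms per residue, this per-residue value is mainly a measure of local displacement between the two models rather than of overall structural similarity.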



Figure: Superimposed structures, with selected deviating residues


Further reading

Biopython tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html
Biopython wiki: http://biopython.org/
This presentation: http://www.slideshare.net/etalevich/biopython-programming-workshop-at-uga



Thanks
’Preciate it.
Gracias

