Biopython programming workshop at UGA

A workshop on bioinformatics programming using Biopython and the Python programming language, held at the University of Georgia in Spring 2010 and 2012. These workshops are part of a series for the Institute of Bioinformatics (IoB) and Bioinformatics Grad Student Association (BIGSA) at UGA.

Biopython programming workshop at UGA: Presentation Transcript

  • IOB Workshop: Biopython
    A programming toolkit for bioinformatics
    Eric Talevich, Institute of Bioinformatics, University of Georgia
    Mar. 29, 2012
    Sections: Sequences and alignments; NCBI EUtils and BLAST; Phylogenetics; Protein structures
  • Getting started
  • Installing Python
    Biopython is a library for the Python programming language. First, you'll need these installed:
      • Python 2.7 from http://python.org. It may already be installed on your computer. (Version 2.6 is OK, too.)
      • IDLE, a simple Integrated DeveLopment Environment, usually bundled with the Python distribution.
    Now, start an interactive session in IDLE. [1]
    [1] On your own, check out IPython (http://ipython.scipy.org/). It's an enhanced Python interpreter that feels somewhat like R.
  • Installing Python packages
    Biopython is a Python package. There are a few standard ways to install Python packages:
      • From source: download from PyPI [2], unpack, and install with the included setup.py script.
      • easy_install: install setuptools from source [3], then use the easy_install command to fetch and install all other packages by name:
            $ easy_install <package name>
      • pip: like easy_install, use pip [4] to manage packages:
            $ pip install <package name>
    [2] http://pypi.python.org/pypi/
    [3] http://pypi.python.org/pypi/setuptools
    [4] http://pypi.python.org/pypi/pip
  • Installing NumPy, matplotlib and Biopython
    Biopython relies on a few other Python packages for extra functionality. We'll use these:
      • numpy: efficient numerical functions and data structures (for Bio.PDB)
      • matplotlib: plotting (for Bio.Phylo)
    Then finally:
      • biopython: the reason we're here today
    (Biopython, NumPy, matplotlib, setuptools and pip are also packaged for many Linux distributions.)
  • Testing
    Check your Biopython installation:
        >>> import Bio
        >>> print Bio.__version__
    Import a NumPy-based component:
        >>> from Bio import PDB
    Show a simple plot:
        >>> from matplotlib import pyplot
        >>> pyplot.plot(range(5), range(5))
        >>> pyplot.show()
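The same installation checks can be run in one go from a script. The sketch below is only an illustration: the helper name check_packages is made up for this example, and it uses nothing beyond the standard library's importlib. It reports a version string for each package that imports, and None for each one that does not.

```python
import importlib


def check_packages(names):
    """Map each package name to its version string, or None if it cannot be imported.

    check_packages is a hypothetical helper written for this example.
    """
    report = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = None
        else:
            # Not every package exposes __version__, so fall back to a label.
            report[name] = getattr(module, "__version__", "unknown version")
    return report


if __name__ == "__main__":
    found = check_packages(["Bio", "numpy", "matplotlib"])
    for name in sorted(found):
        print("%-12s %s" % (name, found[name] or "NOT INSTALLED"))
```

A None entry means the package still needs to be installed with one of the methods from the earlier slide.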
  • Let's start using Biopython
  • Biopython
    1 Sequences and alignments
        The Seq object
        SeqIO and the SeqRecord object
    2 NCBI EUtils and BLAST
        EUtils: Entrez Programming Utilities
        NCBI Blast
        External programs
    3 Phylogenetics
    4 Protein structures
  • Sequences and Alignments
  • The Seq object
        >>> from Bio.Seq import Seq
        >>> myseq = Seq('AGTACACTGGT')
        >>> myseq
        Seq('AGTACACTGGT', Alphabet())
        >>> print myseq
        AGTACACTGGT
        >>> myseq.transcribe()
        Seq('AGUACACUGGU', RNAAlphabet())
        >>> myseq.translate()
        Seq('STL', ExtendedIUPACProtein())
  • A Seq object consists of:
      • data: the underlying Python character string
      • alphabet: DNA, RNA, protein, etc.
    It supports most Python string methods:
        >>> myseq.count('GT')
        2
    And some biology-specific methods, too:
        >>> myseq.reverse_complement()
        Seq('ACCAGTGTACT', Alphabet())
    Intrigued? Read on:
        >>> help(Seq)
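To see what the reverse complement does under the hood, here is a plain-Python sketch that works on an ordinary string. It is not Biopython's implementation, and it assumes an uppercase sequence containing only A, C, G and T. The COMPLEMENT table and the reverse_complement function are written just for this illustration.

```python
# Plain-Python stand-in for Seq.reverse_complement(); an illustration only.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(dna):
    """Return the reverse complement of an unambiguous, uppercase DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


myseq = "AGTACACTGGT"
print(myseq.count("GT"))          # 2, the same count shown on the slide
print(reverse_complement(myseq))  # ACCAGTGTACT, matching the Seq method's result
```

Applying the function twice returns the original string, which makes a handy sanity check.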
  • SeqIO: Sequence Input/Output
    Sequence data is stored in many different file formats. Bio.SeqIO supports:
        abi, ace, clustal, embl, emboss, fasta, fastq, genbank, ig, imgt, nexus, phd, phylip, pir, qual, seqxml, sff, stockholm, swiss, tab, uniprot-xml
    Manually fetch some data from the PDB website [5]:
      • 1ATP.fasta: two protein sequences, FASTA format
      • 1ATP.pdb: the 3D structure, for later
    [5] http://www.rcsb.org/pdb/explore/explore.do?structureId=1ATP
  • The SeqIO API
    SeqIO provides four functions:
      • parse: Iteratively parse all elements in the file
      • read: Parse a one-element file and return the element
      • write: Write elements to a file
      • convert: Parse one format and immediately write another
    Biopython uses the same I/O conventions for alignments (AlignIO), BLAST results (Blast), and phylogenetic trees (Phylo).
  • The SeqRecord object
    SeqIO.parse returns an iterator over SeqRecords. A SeqRecord wraps a Seq object and attaches metadata.
    1 Pass the file name to the SeqIO parser; specify FASTA format:
        from Bio import SeqIO
        seqrecs = SeqIO.parse("1ATP.fasta", "fasta")
        print seqrecs
    2 To see all records at once, convert the iterator to a list:
        allrecs = list(seqrecs)
        print allrecs[0]
        print allrecs[0].seq
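The "iterator first, list second" pattern is worth seeing without the library in the way. The sketch below is a minimal plain-Python FASTA reader, not Bio.SeqIO itself. It yields simple (header, sequence) tuples rather than SeqRecords, and the two-record input is made up for the example. Like the result of SeqIO.parse, the generator can only be consumed once.

```python
import io


def parse_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up two-record input standing in for a file such as 1ATP.fasta
handle = io.StringIO(u">first\nACDEFGHIK\nLMNP\n>second\nQRSTVWY\n")
records = parse_fasta(handle)
allrecs = list(records)   # consumes the generator
print(len(allrecs))       # 2
print(list(records))      # [] because nothing is left the second time
```

This is why step 2 on the slide stores the list in a new variable instead of iterating over seqrecs again.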
  • Example: Shuffled sequences
    Given a real DNA sequence, create a "background" set of randomized sequences with the same composition.
    Procedure:
    1 Read the source sequence from a file
        Use Bio.SeqIO
    2 In a loop:
        Shuffle the sequence
            Use random.shuffle from Python's standard library
        Create a new SeqRecord from the shuffled sequence
            Because SeqIO.write works with SeqRecords
    3 Write the shuffled SeqRecords to another file
  • Example: Shuffled sequences (code)
        import random
        from Bio import SeqIO
        from Bio.Seq import Seq
        from Bio.SeqRecord import SeqRecord

        # 1. Read the source sequence
        orig_rec = SeqIO.read("gi2.gb", "genbank")
        alphabet = orig_rec.seq.alphabet
        out_recs = []
        # 2. Make 30 shuffled copies, each wrapped in a SeqRecord
        for i in xrange(1, 31):
            nucleotides = list(orig_rec.seq)
            random.shuffle(nucleotides)
            new_seq = Seq("".join(nucleotides), alphabet)
            new_rec = SeqRecord(new_seq, id="shuffle" + str(i))
            out_recs.append(new_rec)
        # 3. Write the shuffled records to a FASTA file
        SeqIO.write(out_recs, "gi2_shuffled.fasta", "fasta")
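The point of shuffling is that every background sequence keeps the same letter counts as the source. A small plain-Python sketch can confirm that property. The function name shuffled_copies, the optional seed argument and the short input sequence are all assumptions made for this illustration; none of them come from the workshop script.

```python
import random
from collections import Counter


def shuffled_copies(sequence, n, seed=None):
    """Return n shuffled copies of sequence, each with the same letter counts."""
    rng = random.Random(seed)  # pass a seed for repeatable output
    copies = []
    for _ in range(n):
        letters = list(sequence)
        rng.shuffle(letters)
        copies.append("".join(letters))
    return copies


source = "AGTACACTGGTAGGCCTTAA"  # made-up sequence for illustration
background = shuffled_copies(source, 30, seed=0)

# Composition check: every copy has exactly the source's letter counts.
assert all(Counter(s) == Counter(source) for s in background)
print(len(background))  # 30
```

Note that this kind of shuffle preserves single-letter composition only; dinucleotide frequencies are not preserved.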
  • Example: ORF translation
    Split a set of unannotated DNA sequences into unique ORFs, translating in all 6 frames.
    Biopython can help with each piece of this problem:
    1 Parse the given unannotated DNA sequences (SeqIO.parse)
    2 Get the template strand's sequence (Seq.reverse_complement)
    3 Translate both strands into protein sequences (Seq.translate)
    4 Shift each strand by +1 and +2 for alternate reading frames (string-like Seq slicing)
    5 Split sequences at stop codons (Seq.split('*'))
    6 Write translated sequences to a new file (SeqIO.write)
  • Example: ORF translation (code, part 1)
        def translate_six_frames(seq, table=1):
            """Translate a nucleotide sequence in 6 frames.

            Returns an iterable of 6 translated protein sequences.
            """
            rev = seq.reverse_complement()
            for i in range(3):
                # Coding (Crick) strand
                yield seq[i:].translate(table)
                # Template (Watson) strand
                yield rev[i:].translate(table)
  • Example: ORF translation (code, part 2)
        def translate_orfs(sequences, min_prot_len=60):
            """Find and translate all ORFs in sequences.

            Translates each sequence in all 6 reading frames, splits sequences on
            stop codons, and produces an iterable of all protein sequences of
            length at least min_prot_len.
            """
            for seq in sequences:
                for frame in translate_six_frames(seq):
                    for prot in frame.split("*"):
                        if len(prot) >= min_prot_len:
                            yield prot
  • Example: ORF translation (code, part 3)
        from Bio import SeqIO
        from Bio.SeqRecord import SeqRecord

        if __name__ == "__main__":
            import sys
            # Read FASTA from standard input, write FASTA to standard output
            infile = sys.stdin
            outfile = sys.stdout
            records = SeqIO.parse(infile, "fasta")
            seqs = (rec.seq for rec in records)
            proteins = translate_orfs(seqs)
            # Wrap each translated ORF in a SeqRecord with a numbered ID
            seqrecs = (SeqRecord(seq, id="orf" + str(i))
                       for i, seq in enumerate(proteins))
            SeqIO.write(seqrecs, outfile, "fasta")
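For anyone curious about what translate and the frame shifting amount to, the same idea can be sketched on plain strings. This is an illustration under stated assumptions, not Biopython's code: it hard-codes the standard genetic code (NCBI table 1), accepts only uppercase A, C, G and T, and silently drops a trailing partial codon. As in the workshop script, an "ORF" here simply means a stretch between stop codons. The function names are made up for this example.

```python
from itertools import product

# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(("".join(c) for c in product(BASES, repeat=3)), AMINO_ACIDS))
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def translate(dna):
    """Translate complete codons only; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Yield three forward-strand frames, then three reverse-strand frames."""
    rev = "".join(COMPLEMENT[base] for base in reversed(dna))
    for strand in (dna, rev):
        for offset in range(3):
            yield translate(strand[offset:])


def orfs(dna, min_len=1):
    """Yield stop-to-stop peptides of at least min_len residues from all frames."""
    for frame in six_frames(dna):
        for peptide in frame.split("*"):
            if len(peptide) >= min_len:
                yield peptide


print(translate("AGTACACTGGT"))         # STL, as in the Seq example earlier
print(list(orfs("ATGTAA", min_len=2)))  # ['LH']
```

The frame order differs from the workshop function, which alternates strands, but the set of six translations is the same.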
  • AlignIO and the Alignment object
    Alignment: a set of sequences with the same length and alphabet.
    Use AlignIO just like SeqIO:
        >>> from Bio import AlignIO
        >>> aln = AlignIO.read("PF01601.sto", "stockholm")
        >>> print aln
        SingleLetterAlphabet() alignment with 22 rows and 730 columns
        NCTDAV-----LTYSSFGVCADGSIIA-VQPRNV-----SYDSV...HIQ Q1HVL3_CVH22/539-1170
        NCTTAV-----MTYSNFGICADGSLIP-VRPRNS-----SDNGI...HVQ SPIKE_CVHNL/723-1356
        NCTEPV-----LVYSNIGVCKSGSIGY-VPSQS------GQVKI...HVQ Q692M1_9CORO/740-1383
        NCTEPA-----LVYSNIGVCKNGAIGL-VGIRN------TQPKI...HIQ Q0Q4F4_9CORO/729-1360
        NCTSPR-----LVYSNIGVCTSGAIGL-LSPKX------AQPQI...HVQ Q0Q4F6_9CORO/743-1371
        NCTNPV-----LTYSSYGVCPDGSITR-LGLTD------VQPHF...--T A4ULL0_9CORO/726-1328
        NCTKPV-----LSYGPISVCSDGAIAG-TSTLQN-----TRPSI...KEW A6N263_9CORO/406-1035
        ECDIPIGAGICASYHTVSLLRSTSQKSIVAYTMS------LGAD...HYT Q6T7X8_CVHSA/647-1255
        ...
        DCE-PV-----ITYSNIGVCKNGAFVF-INVTH------SDGDV...HVH Q0PKZ5_CVPPU/797-1449
  • Snack Time
  • EUtils and BLAST
  • EUtils: Entrez Programming Utilities
    Access NCBI's online services:
        from Bio import Entrez, SeqIO
        Entrez.email = "you@uga.edu"
    Request a GenBank record:
        handle = Entrez.efetch(db="protein", id="69316",
                               rettype="gb", retmode="text")
        record = SeqIO.read(handle, "gb")
    Specify multiple IDs in one query:
        handle = Entrez.efetch(db="protein", id="349839,349840",
                               rettype="fasta", retmode="text")
        records = SeqIO.parse(handle, "fasta")
  • Interlude: SeqRecord attributes
      • seq: the sequence (Seq) itself
      • id: primary ID for the sequence, e.g. accession number (string)
      • name: "common" name/id for the sequence, like GenBank LOCUS id
      • description: human-readable description of the sequence
      • letter_annotations: restricted dictionary of additional info about individual letters in the sequence, e.g. quality scores
      • annotations: dictionary of additional unstructured info
      • features: list of SeqFeature objects with more structured information, e.g. position of genes on a genome, domains on a protein sequence
      • dbxrefs: list of database cross-references (strings)
  • Interlude: SeqRecord attributes (code)
        from Bio import Entrez, SeqIO
        Entrez.email = "me@uga.edu"
        # Fetch a GenBank nucleotide record and parse it into a SeqRecord
        handle = Entrez.efetch(db="nucleotide", id="M95169",
                               rettype="gb", retmode="text")
        record = SeqIO.read(handle, "genbank")
        handle.close()
        print record
        print record.features[10]
        # Slicing a SeqRecord keeps the features that fall inside the slice
        sliced = record[20000:]  # Last ~25% of the genome
        print sliced
        from Bio.Seq import Seq
        from Bio.Alphabet import generic_protein
        # Collect the annotated protein translations; features without a
        # "translation" qualifier are skipped
        translations = [f.qualifiers["translation"]
                        for f in record.features[1:]
                        if "translation" in f.qualifiers]
        proteins = [Seq(t[0], generic_protein) for t in translations]
  • NCBI Blast
    BLAST can be used either standalone or through NCBI's server.
    Online:
        >>> from Bio.Blast import NCBIWWW
        >>> result_handle = NCBIWWW.qblast('blastp', 'nr', query_string)
    Standalone:
      • "Legacy" (blastall):
            >>> from Bio.Blast.Applications import BlastallCommandline
            >>> help(BlastallCommandline)
      • New hotness (Blast+):
            >>> from Bio.Blast.Applications import NcbiblastpCommandline
            >>> help(NcbiblastpCommandline)
  • Parsing BLAST output
    BLAST produces reports in plain-text and XML format. Biopython requests XML by default.
        >>> from Bio.Blast import NCBIWWW, NCBIXML
        >>> result_handle = NCBIWWW.qblast('blastp',
        ...                                'nr', query_string)
        >>> blast_record = NCBIXML.read(result_handle)
        >>> print blast_record
  • Parsing BLAST output (code)
        # Search for homologs of a protein sequence
        from Bio import SeqIO
        from Bio.Blast import NCBIWWW, NCBIXML
        # Read and reformat the query sequence
        seqrec = SeqIO.read('gi2.gb', 'gb')
        query = seqrec.format('fasta')
        # Submit an online BLAST query
        # (This takes some time to run)
        result_handle = NCBIWWW.qblast('blastx', 'nr', query)
  • Parsing BLAST output (code, step 1)
        # 1. Save the BLAST results as an XML file
        with open('aprotinin_blast.xml', 'w') as save_file:
            save_file.write(result_handle.read())
        result_handle.close()
        # NB: The BLAST result handle can only be read once
        # Reload it from the file
        with open('aprotinin_blast.xml') as result_handle:
            blast_record = NCBIXML.read(result_handle)
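The "can only be read once" warning applies to any file-like handle, and a short standard-library sketch makes the behaviour concrete. An in-memory io.StringIO stands in for the handle that qblast returns, and the XML text is a placeholder rather than real BLAST output. The temporary file path is likewise chosen only for this example.

```python
import io
import os
import tempfile

# Placeholder text standing in for a BLAST XML report (not real output)
result_handle = io.StringIO(u"<BlastOutput>placeholder</BlastOutput>")

first = result_handle.read()
second = result_handle.read()  # the handle is already at end-of-data
print("%d characters, then %d" % (len(first), len(second)))

# Saving the first read to disk gives a copy that can be reopened at will.
path = os.path.join(tempfile.mkdtemp(), "blast_result.xml")
with open(path, "w") as save_file:
    save_file.write(first)
with open(path) as reloaded:
    assert reloaded.read() == first
```

Saving before parsing also means a slow online search never has to be repeated just to re-read its results.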
  • Parsing BLAST output (code, step 2)
        # 2. Display a histogram of BLAST hit scores
        def get_scores(alignments):
            for aln in alignments:
                for hsp in aln.hsps:
                    yield hsp.score
        scores = list(get_scores(blast_record.alignments))
        # Draw the histogram
        import pylab
        pylab.hist(scores, bins=20)
        pylab.title("Scores of %d BLAST hits" % len(scores))
        pylab.xlabel("BLAST score")
        pylab.ylabel("# hits")
        # Save a copy for later, before show(), so the saved figure is not blank
        pylab.savefig('aprotinin_scores.png')
        pylab.show()
  • Figure: Histogram of BLAST scores generated by pylab
  • Parsing BLAST output (code, step 3)
        # 3. Extract the sequences of high-scoring BLAST hits
        from Bio import SeqIO
        from Bio.Seq import Seq
        from Bio.SeqRecord import SeqRecord
        def get_hsps(alignments, threshold):
            for aln in alignments:
                for hsp in aln.hsps:
                    if hsp.score >= threshold:
                        yield SeqRecord(Seq(hsp.sbjct), id=aln.accession)
                        # Keep only the first qualifying HSP per hit
                        break
        best_seqs = get_hsps(blast_record.alignments, 321)
        SeqIO.write(best_seqs, 'aprotinin.fasta', 'fasta')
  • Calling other external programs
    Biopython has wrappers for other command-line programs in:
      • Bio.Blast.Applications: the Blast+ suite
      • Bio.Align.Applications: Muscle, ClustalW, ...
      • Bio.Emboss.Applications: needle, water, ...
    Let's re-align our BLAST results using Muscle, and format the alignment for use with stand-alone Phylip.
  • Calling other external programs (code)
        from Bio import AlignIO
        from Bio.Align.Applications import MuscleCommandline
        from StringIO import StringIO
        # Construct the shell command
        muscle_cmd = MuscleCommandline(input="aprotinin.fasta")
        # Execute the command
        # Get output (the alignment) and any error messages
        muscle_out, muscle_err = muscle_cmd()
        # Read the alignment back in
        align = AlignIO.read(StringIO(muscle_out), "fasta")
        # Format the alignment for Phylip
        AlignIO.write([align], 'aprotinin.phy', 'phylip')
  • Phylogenetics
  • Phylogenetic tree I/O
    Start with:
        >>> from Bio import Phylo
    Input and output of trees is just like SeqIO:
      • read, parse: single or multiple trees in Newick, Nexus and PhyloXML formats
      • write: to any of the formats supported by read/parse
      • convert: between two formats in one step
    Use StringIO to load strings directly:
        >>> from cStringIO import StringIO
        >>> handle = StringIO("((A,B),(C,(D,E)));")
        >>> tree = Phylo.read(handle, "newick")
  • What's in a tree?
    Make a tree with branch lengths:
        >>> tree = Phylo.read(StringIO("((A:1,B:1):2,"
        ...                            "(C:2,(D:1,E:1):1):1);"), "newick")
    View the object structure of the entire tree:
        >>> print tree
    Draw an "ASCII-art" (plain text) representation:
        >>> Phylo.draw_ascii(tree)
    ... OK, let's do it properly now:
        >>> Phylo.draw(tree)
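To get a feel for what the Newick string above encodes, here is a rough plain-Python sketch that parses it into nested tuples and sums branch lengths from the root to each tip. It is not how Bio.Phylo works internally. It also assumes a very restricted form of Newick: unquoted names, optional ":length" suffixes, no whitespace or comments, and a terminating semicolon. Both function names are made up for the example.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested (name, length, children) tuples."""
    pos = [0]  # mutable cursor shared with the nested helper

    def clade():
        children = []
        if text[pos[0]] == "(":
            pos[0] += 1                    # skip "("
            children.append(clade())
            while text[pos[0]] == ",":
                pos[0] += 1                # skip ","
                children.append(clade())
            pos[0] += 1                    # skip ")"
        start = pos[0]
        while text[pos[0]] not in ",():;":
            pos[0] += 1
        name = text[start:pos[0]]
        length = 0.0                       # a missing length is treated as zero
        if text[pos[0]] == ":":
            pos[0] += 1
            start = pos[0]
            while text[pos[0]] not in ",();":
                pos[0] += 1
            length = float(text[start:pos[0]])
        return (name, length, children)

    return clade()


def tip_depths(node, depth=0.0):
    """Yield (tip name, distance from the root) for every terminal node."""
    name, length, children = node
    depth += length
    if not children:
        yield name, depth
    for child in children:
        for item in tip_depths(child, depth):
            yield item


tree = parse_newick("((A:1,B:1):2,(C:2,(D:1,E:1):1):1);")
depths = dict(tip_depths(tree))
print(sorted(depths.items()))
```

Every tip in this example tree sits at distance 3 from the root, which is why the drawn tree has all its leaves lined up.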
  • Modify the tree
    Check the tree object for its methods:
        >>> help(tree)
    Try a few:
        >>> tree.get_terminals()
        >>> clade = tree.common_ancestor("A", "B")
        >>> clade.color = "red"
        >>> tree.root_with_outgroup("D", "E")
        >>> tree.ladderize()
        >>> Phylo.draw(tree)
  • External applications
    Biopython wraps a number of external programs for phylogenetics. We're not going to use them now, but here's where to find them:
      • Bio.Phylo.PAML: PAML wrappers & helpers
      • Bio.Phylo.Applications: command-line wrapper for PhyML (PhymlCommandline); RAxML and others on the way. (Anything you'd like to see sooner?)
      • Bio.Emboss.Applications: other tools ported via Embassy, including Phylip
  • Protein structures
  • Going 3D: The PDB module
    Load a structure:
        >>> from Bio import PDB
        >>> parser = PDB.PDBParser()
        >>> struct = parser.get_structure('1ATP', '1ATP.pdb')
    Inspect the object hierarchy:
        >>> list(struct)
        >>> model = struct[0]
        >>> list(model)
        >>> chain = model['E']
        >>> list(chain)
        >>> residue = chain[15]
        >>> list(residue)
  • Figure: The "SMCRA" object hierarchy
  • Extracting a peptide sequence
    Get the amino acid sequence through a Polypeptide object:
        >>> from Bio import PDB
        >>> parser = PDB.PDBParser()
        >>> struct = parser.get_structure('1ATP',
        ...                               '1ATP.pdb')
        >>> ppb = PDB.PPBuilder()
        >>> peptides = ppb.build_peptides(struct)
        >>> for pep in peptides:
        ...     print pep.get_sequence()
  • Calculating RMSD
    Given two aligned structures, filter a list of target residues for high RMS deviation.
    Input:
      • list of residue positions (integers)
      • two equivalent chains from aligned protein models; residue numbers must match
      • minimum RMSD value (float)
    Output:
      • list of residue positions, filtered
    Procedure:
    1 Extract coordinates of Cα atoms
    2 If available (not glycine), extract Cβ coordinates, too
    3 Use Bio.SVDSuperimposer to calculate the RMSD between coordinates
    4 Compare to the given RMSD threshold
  • Calculating RMSD (code)
        from Bio.SVDSuperimposer import SVDSuperimposer
        from numpy import array

        def filt_rms(resids, refchain, cmpchain, thresh=0.5):
            # Named "sup" to avoid shadowing the built-in super()
            sup = SVDSuperimposer()
            for res in resids:
                refres = refchain[res]
                cmpres = cmpchain[res]
                coord1 = [refres['CA'].get_coord()]
                coord2 = [cmpres['CA'].get_coord()]
                if refres.has_id('CB') and cmpres.has_id('CB'):
                    # Not glycine
                    coord1.append(refres['CB'].get_coord())
                    coord2.append(cmpres['CB'].get_coord())
                sup.set(array(coord1), array(coord2))
                # RMSD of the coordinates as given (the models are pre-aligned)
                rmsd = sup.get_init_rms()
                if rmsd >= thresh:
                    yield res
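The number being thresholded above is just a root-mean-square distance between paired atoms. The sketch below computes it in plain Python, on the assumption that get_init_rms() reports the deviation of the coordinates as given, with no superposition applied. The function names, the dictionary layout and the coordinates are all made up for the illustration.

```python
from math import sqrt


def rmsd(coords1, coords2):
    """RMS deviation between two equal-length lists of (x, y, z) points.

    No superposition is performed; the points are compared as given.
    """
    if len(coords1) != len(coords2) or not coords1:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(coords1, coords2):
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return sqrt(total / len(coords1))


def filter_by_rmsd(residue_coords, threshold=0.5):
    """Return the positions whose paired coordinates deviate by >= threshold.

    residue_coords maps residue position -> (reference coords, comparison coords).
    """
    return [pos for pos, (ref, other) in sorted(residue_coords.items())
            if rmsd(ref, other) >= threshold]


# Made-up coordinates: residue 10 is identical in both models, residue 11 has moved.
example = {
    10: ([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]),
    11: ([(0.0, 0.0, 0.0)], [(3.0, 4.0, 0.0)]),
}
print(filter_by_rmsd(example))  # [11]
```

With only one or two atoms per residue the value is very sensitive to small shifts, so the 0.5 default is best treated as a starting point rather than a rule.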
  • Figure: Superimposed structures, with selected deviating residues
  • Further reading
      • Biopython tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html
      • Biopython wiki: http://biopython.org/
      • This presentation: http://www.slideshare.net/etalevich/biopython-programming-workshop-at-uga
  • Thanks. 'Preciate it. Gracias.