Biopython programming workshop at UGA
A workshop on bioinformatics programming using Biopython and the Python programming language, held at the University of Georgia in Spring 2010 and 2012. These workshops are part of a series for the Institute of Bioinformatics (IoB) and Bioinformatics Grad Student Association (BIGSA) at UGA.

Biopython programming workshop at UGA Presentation Transcript

  • 1. IOB Workshop: Biopython. A programming toolkit for bioinformatics.
    Eric Talevich, Institute of Bioinformatics, University of Georgia. Mar. 29, 2012.
    Sections: Sequences and alignments; NCBI EUtils and BLAST; Phylogenetics; Protein structures.
  • 2. Getting started with
  • 3. Installing Python
    Biopython is a library for the Python programming language. First, you’ll need these installed:
      - Python 2.7 from http://python.org. It may already be installed on your computer. (Version 2.6 is OK, too.)
      - IDLE, a simple Integrated DeveLopment Environment. Usually bundled with the Python distribution.
    Now, start an interactive session in IDLE. [1]
    [1] On your own, check out IPython (http://ipython.scipy.org/). It’s an enhanced Python interpreter that feels somewhat like R.
  • 4. Installing Python packages
    Biopython is a Python package. There are a few standard ways to install Python packages:
      - From source: Download from PyPI [2], unpack, and install with the included setup.py script.
      - easy_install: Install setuptools from source [3], then use the easy_install command to fetch and install all other packages by name:
          $ easy_install <package name>
      - pip: Like easy_install, use pip [4] to manage packages:
          $ pip install <package name>
    [2] http://pypi.python.org/pypi/
    [3] http://pypi.python.org/pypi/setuptools
    [4] http://pypi.python.org/pypi/pip
  • 5. Installing NumPy, matplotlib and Biopython
    Biopython relies on a few other Python packages for extra functionality. We’ll use these:
      - numpy: efficient numerical functions and data structures (for Bio.PDB)
      - matplotlib: plotting (for Bio.Phylo)
    Then finally:
      - biopython: the reason we’re here today
    (Biopython, NumPy, matplotlib, setuptools and pip are also packaged for many Linux distributions.)
  • 6. Testing
    Check your Biopython installation:
        >>> import Bio
        >>> print Bio.__version__
    Import a NumPy-based component:
        >>> from Bio import PDB
    Show a simple plot:
        >>> from matplotlib import pyplot
        >>> pyplot.plot(range(5), range(5))
        >>> pyplot.show()
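If you would rather check all three packages in one go, a small helper can try each import and report the result. This is only a sketch: the function name check_modules is made up for this example, and it uses nothing beyond the standard library's importlib.

```python
import importlib


def check_modules(names):
    """Return {module name: True/False} depending on whether it imports."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


# The three packages the workshop relies on, by their import names.
for name, ok in sorted(check_modules(["Bio", "numpy", "matplotlib"]).items()):
    print("%-12s %s" % (name, "OK" if ok else "missing"))
```

A package reported as missing needs to be installed with one of the methods from the earlier slide before continuing.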
  • 7. Let’s start using
  • 8. Biopython
    1 Sequences and alignments
        The Seq object
        SeqIO and the SeqRecord object
    2 NCBI EUtils and BLAST
        EUtils: Entrez Programming Utilities
        NCBI Blast
        External programs
    3 Phylogenetics
    4 Protein structures
  • 9. Sequences and Alignments
  • 10. The Seq object
        >>> from Bio.Seq import Seq
        >>> myseq = Seq('AGTACACTGGT')
        >>> myseq
        Seq('AGTACACTGGT', Alphabet())
        >>> print myseq
        AGTACACTGGT
        >>> myseq.transcribe()
        Seq('AGUACACUGGU', RNAAlphabet())
        >>> myseq.translate()
        Seq('STL', ExtendedIUPACProtein())
  • 11. A Seq object consists of:
      - data: the underlying Python character string
      - alphabet: DNA, RNA, protein, etc.
    It supports most Python string methods:
        >>> myseq.count('GT')
        2
    And some biology-specific methods, too:
        >>> myseq.reverse_complement()
        Seq('ACCAGTGTACT', Alphabet())
    Intrigued? Read on:
        >>> help(Seq)
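To see what the biology-specific methods compute, here is a plain-Python sketch that works on ordinary strings rather than Seq objects. The helper names are invented for this illustration and handle only the unambiguous letters A, C, G and T; the real Seq methods also cope with alphabets and ambiguity codes.

```python
# Plain-Python stand-ins for three Seq operations (DNA letters A, C, G, T only).
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(dna):
    """Complement each base while walking the string backwards."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def transcribe(dna):
    """Rewrite the coding strand as RNA by swapping T for U."""
    return dna.replace("T", "U")


myseq = "AGTACACTGGT"
print(myseq.count("GT"))           # non-overlapping occurrences of 'GT'
print(reverse_complement(myseq))
print(transcribe(myseq))
```

The results agree with the Seq output shown on the slides for the same sequence.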
  • 12. SeqIO: Sequence Input/Output
    Sequence data is stored in many different file formats. Bio.SeqIO supports:
        abi      fastq    phylip     swiss
        ace      genbank  pir        tab
        clustal  ig       qual       uniprot-xml
        embl     imgt     seqxml
        emboss   nexus    sff
        fasta    phd      stockholm
    Manually fetch some data from the PDB website [5]:
      - 1ATP.fasta: two protein sequences, FASTA format
      - 1ATP.pdb: the 3D structure, for later
    [5] http://www.rcsb.org/pdb/explore/explore.do?structureId=1ATP
  • 13. The SeqIO API
    SeqIO provides four functions:
      - parse: Iteratively parse all elements in the file
      - read: Parse a one-element file and return the element
      - write: Write elements to a file
      - convert: Parse one format and immediately write another
    Biopython uses the same I/O conventions for alignments (AlignIO), BLAST results (Blast), and phylogenetic trees (Phylo).
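The difference between parse and read is easiest to see on a toy parser. The sketch below is not how Bio.SeqIO is implemented; parse_fasta and read_fasta are hypothetical helpers that take a list of text lines and return plain (header, sequence) tuples instead of SeqRecords.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs one record at a time."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_fasta(lines):
    """Return the single record in the input; complain if there is not exactly one."""
    records = list(parse_fasta(lines))
    if len(records) != 1:
        raise ValueError("expected exactly 1 record, found %d" % len(records))
    return records[0]


example = [">seq1 first", "ACGT", "TTAA", ">seq2 second", "GGCC"]
for header, seq in parse_fasta(example):
    print("%s -> %s" % (header, seq))
```

Because parse_fasta is a generator, it never holds more than one record in memory, which is the reason to prefer the iterative style for large files.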
  • 15. The SeqRecord object
    SeqIO.parse returns an iterator of SeqRecords. SeqRecord wraps a Seq object and attaches metadata.
    1 Pass the file name to the SeqIO parser; specify FASTA format:
        from Bio import SeqIO
        seqrecs = SeqIO.parse("1ATP.fasta", "fasta")
        print seqrecs
    2 To see all records at once, convert the iterator to a list:
        allrecs = list(seqrecs)
        print allrecs[0]
        print allrecs[0].seq
  • 18. Example: Shuffled sequences
    Given a real DNA sequence, create a “background” set of randomized sequences with the same composition.
    Procedure:
    1 Read the source sequence from a file (use Bio.SeqIO)
    2 In a loop:
        - Shuffle the sequence (use random.shuffle from Python’s standard library)
        - Create a new SeqRecord from the shuffled sequence (because SeqIO.write works with SeqRecords)
    3 Write the shuffled SeqRecords to another file
  • 19.
        import random
        from Bio import SeqIO
        from Bio.Seq import Seq
        from Bio.SeqRecord import SeqRecord
        # Read the source sequence and remember its alphabet
        orig_rec = SeqIO.read("gi2.gb", "genbank")
        alphabet = orig_rec.seq.alphabet
        out_recs = []
        # Make 30 shuffled copies, numbered 1 to 30
        for i in xrange(1, 31):
            nucleotides = list(orig_rec.seq)
            random.shuffle(nucleotides)
            new_seq = Seq("".join(nucleotides), alphabet)
            new_rec = SeqRecord(new_seq, id="shuffle" + str(i))
            out_recs.append(new_rec)
        SeqIO.write(out_recs, "gi2_shuffled.fasta", "fasta")
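The shuffling step itself needs nothing from Biopython. The sketch below isolates it on a plain string and checks that every shuffled copy keeps the original composition. The function name, the seed parameter and the example sequence are all made up for this illustration.

```python
import random
from collections import Counter


def shuffled_copies(sequence, n, seed=None):
    """Return n shuffled versions of sequence, each with the same letters."""
    rng = random.Random(seed)   # private generator, so a seed gives repeatable output
    copies = []
    for _ in range(n):
        letters = list(sequence)
        rng.shuffle(letters)
        copies.append("".join(letters))
    return copies


original = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"   # arbitrary example sequence
background = shuffled_copies(original, 30, seed=42)
print(len(background))
print(all(Counter(s) == Counter(original) for s in background))
```

Passing a seed is optional; it is there so that a background set can be regenerated exactly when needed.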
  • 20. Example: ORF translation
    Split a set of unannotated DNA sequences into unique ORFs, translating in all 6 frames. Biopython can help with each piece of this problem:
    1 Parse the given unannotated DNA sequences (SeqIO.parse)
    2 Get the template strand’s sequence (Seq.reverse_complement)
    3 Translate both strands into protein sequences (Seq.translate)
    4 Shift each strand by +1 and +2 for alternate reading frames (string-like Seq slicing)
    5 Split sequences at stop codons (Seq.split('*'))
    6 Write translated sequences to a new file (SeqIO.write)
  • 21.
        def translate_six_frames(seq, table=1):
            """Translate a nucleotide sequence in 6 frames.

            Returns an iterable of 6 translated protein sequences.
            """
            rev = seq.reverse_complement()
            for i in range(3):
                # Coding (Crick) strand
                yield seq[i:].translate(table)
                # Template (Watson) strand
                yield rev[i:].translate(table)
  • 22.
        def translate_orfs(sequences, min_prot_len=60):
            """Find and translate all ORFs in sequences.

            Translates each sequence in all 6 reading frames, splits sequences on
            stop codons, and produces an iterable of all protein sequences of
            length at least min_prot_len.
            """
            for seq in sequences:
                for frame in translate_six_frames(seq):
                    for prot in frame.split("*"):
                        if len(prot) >= min_prot_len:
                            yield prot
  • 23.
        from Bio import SeqIO
        from Bio.SeqRecord import SeqRecord
        if __name__ == "__main__":
            import sys
            infile = sys.stdin
            outfile = sys.stdout
            records = SeqIO.parse(infile, "fasta")
            seqs = (rec.seq for rec in records)
            proteins = translate_orfs(seqs)
            # Wrap each translated ORF in a SeqRecord so SeqIO.write accepts it
            seqrecs = (SeqRecord(seq, id="orf" + str(i))
                       for i, seq in enumerate(proteins))
            SeqIO.write(seqrecs, outfile, "fasta")
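For readers without Biopython at hand, the same pipeline can be sketched on plain strings. Everything here is a stand-in: the codon table is the standard genetic code written out by hand, the function names are invented, and only the letters A, C, G and T are handled. Unlike the slide code, this version yields the three forward frames first and then the three reverse frames.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order map onto AMINO_ACIDS.
CODON_TABLE = dict(zip(("".join(c) for c in product(BASES, repeat=3)), AMINO_ACIDS))
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def translate(dna):
    """Translate complete codons only; a trailing partial codon is ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def six_frames(dna):
    """Yield the three forward-strand frames, then the three reverse-strand frames."""
    rev = "".join(COMPLEMENT[b] for b in reversed(dna))
    for strand in (dna, rev):
        for offset in range(3):
            yield translate(strand[offset:])


def orfs(dna, min_len=2):
    """Split every frame at stop codons ('*') and keep pieces of at least min_len."""
    for frame in six_frames(dna):
        for prot in frame.split("*"):
            if len(prot) >= min_len:
                yield prot


example = "ATGAAATAGATGCCC"
print(list(six_frames(example)))
print(list(orfs(example)))
```

The default min_len is tiny so that the short example produces output; the slide code uses a minimum of 60 residues, which is a more realistic cutoff for real data.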
  • 24. AlignIO and the Alignment object
    Alignment: a set of sequences with the same length and alphabet.
    Use AlignIO just like SeqIO:
        >>> from Bio import AlignIO
        >>> aln = AlignIO.read("PF01601.sto", "stockholm")
        >>> print aln
        SingleLetterAlphabet() alignment with 22 rows and 730 columns
        NCTDAV-----LTYSSFGVCADGSIIA-VQPRNV-----SYDSV...HIQ Q1HVL3_CVH22/539-1170
        NCTTAV-----MTYSNFGICADGSLIP-VRPRNS-----SDNGI...HVQ SPIKE_CVHNL/723-1356
        NCTEPV-----LVYSNIGVCKSGSIGY-VPSQS------GQVKI...HVQ Q692M1_9CORO/740-1383
        NCTEPA-----LVYSNIGVCKNGAIGL-VGIRN------TQPKI...HIQ Q0Q4F4_9CORO/729-1360
        NCTSPR-----LVYSNIGVCTSGAIGL-LSPKX------AQPQI...HVQ Q0Q4F6_9CORO/743-1371
        NCTNPV-----LTYSSYGVCPDGSITR-LGLTD------VQPHF...--T A4ULL0_9CORO/726-1328
        NCTKPV-----LSYGPISVCSDGAIAG-TSTLQN-----TRPSI...KEW A6N263_9CORO/406-1035
        ECDIPIGAGICASYHTVSLLRSTSQKSIVAYTMS------LGAD...HYT Q6T7X8_CVHSA/647-1255
        ...
        DCE-PV-----ITYSNIGVCKNGAFVF-INVTH------SDGDV...HVH Q0PKZ5_CVPPU/797-1449
  • 25. Snack Time
  • 26. EUtils and BLAST
  • 29. EUtils: Entrez Programming Utilities
    Access NCBI’s online services:
        from Bio import Entrez, SeqIO
        Entrez.email = "you@uga.edu"
    Request a GenBank record:
        handle = Entrez.efetch(db="protein", id="69316", rettype="gb", retmode="text")
        record = SeqIO.read(handle, "gb")
    Specify multiple IDs in one query:
        handle = Entrez.efetch(db="protein", id="349839,349840", rettype="fasta", retmode="text")
        records = SeqIO.parse(handle, "fasta")
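EUtils is a plain web service, so an efetch call comes down to a URL with query parameters. The sketch below only builds such a URL and does not contact NCBI. The helper efetch_url is hypothetical, and treating it as a picture of what Bio.Entrez does internally is an assumption, not something the slides state.

```python
try:                                   # Python 3
    from urllib.parse import urlencode
except ImportError:                    # Python 2
    from urllib import urlencode

# Base address of NCBI's EFetch utility.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(db, ids, rettype, retmode="text", email=None):
    """Build (but do not open) an EFetch request URL for a list of IDs."""
    params = [("db", db), ("id", ",".join(ids)),
              ("rettype", rettype), ("retmode", retmode)]
    if email:
        params.append(("email", email))
    return EFETCH_URL + "?" + urlencode(params)


url = efetch_url("protein", ["349839", "349840"], "fasta", email="you@uga.edu")
print(url)
```

In practice Entrez.efetch is the better choice, since it opens the connection and hands back a file-like handle ready for SeqIO.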
  • 30. Interlude: SeqRecord attributes
      - seq: the sequence (Seq) itself
      - id: primary ID for the sequence, e.g. accession number (string)
      - name: “common” name/id for the sequence, like GenBank LOCUS id
      - description: human-readable description of the sequence
      - letter_annotations: restricted dictionary of additional info about individual letters in the sequence, e.g. quality scores
      - annotations: dictionary of additional unstructured info
      - features: list of SeqFeature objects with more structured information, e.g. position of genes on a genome, domains on a protein sequence
      - dbxrefs: list of database cross-references (strings)
  • 31.
        from Bio import Entrez, SeqIO
        Entrez.email = "me@uga.edu"
        handle = Entrez.efetch(db="nucleotide", id="M95169",
                               rettype="gb", retmode="text")
        record = SeqIO.read(handle, "genbank")
        handle.close()
        print record
        print record.features[10]
        sliced = record[20000:]  # Last ~25% of the genome
        print sliced
        from Bio.Seq import Seq
        from Bio.Alphabet import generic_protein
        translations = [f.qualifiers["translation"]
                        for f in record.features[1:]]
        proteins = [Seq(t[0], generic_protein) for t in translations]
  • 32. NCBI Blast
    BLAST can be used either standalone or through NCBI’s server.
    Online:
        >>> from Bio.Blast import NCBIWWW
        >>> result_handle = NCBIWWW.qblast('blastp', 'nr', query_string)
    Standalone:
      “Legacy” (blastall):
        >>> from Bio.Blast.Applications import BlastallCommandline
        >>> help(BlastallCommandline)
      New hotness (Blast+):
        >>> from Bio.Blast.Applications import NcbiblastpCommandline
        >>> help(NcbiblastpCommandline)
  • 33. Parsing BLAST output
    BLAST produces reports in plain-text and XML format. Biopython requests XML by default.
        >>> from Bio.Blast import NCBIWWW, NCBIXML
        >>> result_handle = NCBIWWW.qblast('blastp',
        ...                                'nr', query_string)
        >>> blast_record = NCBIXML.read(result_handle)
        >>> print blast_record
  • 34.
        # Search for homologs of a protein sequence
        from Bio import SeqIO
        from Bio.Blast import NCBIWWW, NCBIXML
        # Read and reformat the query sequence
        seqrec = SeqIO.read('gi2.gb', 'gb')
        query = seqrec.format('fasta')
        # Submit an online BLAST query
        # (This takes some time to run)
        result_handle = NCBIWWW.qblast('blastx', 'nr', query)
  • 35.
        # 1. Save the BLAST results as an XML file
        with open('aprotinin_blast.xml', 'w') as save_file:
            save_file.write(result_handle.read())
        result_handle.close()
        # NB: The BLAST result handle can only be read once
        # Reload it from the file
        with open('aprotinin_blast.xml') as result_handle:
            blast_record = NCBIXML.read(result_handle)
  • 36.
        # 2. Display a histogram of BLAST hit scores
        def get_scores(alignments):
            for aln in alignments:
                for hsp in aln.hsps:
                    yield hsp.score
        scores = list(get_scores(blast_record.alignments))
        # Draw the histogram
        import pylab
        pylab.hist(scores, bins=20)
        pylab.title("Scores of %d BLAST hits" % len(scores))
        pylab.xlabel("BLAST score")
        pylab.ylabel("# hits")
        # Save a copy for later (before show(), which can leave an empty figure behind)
        pylab.savefig('aprotinin_scores.png')
        pylab.show()
  • 37. Figure: Histogram of BLAST scores generated by pylab
  • 38.
        # 3. Extract the sequences of high-scoring BLAST hits
        from Bio import SeqIO
        from Bio.Seq import Seq
        from Bio.SeqRecord import SeqRecord
        def get_hsps(alignments, threshold):
            for aln in alignments:
                for hsp in aln.hsps:
                    if hsp.score >= threshold:
                        yield SeqRecord(Seq(hsp.sbjct), id=aln.accession)
                        # Keep only the first qualifying HSP per hit
                        break
        best_seqs = get_hsps(blast_record.alignments, 321)
        SeqIO.write(best_seqs, 'aprotinin.fasta', 'fasta')
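The two generators in the BLAST example can be tried without running a search. The sketch below swaps the parsed BLAST record for hypothetical namedtuple stand-ins that carry only the attributes the slides use (accession, hsps, score, sbjct). The accessions, scores and sequences are made up.

```python
from collections import namedtuple

# Hypothetical stand-ins carrying only the attributes the slides use.
Hsp = namedtuple("Hsp", ["score", "sbjct"])
Alignment = namedtuple("Alignment", ["accession", "hsps"])


def get_scores(alignments):
    """Yield the score of every HSP in every hit."""
    for aln in alignments:
        for hsp in aln.hsps:
            yield hsp.score


def best_hits(alignments, threshold):
    """Yield (accession, subject sequence) for the first HSP per hit that passes."""
    for aln in alignments:
        for hsp in aln.hsps:
            if hsp.score >= threshold:
                yield aln.accession, hsp.sbjct
                break   # at most one sequence per database hit


# Made-up data: three hits, two of which have an HSP scoring at least 321.
alignments = [
    Alignment("ACC1", [Hsp(400.0, "RPDFCLEPPYTGPCK"), Hsp(350.0, "RPDFCLEPPY")]),
    Alignment("ACC2", [Hsp(120.0, "KPDFCFLEEDPG")]),
    Alignment("ACC3", [Hsp(90.0, "AAA"), Hsp(330.0, "RPAFCLEPPYAGPGK")]),
]
print(list(get_scores(alignments)))
print(list(best_hits(alignments, 321)))
```

The break is what keeps the output file free of duplicate IDs: a hit with several high-scoring HSPs still contributes a single sequence.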
  • 39. Calling other external programs
    Biopython has wrappers for other command-line programs in:
      - Bio.Blast.Applications: the Blast+ suite
      - Bio.Align.Applications: Muscle, ClustalW, ...
      - Bio.Emboss.Applications: needle, water, ...
    Let’s re-align our BLAST results using Muscle, and format the alignment for use with stand-alone Phylip.
  • 40.
        from Bio import AlignIO
        from Bio.Align.Applications import MuscleCommandline
        from StringIO import StringIO
        # Construct the shell command
        muscle_cmd = MuscleCommandline(input="aprotinin.fasta")
        # Execute the command
        # Get output (the alignment) and any error messages
        muscle_out, muscle_err = muscle_cmd()
        # Read the alignment back in
        align = AlignIO.read(StringIO(muscle_out), "fasta")
        # Format the alignment for Phylip
        AlignIO.write([align], 'aprotinin.phy', 'phylip')
  • 41. Phylogenetics
  • 42. Phylogenetic tree I/O
    Start with:
        >>> from Bio import Phylo
    Input and output of trees is just like SeqIO:
      - read, parse: single or multiple trees in Newick, Nexus and PhyloXML formats
      - write: to any of the formats supported by read/parse
      - convert: between two formats in one step
    Use StringIO to load strings directly:
        >>> from cStringIO import StringIO
        >>> handle = StringIO("((A,B),(C,(D,E)));")
        >>> tree = Phylo.read(handle, "newick")
  • 43. What’s in a tree?
    Make a tree with branch lengths:
        >>> tree = Phylo.read(StringIO("((A:1,B:1):2,"
        ...                            "(C:2,(D:1,E:1):1):1);"), "newick")
    View the object structure of the entire tree:
        >>> print tree
    Draw an “ASCII-art” (plain text) representation:
        >>> Phylo.draw_ascii(tree)
    ... OK, let’s do it properly now:
        >>> Phylo.draw(tree)
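To get a feel for what a Newick string encodes, here is a deliberately small recursive parser written for this example. It is not Bio.Phylo's parser: it handles only names, branch lengths and nesting, with no whitespace, quoting or comments, and it returns nested dictionaries rather than Tree and Clade objects.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts (name, length, children)."""
    text = text.strip().rstrip(";")
    pos = [0]   # current read position, in a list so the inner function can update it

    def parse_clade():
        children = []
        if text[pos[0]] == "(":
            pos[0] += 1                      # skip "("
            children.append(parse_clade())
            while text[pos[0]] == ",":
                pos[0] += 1                  # skip ","
                children.append(parse_clade())
            pos[0] += 1                      # skip ")"
        start = pos[0]
        while pos[0] < len(text) and text[pos[0]] not in ",():":
            pos[0] += 1
        name = text[start:pos[0]] or None
        length = None
        if pos[0] < len(text) and text[pos[0]] == ":":
            pos[0] += 1
            start = pos[0]
            while pos[0] < len(text) and text[pos[0]] not in ",()":
                pos[0] += 1
            length = float(text[start:pos[0]])
        return {"name": name, "length": length, "children": children}

    return parse_clade()


def terminals(clade):
    """List the names of the leaves below a clade, left to right."""
    if not clade["children"]:
        return [clade["name"]]
    names = []
    for child in clade["children"]:
        names.extend(terminals(child))
    return names


def depths(clade, so_far=0.0):
    """Map each leaf name to its summed branch length from the root."""
    here = so_far + (clade["length"] or 0.0)
    if not clade["children"]:
        return {clade["name"]: here}
    result = {}
    for child in clade["children"]:
        result.update(depths(child, here))
    return result


tree = parse_newick("((A:1,B:1):2,(C:2,(D:1,E:1):1):1);")
print(terminals(tree))
print(depths(tree))
```

Every leaf of this example tree sits at the same distance from the root, which is why the drawn tree has all of its tips lined up.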
  • 44. Modify the tree
    Check the tree object for its methods:
        >>> help(tree)
    Try a few:
        >>> tree.get_terminals()
        >>> clade = tree.common_ancestor("A", "B")
        >>> clade.color = "red"
        >>> tree.root_with_outgroup("D", "E")
        >>> tree.ladderize()
        >>> Phylo.draw(tree)
  • 45. External applications
    Biopython wraps a number of external programs for phylogenetics. We’re not going to use them now, but here’s where to find them:
      - Bio.Phylo.PAML: PAML wrappers & helpers
      - Bio.Phylo.Applications: command-line wrapper for PhyML (PhymlCommandline); RAxML and others on the way. (Anything you’d like to see sooner?)
      - Bio.Emboss.Applications: other tools ported via Embassy, including Phylip
  • 46. Protein structures
  • 48. Going 3D: The PDB module
    Load a structure:
        >>> from Bio import PDB
        >>> parser = PDB.PDBParser()
        >>> struct = parser.get_structure('1ATP', '1ATP.pdb')
    Inspect the object hierarchy:
        >>> list(struct)
        >>> model = struct[0]
        >>> list(model)
        >>> chain = model['E']
        >>> list(chain)
        >>> residue = chain[15]
        >>> list(residue)
  • 49. Figure: The “SMCRA” object hierarchy
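The chain, residue and atom levels of that hierarchy map naturally onto nested dictionaries. The sketch below builds such a structure from the fixed-width ATOM records of the PDB format. It is a teaching aid, not Bio.PDB: the helper parse_atoms is invented, it skips the structure and model levels, and the coordinates and residue names are made up rather than taken from 1ATP.

```python
# Layout of the fields used here (1-based columns in the PDB format):
# atom name 13-16, residue name 18-20, chain 22, residue number 23-26,
# x 31-38, y 39-46, z 47-54.
ATOM_FORMAT = "ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f"


def parse_atoms(lines):
    """Build {chain id: {residue number: {atom name: (x, y, z)}}} from ATOM records."""
    chains = {}
    for line in lines:
        if not line.startswith("ATOM"):
            continue
        atom_name = line[12:16].strip()
        chain_id = line[21]
        res_num = int(line[22:26])
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        chains.setdefault(chain_id, {}).setdefault(res_num, {})[atom_name] = xyz
    return chains


# Made-up coordinates, formatted into the fixed PDB columns.
lines = [
    ATOM_FORMAT % (1, "N", "VAL", "E", 15, 1.0, 2.0, 3.0),
    ATOM_FORMAT % (2, "CA", "VAL", "E", 15, 1.5, 2.5, 3.5),
    ATOM_FORMAT % (3, "CA", "LYS", "E", 16, 4.0, 5.0, 6.0),
    ATOM_FORMAT % (4, "CA", "THR", "I", 5, 7.0, 8.0, 9.0),
]
model = parse_atoms(lines)
print(sorted(model))              # chain IDs
print(sorted(model["E"][15]))     # atom names in residue 15 of chain E
print(model["E"][15]["CA"])
```

Indexing reads the same way as in the Biopython session: chain first, then residue number, then atom name.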
  • 50. Extracting a peptide sequence
    Get the amino acid sequence through a Polypeptide object:
        >>> from Bio import PDB
        >>> parser = PDB.PDBParser()
        >>> struct = parser.get_structure('1ATP',
        ...                               '1ATP.pdb')
        >>> ppb = PDB.PPBuilder()
        >>> peptides = ppb.build_peptides(struct)
        >>> for pep in peptides:
        ...     print pep.get_sequence()
  • 51. Calculating RMSD
    Given two aligned structures, filter a list of target residues for high RMS deviation.
    Input:
      - list of residue positions (integers)
      - two equivalent chains from aligned protein models (residue numbers must match)
      - minimum RMSD value (float)
    Output:
      - list of residue positions, filtered
    Procedure:
    1 Extract coordinates of Cα atoms
    2 If available (not glycine), extract Cβ coordinates, too
    3 Use Bio.SVDSuperimposer to calculate the RMSD between coordinates
    4 Compare to the given RMSD threshold
  • 52.
        from Bio.SVDSuperimposer import SVDSuperimposer
        from numpy import array
        def filt_rms(resids, refchain, cmpchain, threshold=0.5):
            sup = SVDSuperimposer()
            for res in resids:
                refres = refchain[res]
                cmpres = cmpchain[res]
                coord1 = [refres['CA'].get_coord()]
                coord2 = [cmpres['CA'].get_coord()]
                if refres.has_id('CB') and cmpres.has_id('CB'):
                    # Not glycine
                    coord1.append(refres['CB'].get_coord())
                    coord2.append(cmpres['CB'].get_coord())
                sup.set(array(coord1), array(coord2))
                # RMSD of the coordinates as given (structures are already aligned)
                rmsd = sup.get_init_rms()
                if rmsd >= threshold:
                    yield res
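Because the two structures are already aligned, the quantity being thresholded is just the root-mean-square distance between matching atoms, with no superposition step. A plain-Python version of that formula is sketched below. The function is written for this example, and the assumption is that get_init_rms reports this same un-superimposed value.

```python
from math import sqrt


def rmsd(coords1, coords2):
    """Root-mean-square deviation between two equal-length lists of (x, y, z)."""
    if len(coords1) != len(coords2) or not coords1:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(coords1, coords2):
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return sqrt(total / len(coords1))


# Two atoms per residue (think CA and CB); only the second atom has moved, by 2 units.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
compared = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(rmsd(reference, compared))
```

With only one or two atoms per residue, a single displaced side-chain atom is enough to push a residue over the default threshold of 0.5.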
  • 53. Figure: Superimposed structures, with selected deviating residues
  • 54. Further reading
      - Biopython tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html
      - Biopython wiki: http://biopython.org/
      - This presentation: http://www.slideshare.net/etalevich/biopython-programming-workshop-at-uga
  • 55. Thanks. ’Preciate it. Gracias.