Biopython programming workshop at UGA

A workshop on bioinformatics programming using Biopython and the Python programming language, held at the University of Georgia in Spring 2010 and 2012. These workshops are part of a series for the Institute of Bioinformatics (IoB) and Bioinformatics Grad Student Association (BIGSA) at UGA.

Published in: Technology, Education

  1. IOB Workshop: Biopython
     A programming toolkit for bioinformatics
     Eric Talevich, Institute of Bioinformatics, University of Georgia
     Mar. 29, 2012
     Sections: Sequences and alignments; NCBI EUtils and BLAST; Phylogenetics; Protein structures
  2. Getting started with Biopython
  3. Installing Python
     Biopython is a library for the Python programming language. First, you’ll need these installed:
       - Python 2.7 from http://python.org. It may already be installed on your computer. (Version 2.6 is OK, too.)
       - IDLE, a simple Integrated DeveLopment Environment. Usually bundled with the Python distribution.
     Now, start an interactive session in IDLE. [1]
     [1] On your own, check out IPython (http://ipython.scipy.org/). It’s an enhanced Python interpreter that feels somewhat like R.
  4. Installing Python packages
     Biopython is a Python package. There are a few standard ways to install Python packages:
       - From source: Download from PyPI [2], unpack, and install with the included setup.py script.
       - easy_install: Install setuptools [3] from source, then use the easy_install command to fetch and install all other packages by name:
             $ easy_install <package name>
       - pip: Like easy_install, use pip [4] to manage packages:
             $ pip install <package name>
     [2] http://pypi.python.org/pypi/
     [3] http://pypi.python.org/pypi/setuptools
     [4] http://pypi.python.org/pypi/pip
  5. Installing NumPy, matplotlib and Biopython
     Biopython relies on a few other Python packages for extra functionality. We’ll use these:
       - numpy: efficient numerical functions and data structures (for Bio.PDB)
       - matplotlib: plotting (for Bio.Phylo)
     Then finally:
       - biopython: the reason we’re here today
     (Biopython, NumPy, matplotlib, setuptools and pip are also packaged for many Linux distributions.)
  6. Testing
     Check your Biopython installation:
         >>> import Bio
         >>> print Bio.__version__
     Import a NumPy-based component:
         >>> from Bio import PDB
     Show a simple plot:
         >>> from matplotlib import pyplot
         >>> pyplot.plot(range(5), range(5))
         >>> pyplot.show()
  7. Let’s start using Biopython
  8. Outline
       1. Sequences and alignments
            - The Seq object
            - SeqIO and the SeqRecord object
       2. NCBI EUtils and BLAST
            - EUtils: Entrez Programming Utilities
            - NCBI Blast
            - External programs
       3. Phylogenetics
       4. Protein structures
  9. Sequences and Alignments
  10. The Seq object
          >>> from Bio.Seq import Seq
          >>> myseq = Seq('AGTACACTGGT')
          >>> myseq
          Seq('AGTACACTGGT', Alphabet())
          >>> print myseq
          AGTACACTGGT
          >>> myseq.transcribe()
          Seq('AGUACACUGGU', RNAAlphabet())
          >>> myseq.translate()
          Seq('STL', ExtendedIUPACProtein())
  11. A Seq object consists of:
        - data: the underlying Python character string
        - alphabet: DNA, RNA, protein, etc.
      It supports most Python string methods:
          >>> myseq.count('GT')
          2
      And some biology-specific methods, too:
          >>> myseq.reverse_complement()
          Seq('ACCAGTGTACT', Alphabet())
      Intrigued? Read on:
          >>> help(Seq)
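To make the biology-specific methods above less of a black box, here is a minimal plain-Python sketch of what transcription and reverse complementation compute on an unambiguous DNA string. It does not use Biopython; `COMPLEMENT`, `reverse_complement` and `transcribe` are illustrative names, not the library's implementation, and ambiguity codes are not handled.

```python
# Plain-Python illustration only; these helpers are hypothetical, not Biopython API.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(dna):
    """Return the reverse complement of an unambiguous, upper-case DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def transcribe(dna):
    """Return the RNA version of a coding-strand DNA string (T becomes U)."""
    return dna.replace("T", "U")


myseq = "AGTACACTGGT"              # same letters as the slide's example
rev = reverse_complement(myseq)
rna = transcribe(myseq)
n_gt = myseq.count("GT")           # non-overlapping count, exactly like str.count
```

Applying `reverse_complement` twice returns the original string, which is a handy sanity check for any implementation.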
  12. SeqIO: Sequence Input/Output
      Sequence data is stored in many different file formats. Bio.SeqIO supports:
          abi, ace, clustal, embl, emboss, fasta, fastq, genbank, ig, imgt, nexus, phd, phylip, pir, qual, seqxml, sff, stockholm, swiss, tab, uniprot-xml
      Manually fetch some data from the PDB website [5]:
        - 1ATP.fasta: two protein sequences, FASTA format
        - 1ATP.pdb: the 3D structure, for later
      [5] http://www.rcsb.org/pdb/explore/explore.do?structureId=1ATP
  13. The SeqIO API
      SeqIO provides four functions:
        - parse: Iteratively parse all elements in the file
        - read: Parse a one-element file and return the element
        - write: Write elements to a file
        - convert: Parse one format and immediately write another
      Biopython uses the same I/O conventions for alignments (AlignIO), BLAST results (Blast), and phylogenetic trees (Phylo).
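The difference between parse (lazy, many records) and read (exactly one record) is easier to see in a toy version. The sketch below is a standard-library stand-in for the FASTA case only; `parse_fasta` and `read_fasta` are hypothetical helpers that return plain tuples rather than SeqRecord objects, and they are not how Bio.SeqIO is implemented.

```python
import io


def parse_fasta(handle):
    """Yield (header, sequence) pairs one at a time from a FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_fasta(handle):
    """Return the single record in a handle; raise if there are zero or several."""
    records = list(parse_fasta(handle))
    if len(records) != 1:
        raise ValueError("expected exactly one record, found %d" % len(records))
    return records[0]


two_records = u">seq1 first\nACGT\nAC\n>seq2 second\nGGTT\n"
parsed = list(parse_fasta(io.StringIO(two_records)))
single = read_fasta(io.StringIO(u">only\nACGT\n"))
```

Because `parse_fasta` is a generator, nothing is read until it is iterated, which mirrors why the slides wrap the parser in `list(...)` to see all records at once.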
  15. The SeqRecord object
      SeqIO.parse returns SeqRecords. SeqRecord wraps a Seq object and attaches metadata.
      1. Pass the file name to the SeqIO parser; specify FASTA format:
             from Bio import SeqIO
             seqrecs = SeqIO.parse("1ATP.fasta", "fasta")
             print seqrecs
      2. To see all records at once, convert the iterator to a list:
             allrecs = list(seqrecs)
             print allrecs[0]
             print allrecs[0].seq
  18. Example: Shuffled sequences
      Given a real DNA sequence, create a “background” set of randomized sequences with the same composition.
      Procedure:
      1. Read the source sequence from a file (use Bio.SeqIO)
      2. In a loop:
           - Shuffle the sequence (use random.shuffle from Python’s standard library)
           - Create a new SeqRecord from the shuffled sequence (because SeqIO.write works with SeqRecords)
      3. Write the shuffled SeqRecords to another file
  19. Shuffled sequences: the script
          import random
          from Bio import SeqIO
          from Bio.Seq import Seq
          from Bio.SeqRecord import SeqRecord

          # Read the single source record and remember its alphabet
          orig_rec = SeqIO.read("gi2.gb", "genbank")
          alphabet = orig_rec.seq.alphabet
          out_recs = []
          for i in xrange(1, 31):
              # Shuffle a fresh copy of the letters on each pass
              nucleotides = list(orig_rec.seq)
              random.shuffle(nucleotides)
              new_seq = Seq("".join(nucleotides), alphabet)
              new_rec = SeqRecord(new_seq, id="shuffle" + str(i))
              out_recs.append(new_rec)
          # Write all 30 shuffled records to one FASTA file
          SeqIO.write(out_recs, "gi2_shuffled.fasta", "fasta")
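The claim that shuffling keeps "the same composition" can be checked directly by comparing letter counts. This is a small standard-library sketch on plain strings; `shuffled_copies` and its `seed` parameter are hypothetical additions for reproducibility and are not part of the workshop script.

```python
import random
from collections import Counter


def shuffled_copies(sequence, n_copies, seed=None):
    """Return n_copies shuffled versions of sequence, each with the same letters."""
    rng = random.Random(seed)      # private generator so a seed gives repeatable output
    copies = []
    for _ in range(n_copies):
        letters = list(sequence)
        rng.shuffle(letters)       # in-place shuffle, as in the slide
        copies.append("".join(letters))
    return copies


source = "AGTACACTGGT" * 3
copies = shuffled_copies(source, 30, seed=0)
same_composition = all(Counter(c) == Counter(source) for c in copies)
```

Shuffling single letters preserves mononucleotide composition only; dinucleotide frequencies will generally change.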
  20. Example: ORF translation
      Split a set of unannotated DNA sequences into unique ORFs, translating in all 6 frames.
      Biopython can help with each piece of this problem:
      1. Parse the given unannotated DNA sequences (SeqIO.parse)
      2. Get the template strand’s sequence (Seq.reverse_complement)
      3. Translate both strands into protein sequences (Seq.translate)
      4. Shift each strand by +1 and +2 for alternate reading frames (string-like Seq slicing)
      5. Split sequences at stop codons (Seq.split('*'))
      6. Write translated sequences to a new file (SeqIO.write)
  21. ORF translation: six reading frames
          def translate_six_frames(seq, table=1):
              """Translate a nucleotide sequence in 6 frames.

              Returns an iterable of 6 translated protein sequences.
              """
              rev = seq.reverse_complement()
              for i in range(3):
                  # Coding (Crick) strand
                  yield seq[i:].translate(table)
                  # Template (Watson) strand
                  yield rev[i:].translate(table)
  22. ORF translation: splitting at stop codons
          def translate_orfs(sequences, min_prot_len=60):
              """Find and translate all ORFs in sequences.

              Translates each sequence in all 6 reading frames, splits
              sequences on stop codons, and produces an iterable of all
              protein sequences of length at least min_prot_len.
              """
              for seq in sequences:
                  for frame in translate_six_frames(seq):
                      for prot in frame.split("*"):
                          if len(prot) >= min_prot_len:
                              yield prot
  23. ORF translation: the command-line script
          from Bio import SeqIO
          from Bio.SeqRecord import SeqRecord

          if __name__ == "__main__":
              import sys
              # Read FASTA from standard input, write FASTA to standard output
              infile = sys.stdin
              outfile = sys.stdout
              records = SeqIO.parse(infile, "fasta")
              seqs = (rec.seq for rec in records)
              proteins = translate_orfs(seqs)
              # Wrap each translated ORF in a SeqRecord with a numbered id
              seqrecs = (SeqRecord(seq, id="orf" + str(i))
                         for i, seq in enumerate(proteins))
              SeqIO.write(seqrecs, outfile, "fasta")
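For readers who want to trace the frame-shifting and stop-codon splitting by hand, the following is a plain-Python sketch of the same idea on ordinary strings. It assumes the standard genetic code (NCBI table 1) written in the conventional TCAG ordering and unambiguous upper-case DNA; `CODON_TABLE`, `translate`, `six_frames` and `orfs` are illustrative names, not the Biopython functions used in the slides.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI table 1), codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def translate(dna):
    """Translate complete codons only; a trailing partial codon is ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def six_frames(dna):
    """Yield translations for offsets 0, 1, 2 on the given and the reverse strand."""
    rev = "".join(COMPLEMENT[base] for base in reversed(dna))
    for offset in range(3):
        yield translate(dna[offset:])
        yield translate(rev[offset:])


def orfs(dna, min_len=1):
    """Yield stop-free protein fragments of at least min_len residues."""
    for frame in six_frames(dna):
        for piece in frame.split("*"):
            if len(piece) >= min_len:
                yield piece


frames = list(six_frames("AGTACACTGGT"))
pieces = translate("ATGAAATAAGGG").split("*")
```

Note that, as in the slides, an "ORF" here is simply any stretch between stop codons; no start codon is required.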
  24. AlignIO and the Alignment object
      Alignment: a set of sequences with the same length and alphabet.
      Use AlignIO just like SeqIO:
          >>> from Bio import AlignIO
          >>> aln = AlignIO.read("PF01601.sto", "stockholm")
          >>> print aln
          SingleLetterAlphabet() alignment with 22 rows and 730 columns
          NCTDAV-----LTYSSFGVCADGSIIA-VQPRNV-----SYDSV...HIQ Q1HVL3_CVH22/539-1170
          NCTTAV-----MTYSNFGICADGSLIP-VRPRNS-----SDNGI...HVQ SPIKE_CVHNL/723-1356
          NCTEPV-----LVYSNIGVCKSGSIGY-VPSQS------GQVKI...HVQ Q692M1_9CORO/740-1383
          NCTEPA-----LVYSNIGVCKNGAIGL-VGIRN------TQPKI...HIQ Q0Q4F4_9CORO/729-1360
          NCTSPR-----LVYSNIGVCTSGAIGL-LSPKX------AQPQI...HVQ Q0Q4F6_9CORO/743-1371
          NCTNPV-----LTYSSYGVCPDGSITR-LGLTD------VQPHF...--T A4ULL0_9CORO/726-1328
          NCTKPV-----LSYGPISVCSDGAIAG-TSTLQN-----TRPSI...KEW A6N263_9CORO/406-1035
          ECDIPIGAGICASYHTVSLLRSTSQKSIVAYTMS------LGAD...HYT Q6T7X8_CVHSA/647-1255
          ...
          DCE-PV-----ITYSNIGVCKNGAFVF-INVTH------SDGDV...HVH Q0PKZ5_CVPPU/797-1449
  25. Snack Time
  26. EUtils and BLAST
  29. EUtils: Entrez Programming Utilities
      Access NCBI’s online services:
          from Bio import Entrez, SeqIO
          Entrez.email = "you@uga.edu"
      Request a GenBank record:
          handle = Entrez.efetch(db="protein", id="69316",
                                 rettype="gb", retmode="text")
          record = SeqIO.read(handle, "gb")
      Specify multiple IDs in one query:
          handle = Entrez.efetch(db="protein", id="349839,349840",
                                 rettype="fasta", retmode="text")
          records = SeqIO.parse(handle, "fasta")
  30. Interlude: SeqRecord attributes
        - seq: the sequence (Seq) itself
        - id: primary ID for the sequence, e.g. accession number (string)
        - name: “common” name/id for the sequence, like the GenBank LOCUS id
        - description: human-readable description of the sequence
        - letter_annotations: restricted dictionary of additional info about individual letters in the sequence, e.g. quality scores
        - annotations: dictionary of additional unstructured info
        - features: list of SeqFeature objects with more structured information, e.g. position of genes on a genome, domains on a protein sequence
        - dbxrefs: list of database cross-references (strings)
  31. Working with a fetched GenBank record
          from Bio import Entrez, SeqIO
          Entrez.email = "me@uga.edu"
          handle = Entrez.efetch(db="nucleotide", id="M95169",
                                 rettype="gb", retmode="text")
          record = SeqIO.read(handle, "genbank")
          handle.close()
          print record
          print record.features[10]
          sliced = record[20000:]  # Last ~25% of the genome
          print sliced

          from Bio.Seq import Seq
          from Bio.Alphabet import generic_protein
          # Each "translation" qualifier is a list; take its first entry
          translations = [f.qualifiers["translation"]
                          for f in record.features[1:]]
          proteins = [Seq(t[0], generic_protein)
                      for t in translations]
  32. NCBI Blast
      BLAST can be used either standalone or through NCBI’s server.
      Online:
          >>> from Bio.Blast import NCBIWWW
          >>> result_handle = NCBIWWW.qblast('blastp', 'nr', query_string)
      Standalone:
        - “Legacy” (blastall):
              >>> from Bio.Blast.Applications import BlastallCommandline
              >>> help(BlastallCommandline)
        - New hotness (Blast+):
              >>> from Bio.Blast.Applications import NcbiblastpCommandline
              >>> help(NcbiblastpCommandline)
  33. Parsing BLAST output
      BLAST produces reports in plain-text and XML format. Biopython requests XML by default.
          >>> from Bio.Blast import NCBIWWW, NCBIXML
          >>> result_handle = NCBIWWW.qblast('blastp',
          ...                                'nr', query_string)
          >>> blast_record = NCBIXML.read(result_handle)
          >>> print blast_record
  34. BLAST example: submitting the query
          # Search for homologs of a protein sequence
          from Bio import SeqIO
          from Bio.Blast import NCBIWWW, NCBIXML

          # Read and reformat the query sequence
          seq_rec = SeqIO.read('gi2.gb', 'gb')
          query = seq_rec.format('fasta')

          # Submit an online BLAST query
          # (This takes some time to run)
          result_handle = NCBIWWW.qblast('blastx', 'nr', query)
  35. BLAST example: saving and reloading the results
          # 1. Save the BLAST results as an XML file
          with open('aprotinin_blast.xml', 'w') as save_file:
              save_file.write(result_handle.read())
          result_handle.close()

          # NB: The BLAST result handle can only be read once
          # Reload it from the file
          with open('aprotinin_blast.xml') as result_handle:
              blast_record = NCBIXML.read(result_handle)
  36. BLAST example: histogram of hit scores
          # 2. Display a histogram of BLAST hit scores
          def get_scores(alignments):
              for aln in alignments:
                  for hsp in aln.hsps:
                      yield hsp.score

          scores = list(get_scores(blast_record.alignments))

          # Draw the histogram
          import pylab
          pylab.hist(scores, bins=20)
          pylab.title("Scores of %d BLAST hits" % len(scores))
          pylab.xlabel("BLAST score")
          pylab.ylabel("# hits")
          # Save a copy for later; done before show(), because the figure
          # may be empty once its window has been closed
          pylab.savefig('aprotinin_scores.png')
          pylab.show()
  37. Figure: Histogram of BLAST scores generated by pylab
  38. BLAST example: extracting high-scoring hits
          # 3. Extract the sequences of high-scoring BLAST hits
          from Bio.Seq import Seq
          from Bio.SeqRecord import SeqRecord

          def get_hsps(alignments, threshold):
              for aln in alignments:
                  for hsp in aln.hsps:
                      if hsp.score >= threshold:
                          yield SeqRecord(Seq(hsp.sbjct), id=aln.accession)
                          # Keep only the first passing HSP per alignment
                          break

          best_seqs = get_hsps(blast_record.alignments, 321)
          SeqIO.write(best_seqs, 'aprotinin.fasta', 'fasta')
  39. Calling other external programs
      Biopython has wrappers for other command-line programs in:
        - Bio.Blast.Applications: the Blast+ suite
        - Bio.Align.Applications: Muscle, ClustalW, ...
        - Bio.Emboss.Applications: needle, water, ...
      Let’s re-align our BLAST results using Muscle, and format the alignment for use with stand-alone Phylip.
  40. Re-aligning with Muscle
          from Bio import AlignIO
          from Bio.Align.Applications import MuscleCommandline
          from StringIO import StringIO

          # Construct the shell command
          muscle_cmd = MuscleCommandline(input="aprotinin.fasta")
          # Execute the command
          # Get output (the alignment) and any error messages
          muscle_out, muscle_err = muscle_cmd()
          # Read the alignment back in
          align = AlignIO.read(StringIO(muscle_out), "fasta")
          # Format the alignment for Phylip
          AlignIO.write([align], 'aprotinin.phy', 'phylip')
  41. Phylogenetics
  42. Phylogenetic tree I/O
      Start with:
          >>> from Bio import Phylo
      Input and output of trees is just like SeqIO:
        - read, parse: single or multiple trees in Newick, Nexus and PhyloXML formats
        - write: to any of the formats supported by read/parse
        - convert: between two formats in one step
      Use StringIO to load strings directly:
          >>> from cStringIO import StringIO
          >>> handle = StringIO("((A,B),(C,(D,E)));")
          >>> tree = Phylo.read(handle, "newick")
  43. What’s in a tree?
      Make a tree with branch lengths:
          >>> tree = Phylo.read(StringIO("((A:1,B:1):2,"
          ...     "(C:2,(D:1,E:1):1):1);"), "newick")
      View the object structure of the entire tree:
          >>> print tree
      Draw an “ASCII-art” (plain text) representation:
          >>> Phylo.draw_ascii(tree)
      ... OK, let’s do it properly now:
          >>> Phylo.draw(tree)
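To see what the Newick string above encodes before handing it to Bio.Phylo, here is a deliberately small recursive parser in plain Python. It is a sketch under narrow assumptions (unquoted names, optional branch lengths, no comments or internal-node support values) and is not how Bio.Phylo parses trees; `parse_newick` and `leaf_depths` are hypothetical helper names.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested (name, branch_length, children) tuples."""
    text = text.strip().rstrip(";")
    pos = [0]  # current read position, boxed so the nested functions can update it

    def read_until(stop_chars):
        start = pos[0]
        while pos[0] < len(text) and text[pos[0]] not in stop_chars:
            pos[0] += 1
        return text[start:pos[0]]

    def parse_clade():
        children = []
        if pos[0] < len(text) and text[pos[0]] == "(":
            pos[0] += 1                       # consume "("
            children.append(parse_clade())
            while text[pos[0]] == ",":
                pos[0] += 1                   # consume ","
                children.append(parse_clade())
            pos[0] += 1                       # consume ")"
        name = read_until(",():") or None
        length = None
        if pos[0] < len(text) and text[pos[0]] == ":":
            pos[0] += 1                       # consume ":"
            length = float(read_until(",()"))
        return (name, length, children)

    return parse_clade()


def leaf_depths(clade, depth=0.0):
    """Map each terminal name to its summed branch length from the root."""
    name, length, children = clade
    depth += length or 0.0
    if not children:
        return {name: depth}
    depths = {}
    for child in children:
        depths.update(leaf_depths(child, depth))
    return depths


tree = parse_newick("((A:1,B:1):2,(C:2,(D:1,E:1):1):1);")
depths = leaf_depths(tree)
```

In this example every tip ends up at the same distance from the root, which is why the drawn tree has all its tips lined up.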
  44. Modify the tree
      Check the tree object for its methods:
          >>> help(tree)
      Try a few:
          >>> tree.get_terminals()
          >>> clade = tree.common_ancestor("A", "B")
          >>> clade.color = "red"
          >>> tree.root_with_outgroup("D", "E")
          >>> tree.ladderize()
          >>> Phylo.draw(tree)
  45. External applications
      Biopython wraps a number of external programs for phylogenetics. We’re not going to use them now, but here’s where to find them:
        - Bio.Phylo.PAML: PAML wrappers & helpers
        - Bio.Phylo.Applications: command-line wrapper for PhyML (PhymlCommandline); RAxML and others on the way. (Anything you’d like to see sooner?)
        - Bio.Emboss.Applications: other tools ported via Embassy, including Phylip
  46. Protein structures
  48. Going 3D: The PDB module
      Load a structure:
          >>> from Bio import PDB
          >>> parser = PDB.PDBParser()
          >>> struct = parser.get_structure('1ATP', '1ATP.pdb')
      Inspect the object hierarchy:
          >>> list(struct)
          >>> model = struct[0]
          >>> list(model)
          >>> chain = model['E']
          >>> list(chain)
          >>> residue = chain[15]
          >>> list(residue)
  49. Figure: The “SMCRA” object hierarchy
  50. Extracting a peptide sequence
      Get the amino acid sequence through a Polypeptide object:
          >>> from Bio import PDB
          >>> parser = PDB.PDBParser()
          >>> struct = parser.get_structure('1ATP',
          ...                               '1ATP.pdb')
          >>> ppb = PDB.PPBuilder()
          >>> peptides = ppb.build_peptides(struct)
          >>> for pep in peptides:
          ...     print pep.get_sequence()
  51. Calculating RMSD
      Given two aligned structures, filter a list of target residues for high RMS deviation.
      Input:
        - list of residue positions (integers)
        - two equivalent chains from aligned protein models; residue numbers must match
        - minimum RMSD value (float)
      Output:
        - list of residue positions, filtered
      Procedure:
      1. Extract coordinates of Cα atoms
      2. If available (not glycine), extract Cβ coordinates, too
      3. Use Bio.SVDSuperimposer to calculate the RMSD between coordinates
      4. Compare to the given RMSD threshold
  52. Calculating RMSD: the code
          from Bio.SVDSuperimposer import SVDSuperimposer
          from numpy import array

          def filt_rms(resids, refchain, cmpchain, thresh=0.5):
              superimposer = SVDSuperimposer()
              for res in resids:
                  refres = refchain[res]
                  cmpres = cmpchain[res]
                  # Always compare the alpha carbons
                  coord1 = [refres['CA'].get_coord()]
                  coord2 = [cmpres['CA'].get_coord()]
                  if refres.has_id('CB') and cmpres.has_id('CB'):
                      # Not glycine
                      coord1.append(refres['CB'].get_coord())
                      coord2.append(cmpres['CB'].get_coord())
                  superimposer.set(array(coord1), array(coord2))
                  rmsd = superimposer.get_init_rms()
                  # Keep residues that deviate by at least the threshold
                  if rmsd >= thresh:
                      yield res
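The number being thresholded is a root-mean-square deviation over paired atoms. Assuming get_init_rms() reports that value for the coordinates as given (that is, without re-fitting the already aligned structures), the arithmetic can be sketched in plain Python as follows; `rmsd` here is a hypothetical helper working on (x, y, z) tuples, not a Biopython function.

```python
import math


def rmsd(coords1, coords2):
    """Root-mean-square deviation between two equal-length lists of (x, y, z) points."""
    if len(coords1) != len(coords2) or not coords1:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(coords1, coords2):
        # Squared Euclidean distance between the paired atoms
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(total / len(coords1))


# Two atoms (think CA and CB), each displaced by 1 unit along z
shifted = rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 1), (1, 0, 1)])
identical = rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 0, 0)])
```

With only one or two atoms per residue the value is dominated by each atom's individual displacement, so the threshold effectively flags residues whose backbone or side-chain anchor has moved.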
  53. Figure: Superimposed structures, with selected deviating residues
  54. Further reading
        - Biopython tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html
        - Biopython wiki: http://biopython.org/
        - This presentation: http://www.slideshare.net/etalevich/biopython-programming-workshop-at-uga
  55. Thanks. ’Preciate it. Gracias.