Sequences and alignments
    NCBI EUtils and BLAST
              Phylogenetics
          Protein structures




   IOB Workshop: Biopython
A programming toolkit for bioinformatics


                    Eric Talevich

 Institute of Bioinformatics, University of Georgia


                   Mar. 29, 2012




               Eric Talevich   IOB Workshop: Biopython




      Getting started
                        with






Installing Python


  Biopython is a library for the Python programming language.

  First, you’ll need these installed:
   Python 2.7 from http://python.org. It may already be
              installed on your computer. (Version 2.6 is OK, too.)
         IDLE, a simple Integrated DeveLopment Environment.
               Usually bundled with the Python distribution.

  Now, start an interactive session in IDLE. [1]

  [1] On your own, check out IPython (http://ipython.scipy.org/). It’s an
      enhanced Python interpreter that feels somewhat like R.


Installing Python packages

  Biopython is a Python package. There are a few standard ways to
  install Python packages:
  From source: Download from PyPI [2], unpack and install with the
               included setup.py script.
  easy_install: Install setuptools from source [3], then use the
               easy_install command to fetch and install all other
               packages by name:
                 $ easy_install <package name>
          pip: Like easy_install, use pip [4] to manage packages:
                 $ pip install <package name>

  [2] http://pypi.python.org/pypi/
  [3] http://pypi.python.org/pypi/setuptools
  [4] http://pypi.python.org/pypi/pip


Installing NumPy, matplotlib and Biopython

  Biopython relies on a few other Python packages for extra
  functionality. We’ll use these:
       numpy — efficient numerical functions and data structures
       (for Bio.PDB)
       matplotlib — plotting (for Bio.Phylo)

  Then finally:
       biopython — the reason we’re here today


  (Biopython, NumPy, matplotlib, setuptools and pip are also packaged for
  many Linux distributions.)




Testing

      Check your Biopython installation:
             >>> import Bio
             >>> print Bio.__version__


      Import a NumPy-based component:
             >>> from Bio import PDB


      Show a simple plot:
             >>> from matplotlib import pyplot
             >>> pyplot.plot(range(5), range(5))
             >>> pyplot.show()
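
  The three checks above can also be run in one pass. The following is a
  minimal sketch, not part of the workshop material: the helper name
  check_packages is made up for illustration, and it only assumes the import
  names Bio, numpy and matplotlib. It is written to run on Python 2.7 and 3.

```python
import importlib


def check_packages(names):
    """Return {import name: version string, or None if the import fails}."""
    found = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            found[name] = None
        else:
            # Not every package defines __version__, so fall back to a label.
            found[name] = getattr(module, "__version__", "unknown version")
    return found


status = check_packages(["Bio", "numpy", "matplotlib"])
for name in sorted(status):
    print("%-12s %s" % (name, status[name] or "NOT INSTALLED"))
```

  A package reported as NOT INSTALLED can then be installed with pip or
  easy_install as described earlier.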





    Let’s start using








                             Biopython
1   Sequences and alignments
      The Seq object
      SeqIO and the SeqRecord object

2   NCBI EUtils and BLAST
     EUtils: Entrez Programming Utilities
     NCBI Blast
     External programs

3   Phylogenetics

4   Protein structures







           Sequences and Alignments






The Seq object


        >>> from Bio.Seq import Seq
        >>> myseq = Seq('AGTACACTGGT')
        >>> myseq
        Seq('AGTACACTGGT', Alphabet())
        >>> print myseq
        AGTACACTGGT
        >>> myseq.transcribe()
        Seq('AGUACACUGGU', RNAAlphabet())
        >>> myseq.translate()
        Seq('STL', ExtendedIUPACProtein())






A Seq object consists of:
       data — the underlying Python character string
   alphabet — DNA, RNA, protein, etc.

It supports most Python string methods:
        >>> myseq.count('GT')
        2

And some biology-specific methods, too:
       >>> myseq.reverse_complement()
       Seq('ACCAGTGTACT', Alphabet())

Intrigued? Read on:
          >>> help(Seq)
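
To make the string-like behaviour concrete, here is a plain-Python sketch of
what these methods compute for the example sequence. It is not Biopython's
implementation: the helper functions are written for illustration and handle
only unambiguous, upper-case DNA.

```python
def reverse_complement(dna):
    """Reverse-complement an unambiguous, upper-case DNA string."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(dna))


def transcribe(dna):
    """Rewrite a coding-strand DNA string as RNA (T becomes U)."""
    return dna.replace("T", "U")


seq = "AGTACACTGGT"
print(seq.count("GT"))          # 2
print(reverse_complement(seq))  # ACCAGTGTACT
print(transcribe(seq))          # AGUACACUGGU
```

The Seq object adds the alphabet on top of this, so the result of each method
is again a Seq rather than a bare string.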




SeqIO: Sequence Input/Output

  Sequence data is stored in many different file formats.
  Bio.SeqIO supports:

                abi       fastq     phylip      swiss
                ace       genbank   pir         tab
                clustal   ig        qual        uniprot-xml
                embl      imgt      seqxml
                emboss    nexus     sff
                fasta     phd       stockholm

  Manually fetch some data from the PDB website [5]:

         1ATP.fasta: two protein sequences, FASTA format
         1ATP.pdb: the 3D structure, for later

  [5] http://www.rcsb.org/pdb/explore/explore.do?structureId=1ATP


The SeqIO API


  SeqIO provides four functions:
        parse: Iteratively parse all elements in the file
         read: Parse a one-element file and return the element
        write: Write elements to a file
      convert: Parse one format and immediately write another

  Biopython uses the same I/O conventions for alignments
  (AlignIO), BLAST results (Blast), and phylogenetic trees
  (Phylo).
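
  The difference between parse and read is easiest to see on a toy parser.
  The sketch below is plain Python, not Bio.SeqIO: parse_fasta and read_fasta
  are hypothetical stand-ins that return (title, sequence) tuples instead of
  SeqRecords, and the FASTA text is made up for illustration.

```python
import io


def parse_fasta(handle):
    """Yield (title, sequence) pairs one at a time, iterator-style."""
    title, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif line and title is not None:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


def read_fasta(handle):
    """Return the only record in the handle; complain if there are 0 or 2+."""
    records = list(parse_fasta(handle))
    if len(records) != 1:
        raise ValueError("expected exactly one record, found %d" % len(records))
    return records[0]


# Made-up two-record FASTA text, for illustration only.
text = u">seq1 first example\nACGT\nACGT\n>seq2 second example\nGGCC\n"

records = parse_fasta(io.StringIO(text))
print(next(records))   # only the first record has been parsed so far
print(list(records))   # the remaining records
print(list(records))   # empty: an iterator is used up after one pass
```

  The last line is the reason for converting the result of parse to a list
  when the records are needed more than once.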



The SeqRecord object

  SeqIO.parse returns SeqRecords.
  SeqRecord wraps a Seq object and attaches metadata.

   1   Pass the file name to the SeqIO parser; specify FASTA format:
            from Bio import SeqIO
            seqrecs = SeqIO.parse("1ATP.fasta", "fasta")
            print seqrecs
   2   To see all records at once, convert the iterator to a list:
            allrecs = list(seqrecs)
            print allrecs[0]
            print allrecs[0].seq


Example: Shuffled sequences

        Given a real DNA sequence, create a “background” set of
        randomized sequences with the same composition.
  Procedure:
    1   Read the source sequence from a file
        – Use Bio.SeqIO
    2   In a loop:
            Shuffle the sequence
            – Use random.shuffle from Python’s standard library
            Create a new SeqRecord from the shuffled sequence
            – Because SeqIO.write works with SeqRecords
    3   Write the shuffled SeqRecords to another file



import random
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

orig_rec = SeqIO.read("gi2.gb", "genbank")
alphabet = orig_rec.seq.alphabet
out_recs = []
for i in xrange(1, 31):
    nucleotides = list(orig_rec.seq)
    random.shuffle(nucleotides)
    new_seq = Seq("".join(nucleotides), alphabet)
    new_rec = SeqRecord(new_seq,
                        id="shuffle" + str(i))
    out_recs.append(new_rec)

SeqIO.write(out_recs, "gi2_shuffled.fasta", "fasta")
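
Shuffling keeps the composition because it only reorders the letters. A
small plain-Python check of that claim, as a sketch: the input sequence is
made up, shuffled_copy is a hypothetical helper, and the fixed seed is there
only to make the run repeatable.

```python
import random
from collections import Counter


def shuffled_copy(sequence, rng):
    """Return a shuffled copy of a string; the letters used stay the same."""
    letters = list(sequence)
    rng.shuffle(letters)
    return "".join(letters)


rng = random.Random(0)          # fixed seed so the run is repeatable
original = "AGTACACTGGTTTGCA"   # made-up sequence, for illustration only
background = [shuffled_copy(original, rng) for _ in range(30)]

# Every shuffled sequence has the same letter counts as the original.
same_composition = all(Counter(s) == Counter(original) for s in background)
print(same_composition)
print(background[:3])
```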




Example: ORF translation

        Split a set of unannotated DNA sequences into unique
        ORFs, translating in all 6 frames.
  Biopython can help with each piece of this problem:
    1   Parse the given unannotated DNA sequences (SeqIO.parse)
    2   Get the template strand’s sequence (Seq.reverse_complement)
    3   Translate both strands into protein sequences (Seq.translate)
    4   Shift each strand by +1 and +2 for alternate reading frames
        (string-like Seq slicing)
    5   Split sequences at stop codons (Seq.split('*'))
    6   Write translated sequences to a new file (SeqIO.write)





def translate_six_frames(seq, table=1):
    """Translate a nucleotide sequence in 6 frames.

    Returns an iterable of 6 translated protein
    sequences.
    """
    rev = seq.reverse_complement()
    for i in range(3):
        # Coding (Crick) strand
        yield seq[i:].translate(table)
        # Template (Watson) strand
        yield rev[i:].translate(table)







def translate_orfs(sequences, min_prot_len=60):
    """Find and translate all ORFs in sequences.

    Translates each sequence in all 6 reading frames,
    splits sequences on stop codons, and produces an
    iterable of all protein sequences of length at
    least min_prot_len.
    """
    for seq in sequences:
        for frame in translate_six_frames(seq):
            for prot in frame.split("*"):
                if len(prot) >= min_prot_len:
                    yield prot







from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

if __name__ == "__main__":
    import sys
    infile = sys.stdin
    outfile = sys.stdout
    records = SeqIO.parse(infile, "fasta")
    seqs = (rec.seq for rec in records)
    proteins = translate_orfs(seqs)
    # Wrap each ORF in a SeqRecord so SeqIO.write can handle it
    seqrecs = (SeqRecord(seq, id="orf" + str(i))
               for i, seq in enumerate(proteins))
    SeqIO.write(seqrecs, outfile, "fasta")
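
For readers who want to see the whole ORF procedure without Biopython, here
is a plain-string sketch of the same steps. It is an illustration rather than
the library's implementation: it assumes the standard genetic code (table 1),
accepts only unambiguous upper-case DNA, drops a trailing partial codon, and
all function names are made up.

```python
import itertools

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODONS = ["".join(c) for c in itertools.product(BASES, repeat=3)]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def translate(dna):
    """Translate whole codons from the start; a partial last codon is dropped."""
    end = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, end, 3))


def six_frames(dna):
    """Yield the 6 translations: 3 offsets on each of the two strands."""
    rev = "".join(COMPLEMENT[base] for base in reversed(dna))
    for offset in range(3):
        yield translate(dna[offset:])
        yield translate(rev[offset:])


def orfs(dna, min_len):
    """Yield every stop-free stretch of at least min_len residues."""
    for frame in six_frames(dna):
        for piece in frame.split("*"):
            if len(piece) >= min_len:
                yield piece


print(translate("AGTACACTGGT"))   # STL
for protein in orfs("ATGGCCAAATAG", min_len=3):   # made-up input
    print(protein)
```

As in the Biopython version, the pieces are simply the stretches between stop
codons; neither version requires an ORF to begin with a start codon.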






AlignIO and the Alignment object

  Alignment: a set of sequences with the same length and alphabet.
  Use AlignIO just like SeqIO:
     >>> from Bio import AlignIO
     >>> aln = AlignIO.read("PF01601.sto", "stockholm")
     >>> print aln
  SingleLetterAlphabet() alignment with 22 rows and 730 columns
  NCTDAV-----LTYSSFGVCADGSIIA-VQPRNV-----SYDSV...HIQ Q1HVL3 CVH22/539-1170
  NCTTAV-----MTYSNFGICADGSLIP-VRPRNS-----SDNGI...HVQ SPIKE CVHNL/723-1356
  NCTEPV-----LVYSNIGVCKSGSIGY-VPSQS------GQVKI...HVQ Q692M1 9CORO/740-1383
  NCTEPA-----LVYSNIGVCKNGAIGL-VGIRN------TQPKI...HIQ Q0Q4F4 9CORO/729-1360
  NCTSPR-----LVYSNIGVCTSGAIGL-LSPKX------AQPQI...HVQ Q0Q4F6 9CORO/743-1371
  NCTNPV-----LTYSSYGVCPDGSITR-LGLTD------VQPHF...--T A4ULL0 9CORO/726-1328
  NCTKPV-----LSYGPISVCSDGAIAG-TSTLQN-----TRPSI...KEW A6N263 9CORO/406-1035
  ECDIPIGAGICASYHTVSLLRSTSQKSIVAYTMS------LGAD...HYT Q6T7X8 CVHSA/647-1255
  ...

  DCE-PV-----ITYSNIGVCKNGAFVF-INVTH------SDGDV...HVH Q0PKZ5 CVPPU/797-1449








      Snack Time








EUtils and BLAST



EUtils: Entrez Programming Utilities

      Access NCBI’s online services:
      from Bio import Entrez, SeqIO
      Entrez.email = "you@uga.edu"

      Request a GenBank record:
      handle = Entrez.efetch(db="protein", id="69316",
                rettype="gb", retmode="text")
      record = SeqIO.read(handle, "gb")

      Specify multiple IDs in one query:
      handle = Entrez.efetch(db="protein",
                id="349839,349840",
                rettype="fasta", retmode="text")
      records = SeqIO.parse(handle, "fasta")


Interlude: SeqRecord attributes
            seq: the sequence (Seq) itself
             id: primary ID for the sequence, e.g. accession number
                 (string)
           name: “common” name/id for the sequence, like GenBank
                 LOCUS id
    description: human-readable description of the sequence
  letter_annotations: restricted dictionary of additional info about
                 individual letters in the sequence, e.g. quality scores
    annotations: dictionary of additional unstructured info
       features: list of SeqFeature objects with more structured
                 information, e.g. position of genes on a genome,
                 domains on a protein sequence
        dbxrefs: list of database cross-references (strings)

from Bio import Entrez, SeqIO
Entrez.email = "me@uga.edu"

handle = Entrez.efetch(db="nucleotide", id="M95169",
                       rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()
print record
print record.features[10]
sliced = record[20000:]    # Last ~25% of the genome
print sliced

from Bio.Seq import Seq
from Bio.Alphabet import generic_protein
translations = [f.qualifiers["translation"]
                for f in record.features[1:]]
proteins = [Seq(t[0], generic_protein)
            for t in translations]



NCBI Blast

  BLAST can be used either standalone or through NCBI’s server.
       Online:   >>> from Bio.Blast import NCBIWWW
                 >>> result_handle = NCBIWWW.qblast(
                 ...     'blastp', 'nr', query_string)
  Standalone: “Legacy” (blastall):
                 >>> from Bio.Blast.Applications import BlastallCommandline
                 >>> help(BlastallCommandline)
              New hotness (Blast+):
                 >>> from Bio.Blast.Applications import NcbiblastpCommandline
                 >>> help(NcbiblastpCommandline)




Parsing BLAST output



  BLAST produces reports in plain-text and XML format.

  Biopython requests XML by default.
         >>> from Bio.Blast import NCBIWWW, NCBIXML
         >>> result_handle = NCBIWWW.qblast('blastp',
         ...     'nr', query_string)
         >>> blast_record = NCBIXML.read(result_handle)
         >>> print blast_record





# Search for homologs of a protein sequence

from Bio import SeqIO
from Bio.Blast import NCBIWWW, NCBIXML

# Read and reformat the query sequence
seq_rec = SeqIO.read('gi2.gb', 'gb')
query = seq_rec.format('fasta')

# Submit an online BLAST query
# (This takes some time to run)
result_handle = NCBIWWW.qblast('blastx', 'nr', query)





# 1. Save the BLAST results as an XML file

with open('aprotinin_blast.xml', 'w') as save_file:
    save_file.write(result_handle.read())
result_handle.close()

# NB: The BLAST result handle can only be read once
# Reload it from the file
with open('aprotinin_blast.xml') as result_handle:
    blast_record = NCBIXML.read(result_handle)
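
The read-once behaviour is a general property of file-like handles rather
than something specific to BLAST. A minimal sketch, assuming an in-memory
io.StringIO as a stand-in for the result handle; the XML text is a
placeholder, not real BLAST output.

```python
import io

# Stand-in for a result handle; the text is a placeholder, not BLAST output.
result_handle = io.StringIO(u"<BlastOutput>placeholder</BlastOutput>")

first = result_handle.read()
second = result_handle.read()   # nothing left: the handle is at end-of-stream
print("first read: %d characters, second read: %d" % (len(first), len(second)))

# Keeping the text (or saving it to a file) allows a fresh handle later.
reloaded = io.StringIO(first)
print(reloaded.read() == first)
```

Saving the results to disk, as the script above does, has the extra benefit
that the slow online query does not need to be repeated.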





# 2. Display a histogram of BLAST hit scores

def get_scores(alignments):
    for aln in alignments:
        for hsp in aln.hsps:
            yield hsp.score

scores = list(get_scores(blast_record.alignments))

# Draw the histogram
import pylab
pylab.hist(scores, bins=20)
pylab.title("Scores of %d BLAST hits" % len(scores))
pylab.xlabel("BLAST score")
pylab.ylabel("# hits")
# Save a copy for later; do this before show(), because the
# figure may be empty once the plot window has been closed
pylab.savefig('aprotinin_scores.png')
pylab.show()





Figure: Histogram of BLAST scores generated by pylab

# 3. Extract the sequences of high-scoring BLAST hits

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

def get_hsps(alignments, threshold):
    for aln in alignments:
        for hsp in aln.hsps:
            if hsp.score >= threshold:
                yield SeqRecord(Seq(hsp.sbjct),
                                id=aln.accession)
                break

best_seqs = get_hsps(blast_record.alignments, 321)
SeqIO.write(best_seqs, 'aprotinin.fasta', 'fasta')






Calling other external programs



  Biopython has wrappers for other command-line programs in:
  Bio.Blast.Applications — the Blast+ suite
  Bio.Align.Applications — Muscle, ClustalW, . . .
  Bio.Emboss.Applications — needle, water, . . .


  Let’s re-align our BLAST results using Muscle, and format the
  alignment for use with stand-alone Phylip.





from Bio import AlignIO
from Bio.Align.Applications import MuscleCommandline
from StringIO import StringIO

# Construct the shell command
muscle_cmd = MuscleCommandline(input="aprotinin.fasta")
# Execute the command
# Get output (the alignment) and any error messages
muscle_out, muscle_err = muscle_cmd()

# Read the alignment back in
align = AlignIO.read(StringIO(muscle_out), "fasta")
# Format the alignment for Phylip
AlignIO.write([align], 'aprotinin.phy', 'phylip')








        Phylogenetics






Phylogenetic tree I/O

  Start with:
          >>> from Bio import Phylo

  Input and output of trees is just like SeqIO:
   read, parse single or multiple trees in Newick, Nexus and
               PhyloXML formats
         write to any of the formats supported by read/parse
       convert between two formats in one step

  Use StringIO to load strings directly:
          >>> from cStringIO import StringIO
          >>> handle = StringIO("((A,B),(C,(D,E)));")
          >>> tree = Phylo.read(handle, "newick")
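
  To see what a Newick string encodes, it can help to unpack one by hand.
  The sketch below is a toy parser in plain Python, not Bio.Phylo: it turns
  the parentheses into nested lists of leaf names and ignores branch lengths
  and internal labels. Quoted labels and comments are not handled, and the
  function names are made up for illustration.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested lists of leaf names."""
    text = "".join(text.split())   # drop all whitespace
    pos = [0]                      # current index, in a list so helpers can update it

    def peek():
        return text[pos[0]] if pos[0] < len(text) else ""

    def skip_label():
        # Advance past a name and/or ":length" up to the next structural character.
        while peek() not in ("", ",", "(", ")", ";"):
            pos[0] += 1

    def node():
        if peek() == "(":
            pos[0] += 1                      # consume "("
            children = [node()]
            while peek() == ",":
                pos[0] += 1                  # consume ","
                children.append(node())
            if peek() != ")":
                raise ValueError("expected ')' at position %d" % pos[0])
            pos[0] += 1                      # consume ")"
            skip_label()                     # internal label or branch length
            return children
        start = pos[0]
        skip_label()
        return text[start:pos[0]].split(":")[0]

    tree = node()
    if text[pos[0]:] != ";":
        raise ValueError("expected ';' at position %d" % pos[0])
    return tree


def terminals(tree):
    """List the leaf names of a nested-list tree, left to right."""
    if isinstance(tree, list):
        return [name for child in tree for name in terminals(child)]
    return [tree]


nested = parse_newick("((A,B),(C,(D,E)));")
print(nested)
print(terminals(nested))
```

  Bio.Phylo builds proper tree and clade objects instead of nested lists,
  which is what makes methods on the tree possible.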



What’s in a tree?

  Make a tree with branch lengths:
      >>> tree = Phylo.read(StringIO(
      ...     "((A:1,B:1):2,(C:2,(D:1,E:1):1):1);"), "newick")

  View the object structure of the entire tree:
       >>> print tree

  Draw an “ASCII-art” (plain text) representation:
      >>> Phylo.draw_ascii(tree)

  . . . OK, let’s do it properly now:
         >>> Phylo.draw(tree)




Modify the tree


  Check the tree object for its methods:
         >>> help(tree)

  Try a few:
          >>>   tree.get_terminals()
          >>>   clade = tree.common_ancestor("A", "B")
          >>>   clade.color = "red"
          >>>   tree.root_with_outgroup("D", "E")
          >>>   tree.ladderize()
          >>>   Phylo.draw(tree)






External applications


  Biopython wraps a number of external programs for phylogenetics.
  We’re not going to use them now, but here’s where to find them:
  Bio.Phylo.PAML — PAML wrappers & helpers
  Bio.Phylo.Applications — command-line wrapper for PhyML
              (PhymlCommandline); RAxML and others on the
              way. (Anything you’d like to see sooner?)
  Bio.Emboss.Applications — other tools ported via Embassy,
              including Phylip








              Protein structures





Going 3D: The PDB module
     Load a structure:
          >>> from Bio import PDB
          >>> parser = PDB.PDBParser()
          >>> struct = parser.get_structure('1ATP',
          ...                               '1ATP.pdb')
     Inspect the object hierarchy:
          >>> list(struct)
          >>> model = struct[0]
          >>> list(model)
          >>> chain = model['E']
          >>> list(chain)
          >>> residue = chain[15]
          >>> list(residue)
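
To make the hierarchy concrete without downloading 1ATP.pdb, the sketch below parses a tiny hand-written structure from memory and walks it level by level. It assumes Python 3 and a recent Biopython with NumPy installed. The helper `atom_line`, the identifier "demo" and all coordinates are invented for illustration and are not taken from the real 1ATP entry.

```python
from io import StringIO

from Bio import PDB


def atom_line(serial, name, resname, chain_id, resseq, x, y, z, element):
    """Format one fixed-width PDB ATOM record (hypothetical helper)."""
    return (
        "ATOM  %5d  %-3s %3s %1s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f          %2s"
        % (serial, name, resname, chain_id, resseq, x, y, z, 1.0, 0.0, element)
    )


# Two made-up residues in chain E; the coordinates are not real data
records = [
    atom_line(1, "N", "GLY", "E", 15, 0.000, 0.000, 0.000, "N"),
    atom_line(2, "CA", "GLY", "E", 15, 1.458, 0.000, 0.000, "C"),
    atom_line(3, "C", "GLY", "E", 15, 2.009, 1.420, 0.000, "C"),
    atom_line(4, "N", "ALA", "E", 16, 3.332, 1.536, 0.000, "N"),
    atom_line(5, "CA", "ALA", "E", 16, 3.988, 2.839, 0.000, "C"),
    atom_line(6, "CB", "ALA", "E", 16, 5.504, 2.693, 0.000, "C"),
]
handle = StringIO("\n".join(records) + "\nEND\n")

parser = PDB.PDBParser(QUIET=True)  # QUIET hides construction warnings
struct = parser.get_structure("demo", handle)

# Structure -> Model -> Chain -> Residue -> Atom
for model in struct:
    for chain in model:
        for residue in chain:
            atom_names = [atom.get_id() for atom in residue]
            print(model.id, chain.id, residue.id[1], residue.get_resname(), atom_names)

# Index straight down the hierarchy, as in the session above
residue = struct[0]["E"][15]
ca_coord = residue["CA"].get_coord()
```

Each level is both iterable and indexable by the identifier of its children, which is why `list(...)` and `[...]` work at every step of the session above.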




Figure: The “SMCRA” object hierarchy (Structure, Model, Chain, Residue, Atom)



Extracting a peptide sequence


  Get the amino acid sequence through a Polypeptide object:
         >>> from Bio import PDB
         >>> parser = PDB.PDBParser()
         >>> struct = parser.get_structure('1ATP',
         ...                               '1ATP.pdb')
         >>> ppb = PDB.PPBuilder()
         >>> peptides = ppb.build_peptides(struct)
         >>> for pep in peptides:
         ...     print pep.get_sequence()






Calculating RMSD
     Given two aligned structures, filter a list of target
     residues for high RMS deviation.

       Input:   list of residue positions (integers)
                two equivalent chains from aligned protein
                  models (residue numbers must match)
                minimum RMSD value (float)
      Output:   list of residue positions, filtered
   Procedure:   1   Extract coordinates of Cα atoms
                2   If available (not glycine), extract Cβ
                    coordinates, too
                3   Use Bio.SVDSuperimposer to calculate the
                    RMSD between coordinates
                4   Compare to the given RMSD threshold

from Bio.SVDSuperimposer import SVDSuperimposer
from numpy import array

def filt_rms(resids, refchain, cmpchain, thresh=0.5):
    """Yield the residue positions whose RMSD is at least `thresh`."""
    sup = SVDSuperimposer()
    for res in resids:
        refres = refchain[res]
        cmpres = cmpchain[res]
        # Always compare the C-alpha atoms
        coord1 = [refres['CA'].get_coord()]
        coord2 = [cmpres['CA'].get_coord()]
        if refres.has_id('CB') and cmpres.has_id('CB'):
            # Not glycine: compare the C-beta atoms, too
            coord1.append(refres['CB'].get_coord())
            coord2.append(cmpres['CB'].get_coord())
        sup.set(array(coord1), array(coord2))
        # RMSD of the coordinates as given; the chains are already aligned
        rmsd = sup.get_init_rms()
        if rmsd >= thresh:
            yield res
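
For intuition about the number being thresholded: `get_init_rms()` is documented as the RMSD of the untransformed coordinates, that is, with no fitting applied, which suits chains that are already aligned. The sketch below reproduces that arithmetic in plain Python (no Biopython or NumPy) on made-up coordinates. The names `plain_rmsd` and `filter_positions` are hypothetical and not part of any library.

```python
import math


def plain_rmsd(coords1, coords2):
    """RMSD between two equal-length lists of (x, y, z) points, no fitting."""
    if len(coords1) != len(coords2) or not coords1:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(coords1, coords2):
        squared += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(squared / len(coords1))


def filter_positions(ref_atoms, cmp_atoms, threshold=0.5):
    """Yield positions whose per-residue RMSD is at least the threshold.

    Both arguments map a residue position to its list of atom coordinates
    (for example C-alpha, plus C-beta when present).
    """
    for position in sorted(ref_atoms):
        if plain_rmsd(ref_atoms[position], cmp_atoms[position]) >= threshold:
            yield position


# Made-up coordinates: residue 10 barely moves, residue 11 shifts by 2 units in z
ref_atoms = {10: [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)], 11: [(3.0, 0.0, 0.0)]}
cmp_atoms = {10: [(0.0, 0.1, 0.0), (1.5, 0.1, 0.0)], 11: [(3.0, 0.0, 2.0)]}

deviating = list(filter_positions(ref_atoms, cmp_atoms))
print(deviating)
```

Because only one or two atoms per residue enter each calculation, the result measures local displacement at that position and not the overall fit of the two chains.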





Figure: Superimposed structures, with selected deviating residues


Further reading



      Biopython tutorial:
      http://biopython.org/DIST/docs/tutorial/Tutorial.html
      Biopython wiki:
      http://biopython.org/
      This presentation:
      http://www.slideshare.net/etalevich/biopython-programming-workshop-at-uga








             Thanks
                 ’Preciate it.
                      Gracias




Biopython programming workshop at UGA

  • 1. Sequences and alignments NCBI EUtils and BLAST Phylogenetics Protein structures IOB Workshop: Biopython A programming toolkit for bioinformatics Eric Talevich Institute of Bioinformatics, University of Georgia Mar. 29, 2012 Eric Talevich IOB Workshop: Biopython
  • 2. Sequences and alignments NCBI EUtils and BLAST Phylogenetics Protein structures Getting started with Eric Talevich IOB Workshop: Biopython
  • 3. Sequences and alignments NCBI EUtils and BLAST Phylogenetics Protein structures Installing Python Biopython is a library for the Python programming language. First, you’ll need these installed: Python 2.7 from http://python.org. It may already be installed on your computer. (Version 2.6 is OK, too.) IDLE, a simple Integrated DeveLopment Environment. Usually bundled with the Python distribution. Now, start an interactive session in IDLE. 1 1 On your own, check out IPython (http://ipython.scipy.org/). It’s an enhanced Python interpreter that feels somewhat like R. Eric Talevich IOB Workshop: Biopython
  • 4. Sequences and alignments NCBI EUtils and BLAST Phylogenetics Protein structures Installing Python packages Biopython is a Python package. There are a few standard ways to install Python packages: From source: Download from PyPI 2 , unpack and install with the included setup.py script. easy install: Install from source 3 , then use the easy install command to fetch install all other packages by name: $ easy install <package name> pip: Like easy install, use pip 4 to manage packages: $ pip install <package name> 2 http://pypi.python.org/pypi/ 3 http://pypi.python.org/pypi/setuptools 4 http://pypi.python.org/pypi/pip Eric Talevich IOB Workshop: Biopython
  • 5. Sequences and alignments NCBI EUtils and BLAST Phylogenetics Protein structures Installing NumPy, matplotlib and Biopython Biopython relies on a few other Python packages for extra functionality. We’ll use these: numpy — efficient numerical functions and data structures (for Bio.PDB) matplotlib — plotting (for Bio.Phylo) Then finally: biopython — the reason we’re here today (Biopython, NumPy, matplotlib, setuptools and pip are also packaged for many Linux distributions.) Eric Talevich IOB Workshop: Biopython
  • 6. Sequences and alignments NCBI EUtils and BLAST Phylogenetics Protein structures Testing Check your Biopython installation: >>> import Bio >>> print Bio. version Import a NumPy-based component: >>> from Bio import PDB Show a simple plot: >>> from matplotlib import pyplot >>> pyplot.plot(range(5), range(5)) >>> pyplot.show() Eric Talevich IOB Workshop: Biopython
  • 7. Sequences and alignments NCBI EUtils and BLAST Phylogenetics Protein structures Let’s start using Eric Talevich IOB Workshop: Biopython
  • 8. Sequences and alignments NCBI EUtils and BLAST Phylogenetics Protein structures Biopython 1 Sequences and alignments The Seq object SeqIO and the SeqRecord object 2 NCBI EUtils and BLAST EUtils: Entrez Programming Utilities NCBI Blast External programs 3 Phylogenetics 4 Protein structures Eric Talevich IOB Workshop: Biopython
  • 9. Sequences and alignments NCBI EUtils and BLAST The Seq object Phylogenetics SeqIO and the SeqRecord object Protein structures Sequences and Alignments Eric Talevich IOB Workshop: Biopython
  • 10. Sequences and alignments NCBI EUtils and BLAST The Seq object Phylogenetics SeqIO and the SeqRecord object Protein structures The Seq object >>> from Bio.Seq import Seq >>> myseq = Seq(’AGTACACTGGT’) >>> myseq Seq(’AGTACACTGGT’, Alphabet()) >>> print myseq AGTACACTGGT >>> myseq.transcribe() Seq(’AGUACACUGGU’, RNAAlphabet()) >>> myseq.translate() Seq(’STL’, ExtendedIUPACProtein()) Eric Talevich IOB Workshop: Biopython
  • 11. Sequences and alignments NCBI EUtils and BLAST The Seq object Phylogenetics SeqIO and the SeqRecord object Protein structures A Seq object consists of: data — the underlying Python character string alphabet — DNA, RNA, protein, etc. It supports most Python string methods: >>> myseq.count(’GT’) 2 And some biology-specific methods, too: >>> myseq.reverse complement() Seq(’ACCAGTGTACT’, Alphabet()) Intrigued? Read on: >>> help(Seq) Eric Talevich IOB Workshop: Biopython
  • 12. Sequences and alignments NCBI EUtils and BLAST The Seq object Phylogenetics SeqIO and the SeqRecord object Protein structures SeqIO: Sequence Input/Output Sequence data is stored in many different file formats. Bio.SeqIO supports: abi fastq phylip swiss ace genbank pir tab clustal ig qual uniprot-xml embl imgt seqxml emboss nexus sff fasta phd stockholm Manually fetch some data from the PDB website: 5 1ATP.fasta — two protein sequences, FASTA format 1ATP.pdb — the 3D structure, for later 5 http://www.rcsb.org/pdb/explore/explore.do?structureId=1ATP Eric Talevich IOB Workshop: Biopython
  • 13. Sequences and alignments NCBI EUtils and BLAST The Seq object Phylogenetics SeqIO and the SeqRecord object Protein structures The SeqIO API SeqIO provides four functions: parse: Iteratively parse all elements in the file read: Parse a one-element file and return the element write: Write elements to a file convert: Parse one format and immediately write another Biopython uses the same I/O conventions for alignments (AlignIO), BLAST results (Blast), and phylogenetic trees (Phylo). Eric Talevich IOB Workshop: Biopython
  • 14. Sequences and alignments NCBI EUtils and BLAST The Seq object Phylogenetics SeqIO and the SeqRecord object Protein structures The SeqRecord object SeqIO.parse returns SeqRecords. SeqRecord wraps a Seq object and attaches metadata. 1 Pass the file name to the SeqIO parser; specify FASTA format: from Bio import SeqIO seqrecs = SeqIO.parse("1ATP.fasta", "fasta") print seqrecs Eric Talevich IOB Workshop: Biopython
  • 15. Sequences and alignments NCBI EUtils and BLAST The Seq object Phylogenetics SeqIO and the SeqRecord object Protein structures The SeqRecord object SeqIO.parse returns SeqRecords. SeqRecord wraps a Seq object and attaches metadata. 1 Pass the file name to the SeqIO parser; specify FASTA format: from Bio import SeqIO seqrecs = SeqIO.parse("1ATP.fasta", "fasta") print seqrecs 2 To see all records at once, convert the iterator to a list: allrecs = list(seqrecs) print allrecs[0] print allrecs[0].seq Eric Talevich IOB Workshop: Biopython
  • 16. Sequences and alignments NCBI EUtils and BLAST The Seq object Phylogenetics SeqIO and the SeqRecord object Protein structures Example: Shuffled sequences Given a real DNA sequence, create a “background” set of randomized sequences with the same composition. Procedure: 1 Read the source sequence from a file – Use Bio.SeqIO Eric Talevich IOB Workshop: Biopython
  • 17. Sequences and alignments NCBI EUtils and BLAST The Seq object Phylogenetics SeqIO and the SeqRecord object Protein structures Example: Shuffled sequences Given a real DNA sequence, create a “background” set of randomized sequences with the same composition. Procedure: 1 Read the source sequence from a file – Use Bio.SeqIO 2 In a loop: Shuffle the sequence – Use random.shuffle from Python’s standard library Create a new SeqRecord from the shuffled sequence – Because SeqIO.write works with SeqRecords Eric Talevich IOB Workshop: Biopython
  • 18. Example: Shuffled sequences
      Given a real DNA sequence, create a “background” set of randomized sequences with the same composition.
      Procedure:
        1. Read the source sequence from a file (use Bio.SeqIO).
        2. In a loop:
             Shuffle the sequence (use random.shuffle from Python’s standard library).
             Create a new SeqRecord from the shuffled sequence (because SeqIO.write works with SeqRecords).
        3. Write the shuffled SeqRecords to another file.
  • 19.
        import random
        from Bio import SeqIO
        from Bio.Seq import Seq
        from Bio.SeqRecord import SeqRecord

        orig_rec = SeqIO.read("gi2.gb", "genbank")
        alphabet = orig_rec.seq.alphabet
        out_recs = []
        for i in xrange(1, 31):
            nucleotides = list(orig_rec.seq)
            random.shuffle(nucleotides)
            new_seq = Seq("".join(nucleotides), alphabet)
            new_rec = SeqRecord(new_seq, id="shuffle" + str(i))
            out_recs.append(new_rec)
        SeqIO.write(out_recs, "gi2_shuffled.fasta", "fasta")
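A quick way to convince yourself that shuffling keeps the composition: a plain-Python sketch (Python 3 syntax, no Biopython) that shuffles a made-up DNA string and compares base counts. The string stands in for the sequence read from gi2.gb; it is not the real record.

```python
import random
from collections import Counter

# A made-up DNA string standing in for the sequence read from gi2.gb.
original = "ATGGCGTACGCTTAGGCTAACG"

random.seed(0)  # fixed seed so the sketch is repeatable
shuffled = []
for _ in range(30):
    letters = list(original)   # random.shuffle works in place on a list
    random.shuffle(letters)
    shuffled.append("".join(letters))

# Every shuffled copy has the same base counts as the original.
same_composition = all(Counter(s) == Counter(original) for s in shuffled)
print(same_composition)
```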
  • 20. Example: ORF translation
      Split a set of unannotated DNA sequences into unique ORFs, translating in all 6 frames.
      Biopython can help with each piece of this problem:
        1. Parse the given unannotated DNA sequences (SeqIO.parse)
        2. Get the template strand’s sequence (Seq.reverse_complement)
        3. Translate both strands into protein sequences (Seq.translate)
        4. Shift each strand by +1 and +2 for alternate reading frames (string-like Seq slicing)
        5. Split sequences at stop codons (Seq.split("*"))
        6. Write translated sequences to a new file (SeqIO.write)
  • 21.
        def translate_six_frames(seq, table=1):
            """Translate a nucleotide sequence in 6 frames.

            Returns an iterable of 6 translated protein sequences.
            """
            rev = seq.reverse_complement()
            for i in range(3):
                # Coding (Crick) strand
                yield seq[i:].translate(table)
                # Template (Watson) strand
                yield rev[i:].translate(table)
  • 22.
        def translate_orfs(sequences, min_prot_len=60):
            """Find and translate all ORFs in sequences.

            Translates each sequence in all 6 reading frames, splits sequences on
            stop codons, and produces an iterable of all protein sequences of
            length at least min_prot_len.
            """
            for seq in sequences:
                for frame in translate_six_frames(seq):
                    for prot in frame.split("*"):
                        if len(prot) >= min_prot_len:
                            yield prot
  • 23.
        from Bio import SeqIO
        from Bio.SeqRecord import SeqRecord

        if __name__ == "__main__":
            import sys
            infile = sys.stdin
            outfile = sys.stdout
            records = SeqIO.parse(infile, "fasta")
            seqs = (rec.seq for rec in records)
            proteins = translate_orfs(seqs)
            seqrecs = (SeqRecord(seq, id="orf" + str(i))
                       for i, seq in enumerate(proteins))
            SeqIO.write(seqrecs, outfile, "fasta")
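To see the six-frame idea on something small, here is a hedged sketch on a made-up 15-nt sequence. It assumes a recent Biopython on Python 3. The helper names six_frames and orfs are illustrative, not the slide's functions. Unlike the slide code, it trims each frame to whole codons before translating, which is a choice made here to avoid partial-codon warnings.

```python
from Bio.Seq import Seq


def six_frames(seq, table=1):
    """Yield the 6 translations of seq, trimming each frame to whole codons."""
    rev = seq.reverse_complement()
    for i in range(3):
        for strand in (seq, rev):
            frame = strand[i:]
            frame = frame[: len(frame) // 3 * 3]  # drop a trailing partial codon
            yield frame.translate(table)


def orfs(seq, min_prot_len=2):
    """Yield stop-free peptides of at least min_prot_len residues."""
    for frame in six_frames(seq):
        for prot in frame.split("*"):
            if len(prot) >= min_prot_len:
                yield prot


dna = Seq("ATGGCCTAAATGAAA")  # made-up sequence: ATG GCC TAA ATG AAA
frames = [str(f) for f in six_frames(dna)]
peptides = [str(p) for p in orfs(dna)]
print(frames[0], peptides)
```

The tiny min_prot_len is only so the toy sequence yields something; the slide's default of 60 is the more realistic cutoff.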
  • 24. AlignIO and the Alignment object
      Alignment: a set of sequences with the same length and alphabet.
      Use AlignIO just like SeqIO:
        >>> from Bio import AlignIO
        >>> aln = AlignIO.read("PF01601.sto", "stockholm")
        >>> print aln
        SingleLetterAlphabet() alignment with 22 rows and 730 columns
        NCTDAV-----LTYSSFGVCADGSIIA-VQPRNV-----SYDSV...HIQ Q1HVL3_CVH22/539-1170
        NCTTAV-----MTYSNFGICADGSLIP-VRPRNS-----SDNGI...HVQ SPIKE_CVHNL/723-1356
        NCTEPV-----LVYSNIGVCKSGSIGY-VPSQS------GQVKI...HVQ Q692M1_9CORO/740-1383
        NCTEPA-----LVYSNIGVCKNGAIGL-VGIRN------TQPKI...HIQ Q0Q4F4_9CORO/729-1360
        NCTSPR-----LVYSNIGVCTSGAIGL-LSPKX------AQPQI...HVQ Q0Q4F6_9CORO/743-1371
        NCTNPV-----LTYSSYGVCPDGSITR-LGLTD------VQPHF...--T A4ULL0_9CORO/726-1328
        NCTKPV-----LSYGPISVCSDGAIAG-TSTLQN-----TRPSI...KEW A6N263_9CORO/406-1035
        ECDIPIGAGICASYHTVSLLRSTSQKSIVAYTMS------LGAD...HYT Q6T7X8_CVHSA/647-1255
        ...
        DCE-PV-----ITYSNIGVCKNGAFVF-INVTH------SDGDV...HVH Q0PKZ5_CVPPU/797-1449
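A small sketch of poking at an alignment object, assuming a recent Biopython on Python 3. The three aligned sequences are made up and passed in FASTA format from memory, since the Pfam file from the slide is not included here.

```python
from io import StringIO

from Bio import AlignIO

# Three made-up aligned sequences of equal length, in FASTA format.
aligned_fasta = ">a\nNCTDAV--LTY\n>b\nNCTTAV--MTY\n>c\nECDIPIGAGIC\n"

aln = AlignIO.read(StringIO(aligned_fasta), "fasta")

n_rows = len(aln)                      # number of sequences
n_cols = aln.get_alignment_length()    # number of columns
first_column = aln[:, 0]               # one column, returned as a string
row_ids = [rec.id for rec in aln]      # each row is a SeqRecord
print(n_rows, n_cols, first_column, row_ids)
```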
  • 25. Snack Time
  • 26. EUtils and BLAST
  • 29. EUtils: Entrez Programming Utilities
      Access NCBI’s online services:
        from Bio import Entrez, SeqIO
        Entrez.email = "you@uga.edu"
      Request a GenBank record:
        handle = Entrez.efetch(db="protein", id="69316",
                               rettype="gb", retmode="text")
        record = SeqIO.read(handle, "gb")
      Specify multiple IDs in one query:
        handle = Entrez.efetch(db="protein", id="349839,349840",
                               rettype="fasta", retmode="text")
        records = SeqIO.parse(handle, "fasta")
  • 30. Interlude: SeqRecord attributes
        seq: the sequence (Seq) itself
        id: primary ID for the sequence, e.g. accession number (string)
        name: “common” name/id for the sequence, like GenBank LOCUS id
        description: human-readable description of the sequence
        letter_annotations: restricted dictionary of additional info about individual letters in the sequence, e.g. quality scores
        annotations: dictionary of additional unstructured info
        features: list of SeqFeature objects with more structured information, e.g. position of genes on a genome, domains on a protein sequence
        dbxrefs: list of database cross-references (strings)
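The attributes above can be tried out without touching NCBI. This is a minimal sketch assuming a recent Biopython on Python 3; the IDs, description, cross-reference and quality scores are all made up for illustration.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# All values below are made up for illustration.
rec = SeqRecord(
    Seq("ACGTAC"),
    id="X00001.1",
    name="X00001",
    description="hypothetical example sequence",
)
rec.annotations["organism"] = "synthetic construct"
rec.dbxrefs.append("Project:0000")
# Per-letter annotations must have one entry per letter of the sequence.
rec.letter_annotations["phred_quality"] = [40, 38, 35, 30, 22, 10]

# Slicing a SeqRecord slices the per-letter annotations along with it.
head = rec[:3]
print(head.seq, head.letter_annotations["phred_quality"])
```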
  • 31.
        from Bio import Entrez, SeqIO
        Entrez.email = "me@uga.edu"
        handle = Entrez.efetch(db="nucleotide", id="M95169",
                               rettype="gb", retmode="text")
        record = SeqIO.read(handle, "genbank")
        handle.close()
        print record
        print record.features[10]
        sliced = record[20000:]  # Last ~25% of the genome
        print sliced

        from Bio.Seq import Seq
        from Bio.Alphabet import generic_protein
        translations = [f.qualifiers["translation"]
                        for f in record.features[1:]]
        proteins = [Seq(t[0], generic_protein)
                    for t in translations]
  • 32. NCBI Blast
      BLAST can be used either standalone or through NCBI’s server.
      Online:
        >>> from Bio.Blast import NCBIWWW
        >>> result_handle = NCBIWWW.qblast('blastp', 'nr', query_string)
      Standalone:
        “Legacy” (blastall):
          >>> from Bio.Blast.Applications import BlastallCommandline
          >>> help(BlastallCommandline)
        New hotness (Blast+):
          >>> from Bio.Blast.Applications import NcbiblastpCommandline
          >>> help(NcbiblastpCommandline)
  • 33. Parsing BLAST output
      BLAST produces reports in plain-text and XML format. Biopython requests XML by default.
        >>> from Bio.Blast import NCBIWWW, NCBIXML
        >>> result_handle = NCBIWWW.qblast('blastp', 'nr', query_string)
        >>> blast_record = NCBIXML.read(result_handle)
        >>> print blast_record
  • 34.
        # Search for homologs of a protein sequence
        from Bio import SeqIO
        from Bio.Blast import NCBIWWW, NCBIXML

        # Read and reformat the query sequence
        seq_rec = SeqIO.read('gi2.gb', 'gb')
        query = seq_rec.format('fasta')

        # Submit an online BLAST query
        # (This takes some time to run)
        result_handle = NCBIWWW.qblast('blastx', 'nr', query)
  • 35.
        # 1. Save the BLAST results as an XML file
        with open('aprotinin_blast.xml', 'w') as save_file:
            save_file.write(result_handle.read())
        result_handle.close()

        # NB: The BLAST result handle can only be read once
        # Reload it from the file
        with open('aprotinin_blast.xml') as result_handle:
            blast_record = NCBIXML.read(result_handle)
  • 36.
        # 2. Display a histogram of BLAST hit scores
        def get_scores(alignments):
            for aln in alignments:
                for hsp in aln.hsps:
                    yield hsp.score

        scores = list(get_scores(blast_record.alignments))

        # Draw the histogram
        import pylab
        pylab.hist(scores, bins=20)
        pylab.title("Scores of %d BLAST hits" % len(scores))
        pylab.xlabel("BLAST score")
        pylab.ylabel("# hits")
        pylab.show()
        # Save a copy for later
        pylab.savefig('aprotinin_scores.png')
  • 37. Figure: Histogram of BLAST scores generated by pylab
  • 38.
        # 3. Extract the sequences of high-scoring BLAST hits
        from Bio.Seq import Seq
        from Bio.SeqRecord import SeqRecord

        def get_hsps(alignments, threshold):
            for aln in alignments:
                for hsp in aln.hsps:
                    if hsp.score >= threshold:
                        yield SeqRecord(Seq(hsp.sbjct), id=aln.accession)
                        break

        best_seqs = get_hsps(blast_record.alignments, 321)
        SeqIO.write(best_seqs, 'aprotinin.fasta', 'fasta')
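The effect of the break in get_hsps is easy to miss: at most one sequence is kept per hit. Below is a plain-Python sketch (Python 3 syntax) of that control flow, assuming the break sits inside the if. Hsp, Alignment and first_passing_hsps are hypothetical stand-ins that model only the attributes the slide uses; they are not Biopython classes, and the scores and sequences are made up.

```python
from collections import namedtuple

# Hypothetical stand-ins for the parsed BLAST objects; only the attributes
# used by the slide's get_hsps function are modelled.
Hsp = namedtuple("Hsp", ["score", "sbjct"])
Alignment = namedtuple("Alignment", ["accession", "hsps"])

alignments = [
    Alignment("HIT_A", [Hsp(400, "MKVL"), Hsp(350, "KVLA")]),
    Alignment("HIT_B", [Hsp(100, "MRRR")]),
    Alignment("HIT_C", [Hsp(200, "AAAA"), Hsp(330, "MKIL")]),
]


def first_passing_hsps(alignments, threshold):
    """Yield (accession, subject sequence) for the first passing HSP per hit."""
    for aln in alignments:
        for hsp in aln.hsps:
            if hsp.score >= threshold:
                yield aln.accession, hsp.sbjct
                break  # stop scanning this hit; move on to the next one


best = list(first_passing_hsps(alignments, 321))
print(best)
```

HIT_A contributes only its first HSP even though both pass the threshold, which keeps the output to one record per accession.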
  • 39. Calling other external programs
      Biopython has wrappers for other command-line programs in:
        Bio.Blast.Applications: the Blast+ suite
        Bio.Align.Applications: Muscle, ClustalW, ...
        Bio.Emboss.Applications: needle, water, ...
      Let’s re-align our BLAST results using Muscle, and format the alignment for use with stand-alone Phylip.
  • 40.
        from Bio import AlignIO
        from Bio.Align.Applications import MuscleCommandline
        from StringIO import StringIO

        # Construct the shell command
        muscle_cmd = MuscleCommandline(input="aprotinin.fasta")
        # Execute the command
        # Get output (the alignment) and any error messages
        muscle_out, muscle_err = muscle_cmd()
        # Read the alignment back in
        align = AlignIO.read(StringIO(muscle_out), "fasta")
        # Format the alignment for Phylip
        AlignIO.write([align], 'aprotinin.phy', 'phylip')
  • 41. Phylogenetics
  • 42. Phylogenetic tree I/O
      Start with:
        >>> from Bio import Phylo
      Input and output of trees is just like SeqIO:
        read, parse: single or multiple trees in Newick, Nexus and PhyloXML formats
        write: to any of the formats supported by read/parse
        convert: between two formats in one step
      Use StringIO to load strings directly:
        >>> from cStringIO import StringIO
        >>> handle = StringIO("((A,B),(C,(D,E)));")
        >>> tree = Phylo.read(handle, "newick")
  • 43. What’s in a tree?
      Make a tree with branch lengths:
        >>> tree = Phylo.read(StringIO("((A:1,B:1):2,(C:2,(D:1,E:1):1):1);"), "newick")
      View the object structure of the entire tree:
        >>> print tree
      Draw an “ASCII-art” (plain text) representation:
        >>> Phylo.draw_ascii(tree)
      ... OK, let’s do it properly now:
        >>> Phylo.draw(tree)
  • 44. Modify the tree
      Check the tree object for its methods:
        >>> help(tree)
      Try a few:
        >>> tree.get_terminals()
        >>> clade = tree.common_ancestor("A", "B")
        >>> clade.color = "red"
        >>> tree.root_with_outgroup("D", "E")
        >>> tree.ladderize()
        >>> Phylo.draw(tree)
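A few of these methods can be checked without a plotting window. The sketch below assumes a recent Biopython on Python 3, so StringIO comes from io rather than cStringIO. It reuses the branch-length tree from the earlier slide and sticks to the text-based draw_ascii.

```python
from io import StringIO

from Bio import Phylo

tree = Phylo.read(StringIO("((A:1,B:1):2,(C:2,(D:1,E:1):1):1);"), "newick")

leaf_names = [leaf.name for leaf in tree.get_terminals()]

# Path length between two leaves, summed over branch lengths.
dist_ab = tree.distance("A", "B")

# The clade under the most recent common ancestor of D and E.
clade_de = tree.common_ancestor("D", "E")
n_leaves_de = clade_de.count_terminals()

# Text rendering; writes to standard output by default.
Phylo.draw_ascii(tree)
print(leaf_names, dist_ab, n_leaves_de)
```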
  • 45. External applications
      Biopython wraps a number of external programs for phylogenetics. We’re not going to use them now, but here’s where to find them:
        Bio.Phylo.PAML: PAML wrappers & helpers
        Bio.Phylo.Applications: command-line wrapper for PhyML (PhymlCommandline); RAxML and others on the way. (Anything you’d like to see sooner?)
        Bio.Emboss.Applications: other tools ported via Embassy, including Phylip
  • 46. Protein structures
  • 48. Going 3D: The PDB module
      Load a structure:
        >>> from Bio import PDB
        >>> parser = PDB.PDBParser()
        >>> struct = parser.get_structure('1ATP', '1ATP.pdb')
      Inspect the object hierarchy:
        >>> list(struct)
        >>> model = struct[0]
        >>> list(model)
        >>> chain = model['E']
        >>> list(chain)
        >>> residue = chain[15]
        >>> list(residue)
  • 49. Figure: The “SMCRA” (Structure, Model, Chain, Residue, Atom) object hierarchy
  • 50. Extracting a peptide sequence
      Get the amino acid sequence through a Polypeptide object:
        >>> from Bio import PDB
        >>> parser = PDB.PDBParser()
        >>> struct = parser.get_structure('1ATP', '1ATP.pdb')
        >>> ppb = PDB.PPBuilder()
        >>> peptides = ppb.build_peptides(struct)
        >>> for pep in peptides:
        ...     print pep.get_sequence()
  • 51. Calculating RMSD
      Given two aligned structures, filter a list of target residues for high RMS deviation.
      Input:
        list of residue positions (integers)
        two equivalent chains from aligned protein models; residue numbers must match
        minimum RMSD value (float)
      Output: list of residue positions, filtered
      Procedure:
        1. Extract coordinates of Cα atoms
        2. If available (not glycine), extract Cβ coordinates, too
        3. Use Bio.SVDSuperimposer to calculate the RMSD between coordinates
        4. Compare to the given RMSD threshold
  • 52.
        from Bio.SVDSuperimposer import SVDSuperimposer
        from numpy import array

        def filt_rms(resids, refchain, cmpchain, threshold=0.5):
            # Named "sup" to avoid shadowing Python's built-in super()
            sup = SVDSuperimposer()
            for res in resids:
                refres = refchain[res]
                cmpres = cmpchain[res]
                coord1 = [refres['CA'].get_coord()]
                coord2 = [cmpres['CA'].get_coord()]
                if refres.has_id('CB') and cmpres.has_id('CB'):
                    # Not glycine
                    coord1.append(refres['CB'].get_coord())
                    coord2.append(cmpres['CB'].get_coord())
                sup.set(array(coord1), array(coord2))
                rmsd = sup.get_init_rms()
                if rmsd >= threshold:
                    yield res
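For intuition about the number being thresholded: as far as I understand it, get_init_rms reports the RMSD of the coordinates as given, before any rotation or translation, which is why the structures must already be aligned. The plain-Python sketch below (Python 3, no Biopython or numpy) computes that quantity for made-up coordinates. The rmsd helper is illustrative and is not part of Bio.SVDSuperimposer.

```python
import math


def rmsd(coords1, coords2):
    """RMSD between paired 3D points, with no superposition applied."""
    if len(coords1) != len(coords2) or not coords1:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(coords1, coords2):
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(total / len(coords1))


# Made-up CA and CB coordinates (in angstroms) for one residue in two models.
ref_atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
cmp_atoms = [(0.0, 0.0, 1.0), (1.5, 1.0, 0.0)]

value = rmsd(ref_atoms, cmp_atoms)
deviates = value >= 0.5  # same comparison as the threshold test in the slide
print(value, deviates)
```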
  • 53. Figure: Superimposed structures, with selected deviating residues
  • 54. Further reading
        Biopython tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html
        Biopython wiki: http://biopython.org/
        This presentation: http://www.slideshare.net/etalevich/biopython-programming-workshop-at-uga
  • 55. Thanks. ’Preciate it. Gracias.