FBW
27-10-2015
Wim Van Criekinge
Bioinformatics.be
GitHub: Hosted GIT
• Largest open source git hosting site
• Public and private options
• User-centric rather than project-centric
• http://github.ugent.be (use your UGent login and password)
– Accept invitation from Bioinformatics-I-2015
URI:
– https://github.ugent.be/Bioinformatics-I-2015/Python.git
Control Structures
if condition:
    statements
[elif condition:
    statements] ...
else:
    statements
while condition:
    statements
for var in sequence:
    statements
break
continue
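A small runnable sketch of the control structures listed above. The function names `classify_lengths` and `count_halvings` and the sample numbers are made up for illustration.

```python
def classify_lengths(lengths, limit=1000):
    """Label each length; skip invalid values, stop at the first one above `limit`."""
    labels = []
    for n in lengths:              # for var in sequence
        if n <= 0:
            continue               # skip this item and go on with the next one
        if n > limit:
            break                  # leave the loop entirely
        if n < 100:                # if / elif / else
            labels.append("short")
        elif n < 500:
            labels.append("medium")
        else:
            labels.append("long")
    return labels


def count_halvings(n):
    """Count how often n can be halved before it drops below 1 (while loop)."""
    steps = 0
    while n >= 1:                  # while condition
        n = n / 2
        steps += 1
    return steps


print(classify_lengths([50, 0, 260, 700, 5000, 80]))
print(count_halvings(8))
```

Note that the value 80 is never reached: `break` ends the loop as soon as 5000 is seen.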
Lists
• Flexible arrays, not Lisp-like linked
lists
• a = [99, "bottles of beer", ["on", "the",
"wall"]]
• Same operators as for strings
• a+b, a*3, a[0], a[-1], a[1:], len(a)
• Item and slice assignment
• a[0] = 98
• a[1:2] = ["bottles", "of", "beer"]
-> [98, "bottles", "of", "beer", ["on", "the", "wall"]]
• del a[-1] # -> [98, "bottles", "of", "beer"]
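The list operations above can be tried out as a short script. This is only a sketch that replays the slide's own example; the variable names `first`, `last`, `tail` and `doubled` are added for illustration.

```python
a = [99, "bottles of beer", ["on", "the", "wall"]]

first, last = a[0], a[-1]            # indexing from the front and from the back
tail = a[1:]                         # slicing returns a new list
doubled = a[:1] * 2                  # repetition -> [99, 99]

a[0] = 98                            # item assignment
a[1:2] = ["bottles", "of", "beer"]   # slice assignment may change the length
del a[-1]                            # remove the nested list at the end

print(a)
print(len(a))
```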
Dictionaries
• Hash tables, "associative arrays"
• d = {"duck": "eend", "water": "water"}
• Lookup:
• d["duck"] -> "eend"
• d["back"] # raises KeyError exception
• Delete, insert, overwrite:
• del d["water"] # {"duck": "eend"}
• d["back"] = "rug" # {"duck": "eend", "back": "rug"}
• d["duck"] = "duik" # {"duck": "duik", "back": "rug"}
Regex.py
import re
text = 'abbaaabbbbaaaaa'
pattern = 'ab'
for match in re.finditer(pattern, text):
    s = match.start()
    e = match.end()
    print('Found "%s" at %d:%d' % (text[s:e], s, e))
# line: one line of input text (defined elsewhere)
m = re.search("^([A-Z]) ", line)
if m:
    from_letter = m.groups()[0]
Question 3. Swiss-Knife.py
• Using a database as input: parse the entire Swiss-Prot collection
– How many entries are there?
– Average protein length (in aa and MW)
– Relative frequency of amino acids
• Compare to the ones used to construct the PAM scoring matrices from 1978 – 1991
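One possible way to approach the first questions in plain Python is sketched below. It assumes the usual Swiss-Prot flat-file layout (an `ID` line, an `SQ` line followed by indented sequence lines, and `//` closing each record). The function names and the two sample entries are invented for illustration and are not real Swiss-Prot records.

```python
from collections import Counter


def read_swissprot(lines):
    """Yield (entry_name, sequence) for each record in Swiss-Prot flat-file lines."""
    name, chunks, in_seq = None, [], False
    for line in lines:
        if line.startswith("ID   "):
            name = line.split()[1]
        elif line.startswith("SQ   "):
            in_seq = True                      # sequence lines follow
        elif line.startswith("//"):            # end of record
            yield name, "".join(chunks)
            name, chunks, in_seq = None, [], False
        elif in_seq:
            chunks.append(line.replace(" ", "").strip())


def summarise(records):
    """Return (number of entries, mean length, relative residue frequencies)."""
    n_entries, total_len, counts = 0, 0, Counter()
    for _name, seq in records:
        n_entries += 1
        total_len += len(seq)
        counts.update(seq)
    freqs = {aa: c / total_len for aa, c in counts.items()}
    return n_entries, total_len / n_entries, freqs


# Tiny made-up sample in the flat-file layout (not real entries).
sample = """\
ID   TEST1_HUMAN             Reviewed;          12 AA.
SQ   SEQUENCE   12 AA;  1300 MW;  0000000000000000 CRC64;
     MKLAAGLLKK AA
//
ID   TEST2_HUMAN             Reviewed;           8 AA.
SQ   SEQUENCE   8 AA;  900 MW;  0000000000000000 CRC64;
     MGSNKSKP
//
""".splitlines()

n, mean_len, freqs = summarise(read_swissprot(sample))
print(n, mean_len, round(freqs["K"], 3))

# For the real download (the local file name is an assumption):
# import gzip
# with gzip.open("uniprot_sprot.dat.gz", "rt") as handle:
#     n, mean_len, freqs = summarise(read_swissprot(handle))
```

Because `read_swissprot` is a generator, the full file is processed one record at a time and never has to fit in memory.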
Question 3: Getting the database
Uniprot_sprot.dat.gz – 528 MB
(save on your network drive H:)
Unzipped: 2.92 GB!
http://www.ebi.ac.uk/uniprot/download-center
Amino acid frequencies
1978 1991
L 0.085 0.091
A 0.087 0.077
G 0.089 0.074
S 0.070 0.069
V 0.065 0.066
E 0.050 0.062
T 0.058 0.059
K 0.081 0.059
I 0.037 0.053
D 0.047 0.052
R 0.041 0.051
P 0.051 0.051
N 0.040 0.043
Q 0.038 0.041
F 0.040 0.040
Y 0.030 0.032
M 0.015 0.024
H 0.034 0.023
C 0.033 0.020
W 0.010 0.014
Second step: Frequencies of Occurrence
Extra Questions
• How many records have a sequence of length 260?
• What are the first 20 residues of 143X_MAIZE?
• What is the identifier for the record with the
shortest sequence? Is there more than one record
with that length?
• What is the identifier for the record with the
longest sequence? Is there more than one record
with that length?
• How many contain the subsequence "ARRA"?
• How many contain the substring "KCIP-1" in the
description?
Perl / Python OO
• A class is a package
• An object is a reference to a data
structure (usually a hash) in a class
• A method is a subroutine in the class
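In Python the same three ideas map onto the `class` statement, instances, and functions defined inside the class. The sketch below uses a hypothetical `ProteinRecord` class with a made-up identifier and sequence.

```python
class ProteinRecord:
    """A class bundles data (attributes) with the functions that act on it."""

    def __init__(self, identifier, sequence):
        # Attributes live in the instance; by default they are stored in a
        # dictionary (self.__dict__), much like the hash behind a Perl object.
        self.identifier = identifier
        self.sequence = sequence

    def length(self):
        """A method is a function defined in the class; self is the object."""
        return len(self.sequence)

    def contains(self, motif):
        """True if the motif occurs anywhere in the sequence."""
        return motif in self.sequence


rec = ProteinRecord("demo_1", "MKARRAGL")   # an object is an instance of the class
print(rec.identifier, rec.length(), rec.contains("ARRA"))
print(rec.__dict__)
```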
Biopython functionality and tools
• The ability to parse bioinformatics files into Python
utilizable data structures
• Support the following formats:
– Blast output
– Clustalw
– FASTA
– PubMed and Medline
– ExPASy files
– SCOP
– SwissProt
– PDB
• Files in the supported formats can be iterated over
record by record or indexed and accessed via a
dictionary interface
Biopython functionality and tools
• Code to deal with on-line bioinformatics destinations (NCBI,
ExPASy)
• Interface to common bioinformatics programs (Blast,
ClustalW)
• A sequence object for handling sequences, sequence IDs and sequence features
• Tools for operations on sequences
• Tools for dealing with alignments
• Tools to manage protein structures
• Tools to run applications
Install Biopython
The Biopython module name is Bio
It must be downloaded and installed
(http://biopython.org/wiki/Download)
You need to install numpy first
>>>import Bio
Install Biopython
pip is the preferred installer program.
Starting with Python 3.4, it is included
by default with the Python binary
installers.
pip3.5 install Biopython
#pip3.5 install yahoo_finance
from yahoo_finance import Share
yahoo = Share('AAPL')
print (yahoo.get_open())
Run Install.py (is BioPython installed ?)
import pip
import sys
import platform
import webbrowser
print("Python " + platform.python_version() + " installed packages:")
# Note: pip.get_installed_distributions() is only available in pip versions before 10
installed_packages = pip.get_installed_distributions()
installed_packages_list = sorted(["%s==%s" % (i.key, i.version)
                                  for i in installed_packages])
print(*installed_packages_list, sep="\n")
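On newer Python versions the same check can be done without importing pip. The sketch below assumes Python 3.8 or later, where `importlib.metadata` is part of the standard library. The helper names `installed_packages` and `is_installed` are made up for this example.

```python
import platform
from importlib import metadata   # standard library from Python 3.8 on


def installed_packages():
    """Return sorted 'name==version' strings for all installed distributions."""
    return sorted(
        "%s==%s" % (dist.metadata["Name"], dist.version)
        for dist in metadata.distributions()
    )


def is_installed(name):
    """True if a distribution called `name` can be found."""
    try:
        metadata.version(name)
        return True
    except metadata.PackageNotFoundError:
        return False


print("Python " + platform.python_version() + " installed packages:")
print(*installed_packages(), sep="\n")
print("Biopython installed:", is_installed("biopython"))
```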
BioPython
• Make a histogram of the MW (in kDa) of all proteins in
Swiss-Prot
• Find the most basic and most acidic protein in Swiss-Prot?
• Biological relevance of the results ?
From AAIndex
H ZIMJ680104
D Isoelectric point (Zimmerman et al., 1968)
R LIT:2004109b PMID:5700434
A Zimmerman, J.M., Eliezer, N. and Simha, R.
T The characterization of amino acid sequences in proteins by statistical methods
J J. Theor. Biol. 21, 170-201 (1968)
C KLEP840101 0.941 FAUJ880111 0.813 FINA910103 0.805
I    A/L     R/K     N/M     D/F     C/P     Q/S     E/T     G/W     H/Y     I/V
    6.00   10.76    5.41    2.77    5.05    5.65    3.22    5.97    7.59    6.02
    5.98    9.74    5.74    5.48    6.30    5.68    5.66    5.89    5.66    5.96
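The index block above can be turned into a per-residue lookup table. In the `I` line each column header such as `A/L` names the residue for the first and the second row of values. A minimal sketch, with the three lines copied from the entry above and the variable names chosen for illustration:

```python
header = "A/L R/K N/M D/F C/P Q/S E/T G/W H/Y I/V"
row1 = "6.00 10.76 5.41 2.77 5.05 5.65 3.22 5.97 7.59 6.02"
row2 = "5.98 9.74 5.74 5.48 6.30 5.68 5.66 5.89 5.66 5.96"

pI = {}
for pair, v1, v2 in zip(header.split(), row1.split(), row2.split()):
    first, second = pair.split("/")   # "A/L": A belongs to row 1, L to row 2
    pI[first] = float(v1)
    pI[second] = float(v2)

most_basic = max(pI, key=pI.get)      # residue with the highest index value
most_acidic = min(pI, key=pI.get)     # residue with the lowest index value
print(len(pI), most_basic, most_acidic)
```

These are values for single amino acids. Ranking whole proteins still requires computing a value per sequence.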
• Introduction to Biopython
– Sequence objects (I)
– Sequence Record objects (I)
– Protein structures (PDB module) (II)
• Working with DNA and protein sequences
– Transcription and Translation
• Extracting information from biological resources
– Parsing Swiss-Prot files (I)
– Parsing BLAST output (I)
– Accessing NCBI’s Entrez databases (II)
– Parsing Medline records (II)
• Running external applications (e.g. BLAST) locally and from a
script
– Running BLAST over the Internet
– Running BLAST locally
• Working with motifs
– Parsing PROSITE records
– Parsing PROSITE documentation records
Introduction to Biopython (I)
• Sequence objects
• Sequence Record objects
Sequence Object
• Seq objects vs Python strings:
– They have different methods
– The Seq object has the attribute alphabet
(biological meaning of Seq)
>>> import Bio
>>> from Bio.Seq import Seq
>>> my_seq = Seq("AGTACACTGGT")
>>> my_seq
Seq('AGTACACTGGT', Alphabet())
>>> print(my_seq)
AGTACACTGGT
>>> my_seq.alphabet
Alphabet()
>>>
The alphabet attribute
• Alphabets are defined in the Bio.Alphabet module
• We will use the IUPAC alphabets
(http://www.chem.qmw.ac.uk/iupac)
• Bio.Alphabet.IUPAC provides definitions for DNA, RNA and
proteins + provides extension and customization of basic
definitions:
– IUPACProtein (IUPAC standard AA)
– ExtendedIUPACProtein (+ selenocysteine, X,
etc)
– IUPACUnambiguousDNA (basic GATC letters)
– IUPACAmbiguousDNA (+ ambiguity letters)
– ExtendedIUPACDNA (+ modified bases)
– IUPACUnambiguousRNA
– IUPACAmbiguousRNA
>>> import Bio
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("AGTACACTGGT", IUPAC.unambiguous_dna)
>>> my_seq
Seq('AGTACACTGGT', IUPACUnambiguousDNA())
>>> my_seq.alphabet
IUPACUnambiguousDNA()
>>> my_seq = Seq("AGTACACTGGT", IUPAC.protein)
>>> my_seq
Seq('AGTACACTGGT', IUPACProtein())
>>> my_seq.alphabet
IUPACProtein()
>>>
The alphabet attribute
>>> my_seq = Seq("AGTAACCCTTAGCACTGGT", IUPAC.unambiguous_dna)
>>> for index, letter in enumerate(my_seq):
...     print(index, letter)
...
0 A
1 G
2 T
3 A
4 A
5 C
...etc
>>> print(len(my_seq))
19
>>> print(my_seq[0])
A
>>> print(my_seq[2:10])
TAACCCTT
>>> my_seq.count('A')
5
>>> 100*float(my_seq.count('C')+my_seq.count('G'))/len(my_seq)
47.368421052631582
Sequences act like strings
>>> my_seq = Seq("AGTAACCCTTAGCACTGGT", IUPAC.unambiguous_dna)
>>> str(my_seq)
'AGTAACCCTTAGCACTGGT'
>>> print(my_seq)
AGTAACCCTTAGCACTGGT
>>> fasta_format_string = ">DNA_id\n%s\n" % my_seq
>>> print(fasta_format_string)
>DNA_id
AGTAACCCTTAGCACTGGT
# Biopython 1.44 or older
>>> my_seq.tostring()
'AGTAACCCTTAGCACTGGT'
Turn Seq objects into strings
You may need the plain sequence string (e.g. to write to a file or to insert
into a database)
>>> dna_seq = Seq("AGTAACCCTTAGCACTGGT", IUPAC.unambiguous_dna)
>>> protein_seq = Seq("KSMKPPRTHLIMHWIIL", IUPAC.IUPACProtein())
>>> protein_seq + dna_seq
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "/home/abarbato/biopython-1.53/build/lib.linux-x86_64-2.4/Bio/Seq.py", line 216, in __add__
raise TypeError("Incompatable alphabets %s and %s"
TypeError: Incompatable alphabets IUPACProtein() and IUPACUnambiguousDNA()
BUT, if you give generic alphabet to dna_seq and protein_seq:
>>> from Bio.Alphabet import generic_alphabet
>>> dna_seq.alphabet = generic_alphabet
>>> protein_seq.alphabet = generic_alphabet
>>> protein_seq + dna_seq
Seq('KSMKPPRTHLIMHWIILAGTAACCCTTAGCACTGGT', Alphabet())
Concatenating sequences
You can’t add sequences with incompatible alphabets (protein sequence
and DNA sequence)
>>> from Bio.Alphabet import generic_dna
>>> dna_seq = Seq("acgtACGT", generic_dna)
>>> dna_seq.upper()
Seq('ACGTACGT', DNAAlphabet())
>>> dna_seq.lower()
Seq('acgtacgt', DNAAlphabet())
>>>
Changing case
Seq objects have upper() and lower() methods
Note that the IUPAC alphabets are for upper case only
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> dna_seq = Seq("AGTAACCCTTAGCACTGGT", IUPAC.unambiguous_dna)
>>> dna_seq.complement()
Seq('TCATTGGGAATCGTGACCA', IUPACUnambiguousDNA())
>>> dna_seq.reverse_complement()
Seq('ACCAGTGCTAAGGGTTACT', IUPACUnambiguousDNA())
Nucleotide sequences and (reverse) complements
Seq objects have complement() and reverse_complement() methods
Note that these operations are not allowed with protein alphabets
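What the two methods compute can be reproduced on plain Python strings. This is an illustrative sketch, not Biopython's implementation: it handles only the four unambiguous bases, and the function names are chosen to mirror the Seq methods.

```python
# Translation table for the four unambiguous bases, upper and lower case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(dna):
    """Replace every base by its Watson-Crick partner."""
    return dna.translate(COMPLEMENT)


def reverse_complement(dna):
    """Complement the sequence and read it in the opposite direction."""
    return complement(dna)[::-1]


dna = "AGTAACCCTTAGCACTGGT"
print(complement(dna))
print(reverse_complement(dna))
```

The two printed strings match the Seq results shown for the same sequence.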
Transcription
>>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG",
...                  IUPAC.unambiguous_dna)
>>> template_dna = coding_dna.reverse_complement()
>>> template_dna
Seq('CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT',
IUPACUnambiguousDNA())
>>> messenger_rna = coding_dna.transcribe()
>>> messenger_rna
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG',
IUPACUnambiguousRNA())
>>> messenger_rna.back_transcribe()
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG',
IUPACUnambiguousDNA())
Note: all transcribe() does is switch T --> U and adjust the alphabet.
The Seq object also includes a back-transcription method:
Translation
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> messenger_rna = Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG',
IUPAC.unambiguous_rna)
>>> messenger_rna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
>>>
You can also translate directly from the coding strand DNA sequence
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG",
IUPAC.unambiguous_dna)
>>> coding_dna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
>>>
Translation with different translation tables
>>> coding_dna.translate(table="Vertebrate Mitochondrial")
Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))
>>> coding_dna.translate(table=2)
Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))
>>> coding_dna.translate(to_stop = True)
Seq('MAIVMGR', IUPACProtein())
>>> coding_dna.translate(table=2,to_stop = True)
Seq('MAIVMGRWKGAR', IUPACProtein())
Translation tables available in Biopython are based on those from the NCBI.
By default, translation will use the standard genetic code (NCBI table id 1)
If you deal with mitochondrial sequences:
If you want to translate the nucleotides up to the first in frame stop, and
then stop (as happens in nature):
Translation tables
>>> from Bio.Data import CodonTable
>>> standard_table = CodonTable.unambiguous_dna_by_name["Standard"]
>>> mito_table = CodonTable.unambiguous_dna_by_name["Vertebrate Mitochondrial"]
# Using the NCBI table ids:
>>> standard_table = CodonTable.unambiguous_dna_by_id[1]
>>> mito_table = CodonTable.unambiguous_dna_by_id[2]
Translation tables available in Biopython are based on those from the NCBI.
By default, translation will use the standard genetic code (NCBI table id 1)
http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
Translation tables
>>> print(standard_table)
Table 1 Standard, SGC0
  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--
Translation tables
>>> print(mito_table)
Table 2 Vertebrate Mitochondrial, SGC1
  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA W   | A
T | TTG L   | TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L   | CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I(s)| ACT T   | AAT N   | AGT S   | T
A | ATC I(s)| ACC T   | AAC N   | AGC S   | C
A | ATA M(s)| ACA T   | AAA K   | AGA Stop| A
A | ATG M(s)| ACG T   | AAG K   | AGG Stop| G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V(s)| GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--
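To see how such tables drive translation, the two genetic codes can be written as plain dictionaries. This is a simplified sketch rather than Biopython's code: start-codon marks are ignored, and only complete codons made of the letters T, C, A and G are handled. The 64-letter strings list the amino acids in the row order of the tables, with `*` for a stop.

```python
from itertools import product

BASES = "TCAG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]   # TTT, TTC, TTA, TTG, TCT, ...

STANDARD = dict(zip(
    CODONS, "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"))
VERT_MITO = dict(zip(
    CODONS, "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"))


def translate(dna, table=STANDARD, to_stop=False):
    """Translate codon by codon; with to_stop=True, end before the first stop."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = table[dna[i:i + 3].upper()]
        if aa == "*" and to_stop:
            break
        protein.append(aa)
    return "".join(protein)


coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(translate(coding_dna))
print(translate(coding_dna, table=VERT_MITO))
print(translate(coding_dna, to_stop=True))

# Codons that are read differently by the two tables
print(sorted(c for c in CODONS if STANDARD[c] != VERT_MITO[c]))
```

The last line lists the four codons on which the tables disagree, which is why TGA in the example sequence gives a stop in one table and W in the other.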
MutableSeq objects
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq('CGCGCGGGTTTATGATGACCCAAATATAGAGGGCACAC',
...              IUPAC.unambiguous_dna)
>>> my_seq[5] = 'A'
Traceback (most recent call last):
File "<stdin>", line 1, in ?
TypeError: object does not support item assignment
>>>
Like Python strings, Seq objects are immutable
However, you can convert it into a mutable sequence (a MutableSeq object)
>>> mutable_seq = my_seq.tomutable()
>>> mutable_seq
MutableSeq('CGCGCGGGTTTATGATGACCCAAATATAGAGGGCACAC',
IUPACUnambiguousDNA())
MutableSeq objects
>>> from Bio.Seq import MutableSeq
>>> from Bio.Alphabet import IUPAC
>>> mutable_seq = MutableSeq('CGCGCGGGTTTATGATGACCCAAATATAGAGGGCACAC',
...                          IUPAC.unambiguous_dna)
>>> mutable_seq[5] = 'A'
>>> mutable_seq
MutableSeq('CGCGCAGGTTTATGATGACCCAAATATAGAGGGCACAC',
IUPACUnambiguousDNA())
You can create a mutable object directly
A MutableSeq object can be easily converted into a read-only sequence:
>>> new_seq = mutable_seq.toseq()
>>> new_seq
Seq('CGCGCAGGTTTATGATGACCCAAATATAGAGGGCACAC',
IUPACUnambiguousDNA())
Sequence Record objects
The SeqRecord class is defined in the Bio.SeqRecord module
This class allows higher level features such as identifiers and features to be
associated with a sequence
>>> from Bio.SeqRecord import SeqRecord
>>> help(SeqRecord)
class SeqRecord(__builtin__.object)
A SeqRecord object holds a sequence and information about it.
Main attributes:
id - Identifier such as a locus tag (string)
seq - The sequence itself (Seq object or similar)
Additional attributes:
name - Sequence name, e.g. gene name (string)
description - Additional text (string)
dbxrefs - List of db cross references (list of strings)
features - Any (sub)features defined (list of SeqFeature objects)
annotations - Further information about the whole sequence (dictionary)
Most entries are strings, or lists of strings.
letter_annotations -
Per letter/symbol annotation (restricted dictionary). This holds
Python sequences (lists, strings or tuples) whose length
matches that of the sequence. A typical use would be to hold a
list of integers representing sequencing quality scores, or a
string representing the secondary structure.
>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord
>>> TMP = Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF')
>>> TMP_r = SeqRecord(TMP)
>>> TMP_r.id
'<unknown id>'
>>> TMP_r.id = 'YP_025292.1'
>>> TMP_r.description = 'toxic membrane protein'
>>> print(TMP_r)
ID: YP_025292.1
Name: <unknown name>
Description: toxic membrane protein
Number of features: 0
Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF',
Alphabet())
>>> TMP_r.seq
Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF',
Alphabet())
You will typically use Bio.SeqIO to read in sequences from files as
SeqRecord objects. However, you may want to create your own SeqRecord
objects directly:
>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord
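>>> from Bio.Alphabet import IUPAC
>>> record = SeqRecord(Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF',
...                        IUPAC.protein),
...                    id='YP_025292.1', name='HokC',
...                    description='toxic membrane protein')
>>> record
SeqRecord(seq=Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF',
IUPACProtein()), id='YP_025292.1', name='HokC',
description='toxic membrane protein', dbxrefs=[])
>>> print(record)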
ID: YP_025292.1
Name: HokC
Description: toxic membrane protein
Number of features: 0
Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF',
IUPACProtein())
>>>
You can also create your own SeqRecord objects as follows:
The format() method
It returns a string containing your record formatted using one of the output
file formats supported by Bio.SeqIO
>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord
>>> from Bio.Alphabet import generic_protein
>>> rec = SeqRecord(
...     Seq("MGSNKSKPKDASQRRRSLEPSENVHGAGGAFPASQTPSKPASADGHRGPSA"
...         "AFVPPAAEPKLFGGFNSSDTVTSPQRAGALAGGVTTFVALYDYESRTETDLSFKKGERLQIVNNTR"
...         "KVDVREGDWWLAHSLSTGQTGYIPS", generic_protein),
...     id="P05480",
...     description="SRC_MOUSE Neuronal proto-oncogene tyrosine-protein "
...                 "kinase Src: MY TEST")
>>> print(rec.format("fasta"))
>P05480 SRC_MOUSE Neuronal proto-oncogene tyrosine-protein kinase
Src: MY TEST
MGSNKSKPKDASQRRRSLEPSENVHGAGGAFPASQTPSKPASADGHRGPSAAFVPPAAEP
KLFGGFNSSDTVTSPQRAGALAGGVTTFVALYDYESRTETDLSFKKGERLQIVNNTRKVD
VREGDWWLAHSLSTGQTGYIPS
INPUT FILE
SCRIPT.py
OUTPUT FILE
Seq1 “ACTGGGAGCTAGC”
Seq2 “TTGATCGATCGATCG”
Seq3 “GTGTAGCTGCT”
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein
F = open("input.txt")
Out = open("output.txt", "w")  # output file name is a placeholder
for line in F:
    <parse line>
    <get seq id>
    <get description>
    <get sequence>
    <get other info>
    rec = SeqRecord(Seq(<sequence>, alphabet), id=<seq_id>, description=<description>)
    Format_rec = rec.format("fasta")
    Out.write(Format_rec)
>P05480 SRC_MOUSE Neuronal proto-oncogene tyrosine-protein
kinase Src: MY TEST
MGSNKSKPKDASQRRRSLEPSENVHGAGGAFPASQTPSKPASADGHRGPSAAFVPPAAEP
KLFGGFNSSDTVTSPQRAGALAGGVTTFVALYDYESRTETDLSFKKGERLQIVNNTRKVD
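The input, script and output pattern above can be made concrete without Biopython. The sketch below assumes each input line holds an identifier followed by a quoted sequence, as in the Seq1 to Seq3 example. The function names `to_fasta` and `convert` are made up, and in-memory text handles stand in for the input and output files.

```python
import io


def to_fasta(seq_id, sequence, description="", width=60):
    """Format one record as FASTA text, wrapping the sequence at `width` letters."""
    header = (">%s %s" % (seq_id, description)).rstrip()
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return header + "\n" + "\n".join(lines) + "\n"


def convert(in_handle, out_handle):
    """Read 'id sequence' lines and write FASTA records; return the record count."""
    count = 0
    for line in in_handle:
        fields = line.split()
        if len(fields) < 2:
            continue                          # skip blank or malformed lines
        seq_id, sequence = fields[0], fields[1].strip('"')
        out_handle.write(to_fasta(seq_id, sequence))
        count += 1
    return count


source = io.StringIO(
    'Seq1 "ACTGGGAGCTAGC"\nSeq2 "TTGATCGATCGATCG"\nSeq3 "GTGTAGCTGCT"\n')
target = io.StringIO()
print(convert(source, target))
print(target.getvalue())
```

With real files, `source` and `target` would be replaced by handles from `open()`. The loop itself stays the same.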
Extracting information from biological resources:
parsing Swiss-Prot files, PDB files, ENSEMBLE records,
blast output files, etc.
• Sequence I/O
– Parsing or Reading Sequences
– Writing Sequence Files
A simple interface for working with assorted file formats in a uniform way
>>>from Bio import SeqIO
>>>help(SeqIO)
Bio.SeqIO
Bio.SeqIO.parse()
Reads in sequence data as SeqRecord objects.
It expects two arguments:
• A handle to read the data from. It can be:
– a file opened for reading
– the output from a command line program
– data downloaded from the internet
• A lower case string specifying the sequence format (see
http://biopython.org/wiki/SeqIO for a full listing of supported
formats).
The object returned by Bio.SeqIO.parse() is an iterator which returns SeqRecord
objects
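To illustrate what "an iterator over records" means, here is a simplified FASTA reader written as a Python generator. It is a sketch of the idea only, not how Bio.SeqIO is implemented: it yields plain tuples instead of SeqRecord objects, and the headers and sequences in the demo are made up.

```python
import io


def _make_record(header, chunks):
    """Build an (identifier, description, sequence) tuple from one record."""
    identifier = header.split(None, 1)[0] if header else ""
    return identifier, header, "".join(chunks)


def iter_fasta(handle):
    """Yield one (identifier, description, sequence) tuple per FASTA record."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)   # previous record is complete
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)           # last record in the file


handle = io.StringIO(">sp|P1|A_HUMAN first\nMKV\nLAA\n>sp|P2|B_HUMAN second\nGG\n")
records = iter_fasta(handle)
first = next(records)      # records are produced one at a time, on demand
rest = list(records)       # consume whatever is left
print(first)
print(rest)
```

Any handle that yields text lines works here, which is the same reason SeqIO.parse() accepts files, program output or downloaded data alike.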
>>> from Bio import SeqIO
>>> handle = open("P05480.fasta")
>>> for seq_rec in SeqIO.parse(handle, "fasta"):
...     print(seq_rec.id)
...     print(repr(seq_rec.seq))
...     print(len(seq_rec))
...
sp|P05480|SRC_MOUSE
Seq('MGSNKSKPKDASQRRRSLERGPSA...ENL', SingleLetterAlphabet())
541
>>> handle.close()
>>> for seq_rec in SeqIO.parse(handle, "genbank"):
...     print(seq_rec.id)
...     print(repr(seq_rec.seq))
...     print(len(seq_rec))
...
U49845.1
Seq('GATCCTCCATATACAACGGTACGGAA...ATC', IUPACAmbiguousDNA())
5028
>>> handle.close()
>>> from Bio import SeqIO
>>> handle = open("AP006852.gbk")
>>> for seq_rec in SeqIO.parse(handle, "genbank"):
...     print(seq_rec.id)
...     print(repr(seq_rec.seq))
...     print(len(seq_rec))
...
AP006852.1
Seq('CCACTGTCCAATACCCCCAACAGGAAT...TGT', IUPACAmbiguousDNA())
949626
>>>
>>>handle = open("AP006852.gbk")
>>>identifiers=[seq_rec.id for seq_rec in SeqIO.parse(handle,"genbank")]
>>>handle.close()
>>>identifiers
['AP006852.1']
>>>
Candida albicans genomic DNA, chromosome 7, complete sequence
Using list comprehension:
>>> from Bio import SeqIO
>>> handle = open("sprot_prot.fasta")
>>> ids = [seq_rec.id for seq_rec in SeqIO.parse(handle,"fasta")]
>>> ids
['sp|P24928|RPB1_HUMAN', 'sp|Q9NVU0|RPC5_HUMAN',
'sp|Q9BUI4|RPC3_HUMAN', 'sp|Q9BUI4|RPC3_HUMAN',
'sp|Q9NW08|RPC2_HUMAN', 'sp|Q9H1D9|RPC6_HUMAN',
'sp|P19387|RPB3_HUMAN', 'sp|O14802|RPC1_HUMAN',
'sp|P52435|RPB11_HUMAN', 'sp|O15318|RPC7_HUMAN',
'sp|P62487|RPB7_HUMAN', 'sp|O15514|RPB4_HUMAN',
'sp|Q9GZS1|RPA49_HUMAN', 'sp|P36954|RPB9_HUMAN',
'sp|Q9Y535|RPC8_HUMAN', 'sp|O95602|RPA1_HUMAN',
'sp|Q9Y2Y1|RPC10_HUMAN', 'sp|Q9H9Y6|RPA2_HUMAN',
'sp|P78527|PRKDC_HUMAN', 'sp|O15160|RPAC1_HUMAN',…,
'sp|Q9BWH6|RPAP1_HUMAN']
>>>
Here we do it using the sprot_prot.fasta file
Iterating over the records in a sequence file
Instead of using a for loop, you can also use the next() method of an
iterator to step through the entries
>>> handle = open("sprot_prot.fasta")
>>> rec_iter = SeqIO.parse(handle, "fasta")
>>> rec_1 = next(rec_iter)
>>> rec_1
SeqRecord(seq=Seq('MHGGGPPSGDSACPLRTIKRVQFGVLSPDELKRMSVTEGGIKYPET
TEGGRPKL...EEN', SingleLetterAlphabet()),
id='sp|P24928|RPB1_HUMAN', name='sp|P24928|RPB1_HUMAN',
description='sp|P24928|RPB1_HUMAN DNA-directed RNA polymerase II
subunit RPB1 OS=Homo sapiens GN=POLR2A PE=1 SV=2', dbxrefs=[])
>>> rec_2 = next(rec_iter)
>>> rec_2
SeqRecord(seq=Seq('MANEEDDPVVQEIDVYLAKSLAEKLYLFQYPVRPASMTYDDIPHLS
AKIKPKQQ...VQS', SingleLetterAlphabet()),
id='sp|Q9NVU0|RPC5_HUMAN', name='sp|Q9NVU0|RPC5_HUMAN',
description='sp|Q9NVU0|RPC5_HUMAN DNA-directed RNA polymerase III
subunit RPC5 OS=Homo sapiens GN=POLR3E PE=1 SV=1', dbxrefs=[])
>>> handle.close()
If your file has one and only one record (e.g. a GenBank file for a single
chromosome), then use the Bio.SeqIO.read().
This will check there are no extra unexpected records present
Bio.SeqIO.read()
>>> rec_iter = SeqIO.parse(open("1293613.gbk"), "genbank")
>>> rec = next(rec_iter)
>>> print(rec)
ID: U49845.1
Name: SCU49845
Description: Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
(AXL2) and Rev7p (REV7) genes, complete cds.
Number of features: 6
/sequence_version=1
/source=Saccharomyces cerevisiae (baker's yeast)
/taxonomy=['Eukaryota', 'Fungi', 'Ascomycota', 'Saccharomycotina',
'Saccharomycetes', 'Saccharomycetales', 'Saccharomycetaceae', 'Saccharomyces']
/keywords=['']
/references=[Reference(title='Cloning and sequence of REV7, a gene whose function
is required for DNA damage-induced mutagenesis in Saccharomyces cerevisiae', ...),
Reference(title='Selection of axial growth sites in yeast requires Axl2p, a novel
plasma membrane glycoprotein', ...), Reference(title='Direct Submission', ...)]
/accessions=['U49845']
/data_file_division=PLN
/date=21-JUN-1999
/organism=Saccharomyces cerevisiae
/gi=1293613
Seq('GATCCTCCATATACAACGGTATCTCCACCTCAGGTTTAGATCTCAACAACGGAA...ATC',
IUPACAmbiguousDNA())
Sequence files as lists
Sequence files as dictionaries
>>> from Bio import SeqIO
>>> handle = open("ncbi_gene.fasta")
>>> records = list(SeqIO.parse(handle, "fasta"))
>>> records[-1]
SeqRecord(seq=Seq('gggggggggggggggggatcactctctttcagtaacctcaac...c
cc', SingleLetterAlphabet()), id='A10421', name='A10421',
description='A10421 Synthetic nucleotide sequence having a human
IL-2 gene obtained from pILOT135-8. : Location:1..1000',
dbxrefs=[])
>>> handle = open("ncbi_gene.fasta")
>>> records = SeqIO.to_dict(SeqIO.parse(handle, "fasta"))
>>> handle.close()
>>> records.keys()
['M69013', 'M69012', 'AJ580952', 'J03005', 'J03004', 'L13858',
'L04510', 'M94539', 'M19650', 'A10421', 'AJ002990', 'A06663',
'A06662', 'S62035', 'M57424', 'M90035', 'A06280', 'X95521',
'X95520', 'M28269', 'S50017', 'L13857', 'AJ345013', 'M31328',
'AB038040', 'AB020593', 'M17219', 'DQ854814', 'M27543', 'X62025',
'M90043', 'L22075', 'X56614', 'M90027']
>>> seq_record = records['X95521']
>>> seq_record.description
'X95521 M.musculus mRNA for cyclic nucleotide phosphodiesterase : Location:1..1000'
Parsing sequences from the net
Parsing GenBank records from the net
Parsing SwissProt sequence from the net
Handles are not always from files
>>> from Bio import Entrez
>>> from Bio import SeqIO
>>> handle = Entrez.efetch(db="nucleotide", rettype="fasta", id="6273291")
>>> seq_record = SeqIO.read(handle, "fasta")
>>> handle.close()
>>> seq_record.description
>>> from Bio import ExPASy
>>> from Bio import SeqIO
>>> handle = ExPASy.get_sprot_raw("6273291")
>>> seq_record = SeqIO.read(handle, "swiss")
>>> handle.close()
>>> print(seq_record.id)
>>> print(seq_record.name)
>>> print(seq_record.description)
Indexing really large files
Bio.SeqIO.index() returns a dictionary without keeping
everything in memory.
It works fine even for millions of sequences
The main drawback is less flexibility: it is read-only
>>> from Bio import SeqIO
>>> recs_dict = SeqIO.index("ncbi_gene.fasta", "fasta")
>>> len(recs_dict)
34
>>> recs_dict.keys()
['M69013', 'M69012', 'AJ580952', 'J03005', 'J03004', 'L13858', 'L04510',
'M94539', 'M19650', 'A10421', 'AJ002990', 'A06663', 'A06662', 'S62035',
'M57424', 'M90035', 'A06280', 'X95521', 'X95520', 'M28269', 'S50017',
'L13857', 'AJ345013', 'M31328', 'AB038040', 'AB020593', 'M17219', 'DQ854814',
'M27543', 'X62025', 'M90043', 'L22075', 'X56614', 'M90027']
>>> print(recs_dict['M57424'])
ID: M57424
Name: M57424
Description: M57424 Human adenine nucleotide translocator-2 (ANT-2) gene,
complete cds. : Location:1..1000
Number of features: 0
Seq('gagctctggaatagaatacagtagaggcatcatgctcaaagagagtagcagatg...agc',
SingleLetterAlphabet())
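The general idea behind such an index can be sketched in a few lines: scan the file once, remember where each record starts, and read a record only when it is asked for. This is an illustration of the principle under simplifying assumptions, not Biopython's implementation. The function names and the two short records in the demo file are made up.

```python
import os
import tempfile


def build_offsets(path):
    """Map each FASTA identifier to the byte offset of its '>' line."""
    offsets = {}
    with open(path, "rb") as handle:
        while True:
            pos = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                fields = line[1:].split()
                if fields:
                    offsets[fields[0].decode()] = pos
    return offsets


def fetch(path, offsets, key):
    """Jump straight to one record and return its sequence as a string."""
    chunks = []
    with open(path, "rb") as handle:
        handle.seek(offsets[key])
        handle.readline()                     # skip the header line
        for line in handle:
            if line.startswith(b">"):         # next record begins
                break
            chunks.append(line.strip().decode())
    return "".join(chunks)


# Demo on a small temporary file with invented sequences.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">M57424 demo one\nGAGC\nTCTG\n>X95521 demo two\nACGT\n")
index = build_offsets(tmp.name)
seq_one = fetch(tmp.name, index, "M57424")
seq_two = fetch(tmp.name, index, "X95521")
os.remove(tmp.name)

print(sorted(index))
print(seq_one, seq_two)
```

Only the identifiers and offsets are held in memory, so the cost grows with the number of records rather than with the size of the sequences.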
Writing sequence files
Bio.SeqIO.write()
This function takes three arguments:
1. some SeqRecord objects
2. a handle to write to
3. a sequence format
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein
Rec1 = SeqRecord(Seq("ACCA…", generic_protein), id="1", description="")
Rec2 = SeqRecord(Seq("CDRFAA", generic_protein), id="2", description="")
Rec3 = SeqRecord(Seq("GRKLM", generic_protein), id="3", description="")
My_records = [Rec1, Rec2, Rec3]
from Bio import SeqIO
handle = open("MySeqs.fas", "w")
SeqIO.write(My_records, handle, "fasta")
handle.close()
Converting between sequence file formats
We can do file conversion by combining Bio.SeqIO.parse()
and Bio.SeqIO.write()
>>> from Bio import SeqIO
>>> In_handle = open("AP006852.gbk", "r")
>>> Out_handle = open("AP006852.fasta", "w")
>>> records = SeqIO.parse(In_handle, "genbank")
>>> count = SeqIO.write(records, Out_handle, "fasta")
>>> count
1
>>>
>>> In_handle.close()
>>> Out_handle.close()

More Related Content

What's hot

HADOOP 실제 구성 사례, Multi-Node 구성
HADOOP 실제 구성 사례, Multi-Node 구성HADOOP 실제 구성 사례, Multi-Node 구성
HADOOP 실제 구성 사례, Multi-Node 구성
Young Pyo
 
Vmlinux: anatomy of bzimage and how x86 64 processor is booted
Vmlinux: anatomy of bzimage and how x86 64 processor is bootedVmlinux: anatomy of bzimage and how x86 64 processor is booted
Vmlinux: anatomy of bzimage and how x86 64 processor is booted
Adrian Huang
 
Hadoop 20111215
Hadoop 20111215Hadoop 20111215
Hadoop 20111215
exsuns
 
MySQL Slow Query log Monitoring using Beats & ELK
MySQL Slow Query log Monitoring using Beats & ELKMySQL Slow Query log Monitoring using Beats & ELK
MySQL Slow Query log Monitoring using Beats & ELK
I Goo Lee
 
UNIX Basics and Cluster Computing
UNIX Basics and Cluster ComputingUNIX Basics and Cluster Computing
UNIX Basics and Cluster Computing
Bioinformatics and Computational Biosciences Branch
 
Hadoop installation
Hadoop installationHadoop installation
Hadoop installation
Ankit Desai
 
Hadoop single cluster installation
Hadoop single cluster installationHadoop single cluster installation
Hadoop single cluster installation
Minh Tran
 
Mastering PostgreSQL Administration
Mastering PostgreSQL AdministrationMastering PostgreSQL Administration
Mastering PostgreSQL Administration
EDB
 
eZ Publish cluster unleashed revisited
eZ Publish cluster unleashed revisitedeZ Publish cluster unleashed revisited
eZ Publish cluster unleashed revisited
Bertrand Dunogier
 
Linux Kernel - Virtual File System
Linux Kernel - Virtual File SystemLinux Kernel - Virtual File System
Linux Kernel - Virtual File System
Adrian Huang
 
Hydra - Getting Started
Hydra - Getting StartedHydra - Getting Started
Hydra - Getting Started
abramsm
 
Advanced Replication
Advanced ReplicationAdvanced Replication
Advanced Replication
MongoDB
 
Replica Sets (NYC NoSQL Meetup)
Replica Sets (NYC NoSQL Meetup)Replica Sets (NYC NoSQL Meetup)
Replica Sets (NYC NoSQL Meetup)
MongoDB
 
MongoDB Database Replication
MongoDB Database ReplicationMongoDB Database Replication
MongoDB Database Replication
Mehdi Valikhani
 
Automating Disaster Recovery PostgreSQL
Automating Disaster Recovery PostgreSQLAutomating Disaster Recovery PostgreSQL
Automating Disaster Recovery PostgreSQL
Nina Kaufman
 
Node.js and websockets intro
Node.js and websockets introNode.js and websockets intro
Node.js and websockets intro
kompozer
 
Process Address Space: The way to create virtual address (page table) of user...
Process Address Space: The way to create virtual address (page table) of user...Process Address Space: The way to create virtual address (page table) of user...
Process Address Space: The way to create virtual address (page table) of user...
Adrian Huang
 
SCALE 15x Minimizing PostgreSQL Major Version Upgrade Downtime
SCALE 15x Minimizing PostgreSQL Major Version Upgrade DowntimeSCALE 15x Minimizing PostgreSQL Major Version Upgrade Downtime
SCALE 15x Minimizing PostgreSQL Major Version Upgrade Downtime
Jeff Frost
 
Page cache in Linux kernel
Page cache in Linux kernelPage cache in Linux kernel
Page cache in Linux kernel
Adrian Huang
 
glance replicator
glance replicatorglance replicator
glance replicator
irix_jp
 

2015 bioinformatics bio_python

  • 1.
  • 4. GitHub: Hosted GIT • Largest open source git hosting site • Public and private options • User-centric rather than project-centric • http://github.ugent.be (use your UGent login and password) – Accept invitation from Bioinformatics-I-2015 URI: – https://github.ugent.be/Bioinformatics-I-2015/Python.git
  • 5. Control Structures
    if condition:
        statements
    [elif condition:
        statements] ...
    else:
        statements
    while condition:
        statements
    for var in sequence:
        statements
    break
    continue
  • 6. Lists • Flexible arrays, not Lisp-like linked lists • a = [99, "bottles of beer", ["on", "the", "wall"]] • Same operators as for strings • a+b, a*3, a[0], a[-1], a[1:], len(a) • Item and slice assignment • a[0] = 98 • a[1:2] = ["bottles", "of", "beer"] -> [98, "bottles", "of", "beer", ["on", "the", "wall"]] • del a[-1] # -> [98, "bottles", "of", "beer"]
  • 7. Dictionaries • Hash tables, "associative arrays" • d = {"duck": "eend", "water": "water"} • Lookup: • d["duck"] -> "eend" • d["back"] # raises KeyError exception • Delete, insert, overwrite: • del d["water"] # {"duck": "eend"} • d["back"] = "rug" # {"duck": "eend", "back": "rug"} • d["duck"] = "duik" # {"duck": "duik", "back": "rug"}
  • 8. Regex.py
    import re
    text = 'abbaaabbbbaaaaa'
    pattern = 'ab'
    for match in re.finditer(pattern, text):
        s = match.start()
        e = match.end()
        print('Found "%s" at %d:%d' % (text[s:e], s, e))
    # Second snippet: 'line' is a line of text read elsewhere
    m = re.search("^([A-Z]) ", line)
    if m:
        from_letter = m.groups()[0]
  • 9. Question 3. Swiss-Knife.py • Using a database as input: parse the entire Swiss-Prot collection – How many entries are there? – Average protein length (in aa and MW) – Relative frequency of amino acids • Compare to the ones used to construct the PAM scoring matrices of 1978 and 1991
  • 10. Question 3: Getting the database • Uniprot_sprot.dat.gz – 528 MB (save on your network drive H:) • Unzipped: 2.92 GB! • http://www.ebi.ac.uk/uniprot/download-center
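The download can be read without unzipping it first. The sketch below is one possible approach, not the course solution: it assumes the usual Swiss-Prot flat-file layout (an `ID` line opens a record, sequence lines follow the `SQ` line, `//` closes the record). The function names `iter_swissprot` and `summarize` are made up for illustration.

```python
import gzip


def iter_swissprot(handle):
    """Yield (entry_name, sequence) for each record in a Swiss-Prot flat file."""
    entry_name, seq_parts, in_sequence = None, [], False
    for line in handle:
        if line.startswith("ID "):
            # The entry name is the first token after the ID line code.
            entry_name = line.split()[1]
        elif line.startswith("SQ "):
            in_sequence = True
        elif line.startswith("//"):
            # End of record: emit it and reset the state.
            yield entry_name, "".join(seq_parts)
            entry_name, seq_parts, in_sequence = None, [], False
        elif in_sequence:
            # Sequence lines hold space-separated blocks of residues.
            seq_parts.append(line.replace(" ", "").strip())


def summarize(path):
    """Return (number of entries, average length in aa) for a .dat.gz file."""
    count = total = 0
    with gzip.open(path, "rt") as handle:
        for _name, seq in iter_swissprot(handle):
            count += 1
            total += len(seq)
    return count, (total / count if count else 0.0)
```

Because the records are streamed one at a time, the 2.92 GB of unzipped text never has to fit in memory at once.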
  • 11. Amino acid frequencies (Second step: Frequencies of Occurrence)
    AA  1978   1991
    L   0.085  0.091
    A   0.087  0.077
    G   0.089  0.074
    S   0.070  0.069
    V   0.065  0.066
    E   0.050  0.062
    T   0.058  0.059
    K   0.081  0.059
    I   0.037  0.053
    D   0.047  0.052
    R   0.041  0.051
    P   0.051  0.051
    N   0.040  0.043
    Q   0.038  0.041
    F   0.040  0.040
    Y   0.030  0.032
    M   0.015  0.024
    H   0.034  0.023
    C   0.033  0.020
    W   0.010  0.014
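A minimal sketch of the comparison step, assuming the sequences are already available as plain strings. The 1991 column is copied from the slide; the function names are hypothetical.

```python
from collections import Counter

# 1991 reference frequencies, copied from the slide's table.
FREQ_1991 = {
    "L": 0.091, "A": 0.077, "G": 0.074, "S": 0.069, "V": 0.066,
    "E": 0.062, "T": 0.059, "K": 0.059, "I": 0.053, "D": 0.052,
    "R": 0.051, "P": 0.051, "N": 0.043, "Q": 0.041, "F": 0.040,
    "Y": 0.032, "M": 0.024, "H": 0.023, "C": 0.020, "W": 0.014,
}


def relative_frequencies(sequences):
    """Return {residue: fraction of all residues} over an iterable of strings."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()} if total else {}


def compare_to_reference(observed, reference=FREQ_1991):
    """Return {residue: observed - reference} for each residue in the reference."""
    return {aa: observed.get(aa, 0.0) - ref for aa, ref in reference.items()}
```

A positive difference means the residue is more frequent in the parsed collection than in the 1991 data set.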
  • 12. Extra Questions • How many records have a sequence of length 260? • What are the first 20 residues of 143X_MAIZE? • What is the identifier for the record with the shortest sequence? Is there more than one record with that length? • What is the identifier for the record with the longest sequence? Is there more than one record with that length? • How many contain the subsequence "ARRA"? • How many contain the substring "KCIP-1" in the description?
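Most of these questions reduce to a single pass over the parsed records. The sketch below assumes each record is an `(identifier, description, sequence)` tuple and that there is at least one record. The tuple layout and the function name are illustrative and not part of the assignment.

```python
def answer_extra_questions(records):
    """Summarize a non-empty iterable of (identifier, description, sequence) tuples."""
    records = list(records)
    lengths = [len(seq) for _ident, _desc, seq in records]
    shortest, longest = min(lengths), max(lengths)
    return {
        "n_length_260": sum(1 for n in lengths if n == 260),
        # Lists rather than single values: several records may share a length.
        "shortest_ids": [r[0] for r in records if len(r[2]) == shortest],
        "longest_ids": [r[0] for r in records if len(r[2]) == longest],
        "n_with_ARRA": sum(1 for r in records if "ARRA" in r[2]),
        "n_with_KCIP1": sum(1 for r in records if "KCIP-1" in r[1]),
    }
```

Looking up the first 20 residues of a named entry is then just `sequence[:20]` on the matching record.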
  • 13. Perl / Python OO • A class is a package • An object is a reference to a data structure (usually a hash) in a class • A method is a subroutine in the class
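The description above is phrased in Perl terms; in Python the same three ideas look roughly like this. The `Protein` class and its fields are hypothetical, chosen only to illustrate the mapping.

```python
class Protein:
    """A class groups data with the functions that operate on it."""

    def __init__(self, identifier, sequence):
        # Instance attributes are stored in a dictionary (the "hash").
        self.identifier = identifier
        self.sequence = sequence

    def length(self):
        """A method is a function defined in the class; self is the object."""
        return len(self.sequence)


p = Protein("EXAMPLE_1", "MKTAYIAK")
print(p.length())
print(p.__dict__)  # the dictionary that backs the object
```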
  • 14.
  • 15.
  • 16. Biopython functionality and tools • The ability to parse bioinformatics files into Python utilizable data structures • Support the following formats: – Blast output – Clustalw – FASTA – PubMed and Medline – ExPASy files – SCOP – SwissProt – PDB • Files in the supported formats can be iterated over record by record or indexed and accessed via a dictionary interface
  • 17. Biopython functionality and tools • Code to deal with on-line bioinformatics destinations (NCBI, ExPASy) • Interface to common bioinformatics programs (Blast, ClustalW) • A sequence obj dealing with seqs, seq IDs, seq features • Tools for operations on sequences • Tools for dealing with alignments • Tools to manage protein structures • Tools to run applications
  • 18. Install Biopython • The Biopython module name is Bio • It must be downloaded and installed (http://biopython.org/wiki/Download) • You need to install numpy first • >>> import Bio
  • 19. Install Biopython
    pip is the preferred installer program. Starting with Python 3.4, it is included by default with the Python binary installers.
    pip3.5 install Biopython
    # pip3.5 install yahoo_finance
    from yahoo_finance import Share
    yahoo = Share('AAPL')
    print(yahoo.get_open())
  • 20. Run Install.py (is Biopython installed?)
    import platform
    # pip.get_installed_distributions() was removed in pip 10; importlib.metadata (Python 3.8+) is used here instead
    from importlib import metadata
    print("Python " + platform.python_version() + " installed packages:")
    installed_packages_list = sorted("%s==%s" % (d.metadata["Name"], d.version) for d in metadata.distributions())
    print(*installed_packages_list, sep="\n")
  • 21. BioPython
    • Make a histogram of the MW (in kDa) of all proteins in Swiss-Prot
    • Find the most basic and the most acidic protein in Swiss-Prot
    • What is the biological relevance of the results?
    From AAIndex:
    H ZIMJ680104
    D Isoelectric point (Zimmerman et al., 1968)
    R LIT:2004109b PMID:5700434
    A Zimmerman, J.M., Eliezer, N. and Simha, R.
    T The characterization of amino acid sequences in proteins by statistical methods
    J J. Theor. Biol. 21, 170-201 (1968)
    C KLEP840101 0.941 FAUJ880111 0.813 FINA910103 0.805
    I A/L   R/K   N/M   D/F   C/P   Q/S   E/T   G/W   H/Y   I/V
      6.00  10.76 5.41  2.77  5.05  5.65  3.22  5.97  7.59  6.02
      5.98  9.74  5.74  5.48  6.30  5.68  5.66  5.89  5.66  5.96
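The `I` block of the AAindex entry pairs two residues per column: the first letter of each pair takes its value from the first row, the second letter from the second row. A small sketch of turning that block into a lookup table follows. The function names are hypothetical, and `mean_index` is only a crude per-residue average, not a real isoelectric point calculation.

```python
AAINDEX_I_BLOCK = """\
I    A/L     R/K     N/M     D/F     C/P     Q/S     E/T     G/W     H/Y     I/V
    6.00   10.76    5.41    2.77    5.05    5.65    3.22    5.97    7.59    6.02
    5.98    9.74    5.74    5.48    6.30    5.68    5.66    5.89    5.66    5.96
"""


def parse_aaindex_values(block):
    """Map each amino acid letter to its index value."""
    header, first_row, second_row = block.strip().splitlines()
    pairs = header.split()[1:]  # drop the leading 'I' line code
    values = {}
    for pair, top, bottom in zip(pairs, first_row.split(), second_row.split()):
        first_aa, second_aa = pair.split("/")
        values[first_aa] = float(top)
        values[second_aa] = float(bottom)
    return values


def mean_index(sequence, values):
    """Average the index over the residues that have a value (crude score)."""
    scored = [values[aa] for aa in sequence if aa in values]
    return sum(scored) / len(scored) if scored else 0.0
```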
  • 22. • Introduction to Biopython – Sequence objects (I) – Sequence Record objects (I) – Protein structures (PDB module) (II) • Working with DNA and protein sequences – Transcription and Translation • Extracting information from biological resources – Parsing Swiss-Prot files (I) – Parsing BLAST output (I) – Accessing NCBI’s Entrez databases (II) – Parsing Medline records (II) • Running external applications (e.g. BLAST) locally and from a script – Running BLAST over the Internet – Running BLAST locally • Working with motifs – Parsing PROSITE records – Parsing PROSITE documentation records
  • 23. Introduction to Biopython (I) • Sequence objects • Sequence Record objects
  • 24. Sequence Object
    • Seq objects vs Python strings:
      – They have different methods
      – The Seq object has the attribute alphabet (biological meaning of Seq)
    >>> import Bio
    >>> from Bio.Seq import Seq
    >>> my_seq = Seq("AGTACACTGGT")
    >>> my_seq
    Seq('AGTACACTGGT', Alphabet())
    >>> print(my_seq)
    AGTACACTGGT
    >>> my_seq.alphabet
    Alphabet()
  • 25. The alphabet attribute • Alphabets are defined in the Bio.Alphabet module • We will use the IUPAC alphabets (http://www.chem.qmw.ac.uk/iupac) • Bio.Alphabet.IUPAC provides definitions for DNA, RNA and proteins + provides extension and customization of basic definitions: – IUPACProtein (IUPAC standard AA) – ExtendedIUPACProtein (+ selenocysteine, X, etc) – IUPACUnambiguousDNA (basic GATC letters) – IUPACAmbiguousDNA (+ ambiguity letters) – ExtendedIUPACDNA (+ modified bases) – IUPACUnambiguousRNA – IUPACAmbiguousRNA
  • 26. The alphabet attribute
    >>> import Bio
    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> my_seq = Seq("AGTACACTGGT", IUPAC.unambiguous_dna)
    >>> my_seq
    Seq('AGTACACTGGT', IUPACUnambiguousDNA())
    >>> my_seq.alphabet
    IUPACUnambiguousDNA()
    >>> my_seq = Seq("AGTACACTGGT", IUPAC.protein)
    >>> my_seq
    Seq('AGTACACTGGT', IUPACProtein())
    >>> my_seq.alphabet
    IUPACProtein()
  • 27. Sequences act like strings
    >>> my_seq = Seq("AGTAACCCTTAGCACTGGT", IUPAC.unambiguous_dna)
    >>> for index, letter in enumerate(my_seq):
    ...     print(index, letter)
    ...
    0 A
    1 G
    2 T
    3 A
    4 A
    5 C
    ...etc
    >>> print(len(my_seq))
    19
    >>> print(my_seq[0])
    A
    >>> print(my_seq[2:10])
    TAACCCTT
    >>> my_seq.count('A')
    5
    >>> 100*float(my_seq.count('C')+my_seq.count('G'))/len(my_seq)
    47.368421052631582
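Since sequences behave like strings here, the GC percentage at the end of the slide can also be reproduced on a plain Python string. This is a sketch without Biopython; `gc_content` is a hypothetical helper name.

```python
def gc_content(seq):
    """Return the percentage of G and C letters in a nucleotide string."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


print(gc_content("AGTAACCCTTAGCACTGGT"))
```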
  • 28. Turn Seq objects into strings
    You may need the plain sequence string (e.g. to write to a file or to insert into a database)
    >>> my_seq = Seq("AGTAACCCTTAGCACTGGT", IUPAC.unambiguous_dna)
    >>> str(my_seq)
    'AGTAACCCTTAGCACTGGT'
    >>> print(my_seq)
    AGTAACCCTTAGCACTGGT
    >>> fasta_format_string = ">DNA_id\n%s\n" % my_seq
    >>> print(fasta_format_string)
    >DNA_id
    AGTAACCCTTAGCACTGGT
    # Biopython 1.44 or older
    >>> my_seq.tostring()
    'AGTAACCCTTAGCACTGGT'
  • 29. Concatenating sequences
    You can't add sequences with incompatible alphabets (protein sequence and DNA sequence):
    >>> dna_seq = Seq("AGTAACCCTTAGCACTGGT", IUPAC.unambiguous_dna)
    >>> protein_seq = Seq("KSMKPPRTHLIMHWIIL", IUPAC.IUPACProtein())
    >>> protein_seq + dna_seq
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "/home/abarbato/biopython-1.53/build/lib.linux-x86_64-2.4/Bio/Seq.py", line 216, in __add__
        raise TypeError("Incompatable alphabets %s and %s"
    TypeError: Incompatable alphabets IUPACProtein() and IUPACUnambiguousDNA()
    BUT, if you give a generic alphabet to dna_seq and protein_seq:
    >>> from Bio.Alphabet import generic_alphabet
    >>> dna_seq.alphabet = generic_alphabet
    >>> protein_seq.alphabet = generic_alphabet
    >>> protein_seq + dna_seq
    Seq('KSMKPPRTHLIMHWIILAGTAACCCTTAGCACTGGT', Alphabet())
  • 30. Changing case
    Seq objects have upper() and lower() methods. Note that the IUPAC alphabets are for upper case only.
    >>> from Bio.Alphabet import generic_dna
    >>> dna_seq = Seq("acgtACGT", generic_dna)
    >>> dna_seq.upper()
    Seq('ACGTACGT', DNAAlphabet())
    >>> dna_seq.lower()
    Seq('acgtacgt', DNAAlphabet())
  • 31. Nucleotide sequences and (reverse) complements Seq objects have complement() and reverse_complement() methods. Note that these operations are not allowed with protein alphabets >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> dna_seq = Seq("AGTAACCCTTAGCACTGGT", IUPAC.unambiguous_dna) >>> dna_seq.complement() Seq('TCATTGGGAATCGTGACCA', IUPACUnambiguousDNA()) >>> dna_seq.reverse_complement() Seq('ACCAGTGCTAAGGGTTACT', IUPACUnambiguousDNA())
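To make clear what the complement and reverse complement compute, here is a small plain-Python sketch (Python 3). It only handles the unambiguous letters A, C, G and T; the function names are chosen for illustration and this is not Biopython's implementation.

```python
# Translation table mapping each unambiguous base to its partner.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(dna):
    """Return the base-by-base complement, read in the same direction."""
    return dna.translate(COMPLEMENT)


def reverse_complement(dna):
    """Return the complement read in the opposite direction."""
    return complement(dna)[::-1]


dna = "AGTAACCCTTAGCACTGGT"
print(complement(dna))
print(reverse_complement(dna))
```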
  • 33. Transcription >>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna) >>> template_dna = coding_dna.reverse_complement() >>> template_dna Seq('CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT', IUPACUnambiguousDNA()) >>> messenger_rna = coding_dna.transcribe() >>> messenger_rna Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA()) Note: all this does is switch T --> U and adjust the alphabet. The Seq object also includes a back-transcription method: >>> messenger_rna.back_transcribe() Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
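Since transcription here is only a T to U substitution on the coding strand, it can be sketched in two lines of plain Python (Python 3). The function names mirror the Seq methods but are ordinary string helpers written for illustration.

```python
def transcribe(coding_dna):
    """Turn a coding-strand DNA string into mRNA by replacing T with U."""
    return coding_dna.replace("T", "U").replace("t", "u")


def back_transcribe(rna):
    """Turn an RNA string back into coding-strand DNA (U becomes T)."""
    return rna.replace("U", "T").replace("u", "t")


coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
messenger_rna = transcribe(coding_dna)
print(messenger_rna)
print(back_transcribe(messenger_rna) == coding_dna)
```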
  • 34. Translation >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> messenger_rna = Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPAC.unambiguous_rna) >>> messenger_rna.translate() Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*')) >>> You can also translate directly from the coding strand DNA sequence >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna) >>> coding_dna.translate() Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*')) >>>
  • 35. Translation with different translation tables Translation tables available in Biopython are based on those from the NCBI. By default, translation will use the standard genetic code (NCBI table id 1). If you deal with mitochondrial sequences: >>> coding_dna.translate(table="Vertebrate Mitochondrial") Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*')) >>> coding_dna.translate(table=2) Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*')) If you want to translate the nucleotides up to the first in-frame stop, and then stop (as happens in nature): >>> coding_dna.translate(to_stop = True) Seq('MAIVMGR', IUPACProtein()) >>> coding_dna.translate(table=2,to_stop = True) Seq('MAIVMGRWKGAR', IUPACProtein())
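The effect of the table choice and of to_stop can be reproduced with a short plain-Python sketch (Python 3). The two dictionaries below are built by hand from NCBI tables 1 and 2 for the unambiguous DNA codons only; the names `STANDARD`, `VERTEBRATE_MITO` and `translate` are assumptions for illustration, and start codons are not treated specially.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for NCBI table 1, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

STANDARD = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# NCBI table 2 differs from table 1 in four codons.
VERTEBRATE_MITO = dict(STANDARD)
VERTEBRATE_MITO.update({"TGA": "W", "ATA": "M", "AGA": "*", "AGG": "*"})


def translate(dna, table=STANDARD, to_stop=False):
    """Translate a DNA string codon by codon; '*' marks a stop codon."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = table[dna[i:i + 3].upper()]
        if amino_acid == "*" and to_stop:
            break  # stop at the first in-frame stop codon
        protein.append(amino_acid)
    return "".join(protein)


coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(translate(coding_dna))
print(translate(coding_dna, table=VERTEBRATE_MITO))
print(translate(coding_dna, to_stop=True))
print(translate(coding_dna, table=VERTEBRATE_MITO, to_stop=True))
```

The TGA codon in the middle of the sequence is what separates the two results: it is a stop in the standard code and tryptophan (W) in the vertebrate mitochondrial code.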
  • 36. Translation tables Translation tables available in Biopython are based on those from the NCBI. By default, translation will use the standard genetic code (NCBI table id 1) http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi >>> from Bio.Data import CodonTable >>> standard_table = CodonTable.unambiguous_dna_by_name["Standard"] >>> mito_table = CodonTable.unambiguous_dna_by_name["Vertebrate Mitochondrial"] # Using the NCBI table ids: >>> standard_table = CodonTable.unambiguous_dna_by_id[1] >>> mito_table = CodonTable.unambiguous_dna_by_id[2]
  • 37. Translation tables
    >>> print standard_table
    Table 1 Standard, SGC0

      |  T      |  C      |  A      |  G      |
    --+---------+---------+---------+---------+--
    T | TTT F   | TCT S   | TAT Y   | TGT C   | T
    T | TTC F   | TCC S   | TAC Y   | TGC C   | C
    T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
    T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
    --+---------+---------+---------+---------+--
    C | CTT L   | CCT P   | CAT H   | CGT R   | T
    C | CTC L   | CCC P   | CAC H   | CGC R   | C
    C | CTA L   | CCA P   | CAA Q   | CGA R   | A
    C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
    --+---------+---------+---------+---------+--
    A | ATT I   | ACT T   | AAT N   | AGT S   | T
    A | ATC I   | ACC T   | AAC N   | AGC S   | C
    A | ATA I   | ACA T   | AAA K   | AGA R   | A
    A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
    --+---------+---------+---------+---------+--
    G | GTT V   | GCT A   | GAT D   | GGT G   | T
    G | GTC V   | GCC A   | GAC D   | GGC G   | C
    G | GTA V   | GCA A   | GAA E   | GGA G   | A
    G | GTG V   | GCG A   | GAG E   | GGG G   | G
    --+---------+---------+---------+---------+--
  • 38. Translation tables
    >>> print mito_table
    Table 2 Vertebrate Mitochondrial, SGC1

      |  T      |  C      |  A      |  G      |
    --+---------+---------+---------+---------+--
    T | TTT F   | TCT S   | TAT Y   | TGT C   | T
    T | TTC F   | TCC S   | TAC Y   | TGC C   | C
    T | TTA L   | TCA S   | TAA Stop| TGA W   | A
    T | TTG L   | TCG S   | TAG Stop| TGG W   | G
    --+---------+---------+---------+---------+--
    C | CTT L   | CCT P   | CAT H   | CGT R   | T
    C | CTC L   | CCC P   | CAC H   | CGC R   | C
    C | CTA L   | CCA P   | CAA Q   | CGA R   | A
    C | CTG L   | CCG P   | CAG Q   | CGG R   | G
    --+---------+---------+---------+---------+--
    A | ATT I(s)| ACT T   | AAT N   | AGT S   | T
    A | ATC I(s)| ACC T   | AAC N   | AGC S   | C
    A | ATA M(s)| ACA T   | AAA K   | AGA Stop| A
    A | ATG M(s)| ACG T   | AAG K   | AGG Stop| G
    --+---------+---------+---------+---------+--
    G | GTT V   | GCT A   | GAT D   | GGT G   | T
    G | GTC V   | GCC A   | GAC D   | GGC G   | C
    G | GTA V   | GCA A   | GAA E   | GGA G   | A
    G | GTG V(s)| GCG A   | GAG E   | GGG G   | G
    --+---------+---------+---------+---------+--
  • 39. MutableSeq objects >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> my_seq = Seq('CGCGCGGGTTTATGATGACCCAAATATAGAGGGCACAC', IUPAC.unambiguous_dna) >>> my_seq[5] = 'A' Traceback (most recent call last): File "<stdin>", line 1, in ? TypeError: object does not support item assignment >>> Like Python strings, Seq objects are immutable However, you can convert it into a mutable sequence (a MutableSeq object) >>> mutable_seq = my_seq.tomutable() >>> mutable_seq MutableSeq('CGCGCGGGTTTATGATGACCCAAATATAGAGGGCACAC', IUPACUnambiguousDNA())
  • 40. MutableSeq objects You can create a mutable object directly >>> from Bio.Seq import MutableSeq >>> from Bio.Alphabet import IUPAC >>> mutable_seq = MutableSeq('CGCGCGGGTTTATGATGACCCAAATATAGAGGGCACAC', IUPAC.unambiguous_dna) >>> mutable_seq[5] = 'A' >>> mutable_seq MutableSeq('CGCGCAGGTTTATGATGACCCAAATATAGAGGGCACAC', IUPACUnambiguousDNA()) A MutableSeq object can be easily converted into a read-only sequence: >>> new_seq = mutable_seq.toseq() >>> new_seq Seq('CGCGCAGGTTTATGATGACCCAAATATAGAGGGCACAC', IUPACUnambiguousDNA())
  • 41. Sequence Record objects The SeqRecord class is defined in the Bio.SeqRecord module This class allows higher level features such as identifiers and features to be associated with a sequence >>> from Bio.SeqRecord import SeqRecord >>> help(SeqRecord)
  • 42. class SeqRecord(__builtin__.object) A SeqRecord object holds a sequence and information about it. Main attributes: id - Identifier such as a locus tag (string) seq - The sequence itself (Seq object or similar) Additional attributes: name - Sequence name, e.g. gene name (string) description - Additional text (string) dbxrefs - List of db cross references (list of strings) features - Any (sub)features defined (list of SeqFeature objects) annotations - Further information about the whole sequence (dictionary) Most entries are strings, or lists of strings. letter_annotations - Per letter/symbol annotation (restricted dictionary). This holds Python sequences (lists, strings or tuples) whose length matches that of the sequence. A typical use would be to hold a list of integers representing sequencing quality scores, or a string representing the secondary structure.
  • 43. You will typically use Bio.SeqIO to read in sequences from files as SeqRecord objects. However, you may want to create your own SeqRecord objects directly: >>> from Bio.Seq import Seq >>> from Bio.SeqRecord import SeqRecord >>> TMP = Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF') >>> TMP_r = SeqRecord(TMP) >>> TMP_r.id '<unknown id>' >>> TMP_r.id = 'YP_025292.1' >>> TMP_r.description = 'toxic membrane protein' >>> print TMP_r ID: YP_025292.1 Name: <unknown name> Description: toxic membrane protein Number of features: 0 Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF', Alphabet()) >>> TMP_r.seq Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF', Alphabet())
  • 44. You can also create your own SeqRecord objects as follows: >>> from Bio.Seq import Seq >>> from Bio.SeqRecord import SeqRecord >>> from Bio.Alphabet import IUPAC >>> record = SeqRecord(Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF', IUPAC.protein), id='YP_025292.1', name='HokC', description='toxic membrane protein') >>> record SeqRecord(seq=Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF', IUPACProtein()), id='YP_025292.1', name='HokC', description='toxic membrane protein', dbxrefs=[]) >>> print record ID: YP_025292.1 Name: HokC Description: toxic membrane protein Number of features: 0 Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF', IUPACProtein())
  • 45. The format() method It returns a string containing your record formatted using one of the output file formats supported by Bio.SeqIO >>> from Bio.Seq import Seq >>> from Bio.SeqRecord import SeqRecord >>> from Bio.Alphabet import generic_protein >>> rec = SeqRecord(Seq("MGSNKSKPKDASQRRRSLEPSENVHGAGGAFPASQTPSKPASADGHRGPSAAFVPPAAEPKLFGGFNSSDTVTSPQRAGALAGGVTTFVALYDYESRTETDLSFKKGERLQIVNNTRKVDVREGDWWLAHSLSTGQTGYIPS", generic_protein), id = "P05480", description = "SRC_MOUSE Neuronal proto-oncogene tyrosine-protein kinase Src: MY TEST") >>> print rec.format("fasta") >P05480 SRC_MOUSE Neuronal proto-oncogene tyrosine-protein kinase Src: MY TEST MGSNKSKPKDASQRRRSLEPSENVHGAGGAFPASQTPSKPASADGHRGPSAAFVPPAAEP KLFGGFNSSDTVTSPQRAGALAGGVTTFVALYDYESRTETDLSFKKGERLQIVNNTRKVD VREGDWWLAHSLSTGQTGYIPS
  • 46. INPUT FILE SCRIPT.py OUTPUT FILE Seq1 “ACTGGGAGCTAGC” Seq2 “TTGATCGATCGATCG” Seq3 “GTGTAGCTGCT” F = open(“input.txt”) for line in F: <parse line> <get seq id> <get description> <get sequence> <get other info> from Bio.Seq import Seq from Bio.SeqRecord import SeqRecord from Bio.Alphabet import generic_protein rec = SeqRecord(Seq(<sequence>, alphabet),id = <seq_id>, description = <description>) Format_rec = rec.format(“fasta”) Out.write(Format_rec) >P05480 SRC_MOUSE Neuronal proto-oncogene tyrosine-protein kinase Src: MY TEST MGSNKSKPKDASQRRRSLEPSENVHGAGGAFPASQTPSKPASADGHRGPSAAFVPPAAEP KLFGGFNSSDTVTSPQRAGALAGGVTTFVALYDYESRTETDLSFKKGERLQIVNNTRKVD
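The input-to-script-to-output flow on this slide can be sketched end to end in plain Python (Python 3). The input layout (one record per line, identifier and sequence separated by whitespace) is an assumption read off the diagram, and `format_fasta` and `convert` are hypothetical helper names; the 60-letter line width matches the FASTA output shown on the slides.

```python
import io


def format_fasta(seq_id, sequence, description="", width=60):
    """Return one FASTA record as a string, wrapping the sequence lines."""
    header = ">" + " ".join(part for part in (seq_id, description) if part)
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + lines) + "\n"


def convert(in_handle, out_handle):
    """Read 'id sequence' lines and write them out as FASTA records."""
    count = 0
    for line in in_handle:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        seq_id, sequence = fields
        out_handle.write(format_fasta(seq_id, sequence))
        count += 1
    return count


# In-memory stand-ins for the input and output files of the slide.
source = io.StringIO(
    "Seq1 ACTGGGAGCTAGC\nSeq2 TTGATCGATCGATCG\nSeq3 GTGTAGCTGCT\n"
)
target = io.StringIO()
n_written = convert(source, target)
fasta_text = target.getvalue()
print(fasta_text)
```

With real files, the two StringIO objects would be replaced by handles from open("input.txt") and open("output.fasta", "w").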
  • 47. Bio.SeqIO Extracting information from biological resources: parsing Swiss-Prot files, PDB files, Ensembl records, BLAST output files, etc. • Sequence I/O – Parsing or Reading Sequences – Writing Sequence Files A simple interface for working with assorted file formats in a uniform way >>> from Bio import SeqIO >>> help(SeqIO)
  • 48. Bio.SeqIO.parse() Reads in sequence data as SeqRecord objects. It expects two arguments: • A handle to read the data from. It can be: – a file opened for reading – the output from a command line program – data downloaded from the internet • A lower case string specifying the sequence format (see http://biopython.org/wiki/SeqIO for a full listing of supported formats). The object returned by Bio.SeqIO.parse() is an iterator which returns SeqRecord objects
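The idea of a parser that takes a handle and returns an iterator of records can be illustrated with a small generator in plain Python (Python 3). This is a simplified stand-in for the FASTA case only, not Biopython's parser; `Record` and `parse_fasta` are names invented for the sketch.

```python
import io
from collections import namedtuple

# A lightweight stand-in for SeqRecord: id, full header line, sequence.
Record = namedtuple("Record", "id description seq")


def _make_record(header, chunks):
    fields = header.split(None, 1)
    seq_id = fields[0] if fields else ""
    return Record(seq_id, header, "".join(chunks))


def parse_fasta(handle):
    """Yield one Record per FASTA entry, reading the handle lazily."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)


handle = io.StringIO(
    ">sp|P05480|SRC_MOUSE example entry\nMGSNK\nSKPKD\n>second\nACGT\n"
)
for rec in parse_fasta(handle):
    print(rec.id, len(rec.seq))
```

Because the function is a generator, only one record is held in memory at a time, which is the same reason the iterator returned by the real parser scales to large files.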
  • 49. >>> from Bio import SeqIO >>> handle = open("P05480.fasta") >>> for seq_rec in SeqIO.parse(handle, "fasta"): ... print seq_rec.id ... print repr(seq_rec.seq) ... print len(seq_rec) ... sp|P05480|SRC_MOUSE Seq('MGSNKSKPKDASQRRRSLERGPSA...ENL', SingleLetterAlphabet()) 541 >>> handle.close() >>> handle = open("1293613.gbk") >>> for seq_rec in SeqIO.parse(handle, "genbank"): ... print seq_rec.id ... print repr(seq_rec.seq) ... print len(seq_rec) ... U49845.1 Seq('GATCCTCCATATACAACGGTACGGAA...ATC', IUPACAmbiguousDNA()) 5028 >>> handle.close()
  • 50. >>> from Bio import SeqIO >>> handle = open("AP006852.gbk") >>> for seq_rec in SeqIO.parse(handle, "genbank"): ... print seq_rec.id ... print repr(seq_rec.seq) ... print len(seq_rec) ... AP006852.1 Seq('CCACTGTCCAATACCCCCAACAGGAAT...TGT', IUPACAmbiguousDNA()) 949626 >>> >>>handle = open("AP006852.gbk") >>>identifiers=[seq_rec.id for seq_rec in SeqIO.parse(handle,"genbank")] >>>handle.close() >>>identifiers ['AP006852.1'] >>> Candida albicans genomic DNA, chromosome 7, complete sequence Using list comprehension:
  • 51. Here we do it using the sprot_prot.fasta file >>> from Bio import SeqIO >>> handle = open("sprot_prot.fasta") >>> ids = [seq_rec.id for seq_rec in SeqIO.parse(handle,"fasta")] >>> ids ['sp|P24928|RPB1_HUMAN', 'sp|Q9NVU0|RPC5_HUMAN', 'sp|Q9BUI4|RPC3_HUMAN', 'sp|Q9BUI4|RPC3_HUMAN', 'sp|Q9NW08|RPC2_HUMAN', 'sp|Q9H1D9|RPC6_HUMAN', 'sp|P19387|RPB3_HUMAN', 'sp|O14802|RPC1_HUMAN', 'sp|P52435|RPB11_HUMAN', 'sp|O15318|RPC7_HUMAN', 'sp|P62487|RPB7_HUMAN', 'sp|O15514|RPB4_HUMAN', 'sp|Q9GZS1|RPA49_HUMAN', 'sp|P36954|RPB9_HUMAN', 'sp|Q9Y535|RPC8_HUMAN', 'sp|O95602|RPA1_HUMAN', 'sp|Q9Y2Y1|RPC10_HUMAN', 'sp|Q9H9Y6|RPA2_HUMAN', 'sp|P78527|PRKDC_HUMAN', 'sp|O15160|RPAC1_HUMAN',…, 'sp|Q9BWH6|RPAP1_HUMAN']
  • 52. Iterating over the records in a sequence file Instead of using a for loop, you can also use the next() method of an iterator to step through the entries >>> handle = open("sprot_prot.fasta") >>> rec_iter = SeqIO.parse(handle, "fasta") >>> rec_1 = rec_iter.next() >>> rec_1 SeqRecord(seq=Seq('MHGGGPPSGDSACPLRTIKRVQFGVLSPDELKRMSVTEGGIKYPET TEGGRPKL...EEN', SingleLetterAlphabet()), id='sp|P24928|RPB1_HUMAN', name='sp|P24928|RPB1_HUMAN', description='sp|P24928|RPB1_HUMAN DNA-directed RNA polymerase II subunit RPB1 OS=Homo sapiens GN=POLR2A PE=1 SV=2', dbxrefs=[]) >>> rec_2 = rec_iter.next() >>> rec_2 SeqRecord(seq=Seq('MANEEDDPVVQEIDVYLAKSLAEKLYLFQYPVRPASMTYDDIPHLS AKIKPKQQ...VQS', SingleLetterAlphabet()), id='sp|Q9NVU0|RPC5_HUMAN', name='sp|Q9NVU0|RPC5_HUMAN', description='sp|Q9NVU0|RPC5_HUMAN DNA-directed RNA polymerase III subunit RPC5 OS=Homo sapiens GN=POLR3E PE=1 SV=1', dbxrefs=[]) >>> handle.close()
  • 53. If your file has one and only one record (e.g. a GenBank file for a single chromosome), then use the Bio.SeqIO.read(). This will check there are no extra unexpected records present Bio.SeqIO.read() >>> rec_iter = SeqIO.parse(open("1293613.gbk"), "genbank") >>> rec = rec_iter.next() >>> print rec ID: U49845.1 Name: SCU49845 Description: Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p (AXL2) and Rev7p (REV7) genes, complete cds. Number of features: 6 /sequence_version=1 /source=Saccharomyces cerevisiae (baker's yeast) /taxonomy=['Eukaryota', 'Fungi', 'Ascomycota', 'Saccharomycotina', 'Saccharomycetes', 'Saccharomycetales', 'Saccharomycetaceae', 'Saccharomyces'] /keywords=[''] /references=[Reference(title='Cloning and sequence of REV7, a gene whose function is required for DNA damage-induced mutagenesis in Saccharomyces cerevisiae', ...), Reference(title='Selection of axial growth sites in yeast requires Axl2p, a novel plasma membrane glycoprotein', ...), Reference(title='Direct Submission', ...)] /accessions=['U49845'] /data_file_division=PLN /date=21-JUN-1999 /organism=Saccharomyces cerevisiae /gi=1293613 Seq('GATCCTCCATATACAACGGTATCTCCACCTCAGGTTTAGATCTCAACAACGGAA...ATC', IUPACAmbiguousDNA())
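The "one and only one record" check described for Bio.SeqIO.read() can be expressed on top of any record iterator. The sketch below is plain Python (Python 3) with a hypothetical helper name, `read_single`; it raises ValueError for zero records or for more than one, which is also the exception type Bio.SeqIO.read() uses.

```python
def read_single(records):
    """Return the only item of an iterable, or raise ValueError."""
    iterator = iter(records)
    try:
        first = next(iterator)
    except StopIteration:
        raise ValueError("No records found")
    try:
        next(iterator)
    except StopIteration:
        return first  # exactly one record, as expected
    raise ValueError("More than one record found")


print(read_single(["U49845.1"]))
```

Compared with taking the first item via next(), as in the session on this slide, the explicit check means an unexpected second record is reported instead of being silently ignored.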
  • 54. Sequence files as lists >>> from Bio import SeqIO >>> handle = open("ncbi_gene.fasta") >>> records = list(SeqIO.parse(handle, "fasta")) >>> handle.close() >>> records[-1] SeqRecord(seq=Seq('gggggggggggggggggatcactctctttcagtaacctcaac...ccc', SingleLetterAlphabet()), id='A10421', name='A10421', description='A10421 Synthetic nucleotide sequence having a human IL-2 gene obtained from pILOT135-8. : Location:1..1000', dbxrefs=[]) Sequence files as dictionaries >>> handle = open("ncbi_gene.fasta") >>> records = SeqIO.to_dict(SeqIO.parse(handle, "fasta")) >>> handle.close() >>> records.keys() ['M69013', 'M69012', 'AJ580952', 'J03005', 'J03004', 'L13858', 'L04510', 'M94539', 'M19650', 'A10421', 'AJ002990', 'A06663', 'A06662', 'S62035', 'M57424', 'M90035', 'A06280', 'X95521', 'X95520', 'M28269', 'S50017', 'L13857', 'AJ345013', 'M31328', 'AB038040', 'AB020593', 'M17219', 'DQ854814', 'M27543', 'X62025', 'M90043', 'L22075', 'X56614', 'M90027'] >>> seq_record = records['X95521'] >>> seq_record.description 'X95521 M.musculus mRNA for cyclic nucleotide phosphodiesterase : Location:1..1000'
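Turning an iterator of records into a dictionary keyed by identifier takes only a few lines. This plain-Python sketch (Python 3) uses (id, sequence) tuples as stand-ins for SeqRecord objects and rejects duplicate keys, as Bio.SeqIO.to_dict() does; the `key_function` default shown here is an assumption tied to that tuple layout.

```python
def to_dict(records, key_function=lambda rec: rec[0]):
    """Build {key: record}, refusing to overwrite a repeated key."""
    result = {}
    for rec in records:
        key = key_function(rec)
        if key in result:
            raise ValueError("Duplicate key '%s'" % key)
        result[key] = rec
    return result


records = [("X95521", "ACGT"), ("M57424", "GAGC")]
by_id = to_dict(records)
print(sorted(by_id))
print(by_id["X95521"][1])
```

Everything is held in memory at once, so this approach suits small and medium files; the list approach has the same limitation.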
  • 55. Parsing sequences from the net Handles are not always from files Parsing GenBank records from the net >>> from Bio import Entrez >>> from Bio import SeqIO >>> handle = Entrez.efetch(db="nucleotide",rettype="fasta",id="6273291") >>> seq_record = SeqIO.read(handle,"fasta") >>> handle.close() >>> seq_record.description Parsing SwissProt sequence from the net >>> from Bio import ExPASy >>> from Bio import SeqIO >>> handle = ExPASy.get_sprot_raw("6273291") >>> seq_record = SeqIO.read(handle,"swiss") >>> handle.close() >>> print seq_record.id >>> print seq_record.name >>> print seq_record.description
  • 56. Indexing really large files Bio.SeqIO.index() returns a dictionary without keeping everything in memory. It works fine even for millions of sequences. The main drawback is less flexibility: it is read-only >>> from Bio import SeqIO >>> recs_dict = SeqIO.index("ncbi_gene.fasta", "fasta") >>> len(recs_dict) 34 >>> recs_dict.keys() ['M69013', 'M69012', 'AJ580952', 'J03005', 'J03004', 'L13858', 'L04510', 'M94539', 'M19650', 'A10421', 'AJ002990', 'A06663', 'A06662', 'S62035', 'M57424', 'M90035', 'A06280', 'X95521', 'X95520', 'M28269', 'S50017', 'L13857', 'AJ345013', 'M31328', 'AB038040', 'AB020593', 'M17219', 'DQ854814', 'M27543', 'X62025', 'M90043', 'L22075', 'X56614', 'M90027'] >>> print recs_dict['M57424'] ID: M57424 Name: M57424 Description: M57424 Human adenine nucleotide translocator-2 (ANT-2) gene, complete cds. : Location:1..1000 Number of features: 0 Seq('gagctctggaatagaatacagtagaggcatcatgctcaaagagagtagcagatg...agc', SingleLetterAlphabet())
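One way to get dictionary-style access without loading every sequence is to remember only where each record starts in the file and to read it on demand. The plain-Python sketch below (Python 3) illustrates that idea for FASTA files; the class name `FastaIndex` is hypothetical and this is not a description of how Bio.SeqIO.index() is implemented internally.

```python
import os
import tempfile


class FastaIndex:
    """Read-only mapping from record id to (header, sequence)."""

    def __init__(self, path):
        self._path = path
        self._offsets = {}  # record id -> byte offset of its '>' line
        with open(path, "rb") as handle:
            while True:
                offset = handle.tell()
                line = handle.readline()
                if not line:
                    break
                if line.startswith(b">"):
                    fields = line[1:].split()
                    seq_id = fields[0].decode("ascii") if fields else ""
                    self._offsets[seq_id] = offset

    def __len__(self):
        return len(self._offsets)

    def keys(self):
        return list(self._offsets)

    def __getitem__(self, seq_id):
        # Jump straight to the record and read until the next header.
        with open(self._path, "rb") as handle:
            handle.seek(self._offsets[seq_id])
            header = handle.readline().decode("ascii").strip()
            chunks = []
            for line in handle:
                if line.startswith(b">"):
                    break
                chunks.append(line.decode("ascii").strip())
        return header[1:], "".join(chunks)


# Small demonstration on a temporary file with made-up sequences.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">M57424 example one\nGAGC\nTCTG\n>X95521 example two\nACGT\n")
    demo_path = tmp.name

index = FastaIndex(demo_path)
n_records = len(index)
first = index["M57424"]
second = index["X95521"]
os.remove(demo_path)
print(n_records, first, second)
```

Only the identifiers and offsets stay in memory, which is why this style of index stays small even for very large files, at the cost of being read-only.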
  • 57. Writing sequence files Bio.SeqIO.write() This function takes three arguments: 1. some SeqRecord objects 2. a handle to write to 3. a sequence format from Bio.Seq import Seq from Bio.SeqRecord import SeqRecord from Bio.Alphabet import generic_protein Rec1 = SeqRecord(Seq("ACCA…",generic_protein), id="1", description="") Rec2 = SeqRecord(Seq("CDRFAA",generic_protein), id="2", description="") Rec3 = SeqRecord(Seq("GRKLM",generic_protein), id="3", description="") My_records = [Rec1, Rec2, Rec3] from Bio import SeqIO handle = open("MySeqs.fas","w") SeqIO.write(My_records, handle, "fasta") handle.close()
  • 58. Converting between sequence file formats We can do file conversion by combining Bio.SeqIO.parse() and Bio.SeqIO.write() >>> from Bio import SeqIO >>> In_handle = open("AP006852.gbk", "r") >>> Out_handle = open("AP006852.fasta", "w") >>> records = SeqIO.parse(In_handle, "genbank") >>> count = SeqIO.write(records, Out_handle, "fasta") >>> count 1 >>> In_handle.close() >>> Out_handle.close()