FBW
20-10-2015
Wim Van Criekinge
Bioinformatics.be
Overview
What is Python ?
Why Python 4 Bioinformatics ?
How to Python
IDE: Eclipse & PyDev / Athena
Code Sharing: Git(hub)
Strings
Regular expressions
Python
• Programming languages are overrated
– If you are going into bioinformatics you will probably learn/need several
– If you know one, you know 90% of a second
• Choice does matter, but it matters far less than people think it does
• Why Python?
– Lets you start useful programs asap
– Built-in libraries, plus add-on packages such as BioPython
– Free, most platforms, widely (scientifically) used
• Versus Perl?
– Incredibly similar
– Consistent syntax, indentation
Version 2.7 and 3.4 on athena.ugent.be
Where is the workspace ?
GitHub: Hosted GIT
• Largest open source git hosting site
• Public and private options
• User-centric rather than project-centric
• http://github.ugent.be (use your UGent login and password)
– Accept invitation from Bioinformatics-I-2015
URI:
– https://github.ugent.be/Bioinformatics-I-2015/Python.git
Run Install.py (is BioPython installed ?)
import platform
# pip.get_installed_distributions() was removed in pip 10;
# importlib.metadata (Python 3.8+) gives the same information
from importlib import metadata

# list every installed package as name==version, one per line
print("Python " + platform.python_version() + " installed packages:")
installed_packages_list = sorted(
    "%s==%s" % (dist.metadata["Name"], dist.version)
    for dist in metadata.distributions())
print(*installed_packages_list, sep="\n")
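To answer only the question in the slide title, a lighter check is possible. The sketch below assumes BioPython is importable under its usual module name `Bio`; the helper name `is_installed` is made up for this example.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the top-level module can be located (it is not imported)."""
    return importlib.util.find_spec(module_name) is not None


# "re" ships with Python, so it should always be reported as installed
for name in ("Bio", "re"):
    status = "installed" if is_installed(name) else "missing"
    print(name, status)
```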
Control Structures
if condition:
    statements
[elif condition:
    statements] ...
else:
    statements
while condition:
    statements
for var in sequence:
    statements
break
continue
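A small worked example of the structures above, applied to a made-up DNA string (the sequence and variable names are illustrative only):

```python
sequence = "ATGCNNATTAGC"

# for + if/elif/else: classify every base
counts = {"purine": 0, "pyrimidine": 0, "unknown": 0}
for base in sequence:
    if base in "AG":
        counts["purine"] += 1
    elif base in "CT":
        counts["pyrimidine"] += 1
    else:
        counts["unknown"] += 1

# while + break: stop at the first unknown base
position = 0
while position < len(sequence):
    if sequence[position] == "N":
        break
    position += 1

# for + continue: skip unknown bases, keep the rest
clean = ""
for base in sequence:
    if base == "N":
        continue
    clean += base

print(counts)
print(position)
print(clean)
```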
Lists
• Flexible arrays, not Lisp-like linked lists
• a = [99, "bottles of beer", ["on", "the", "wall"]]
• Same operators as for strings
• a+b, a*3, a[0], a[-1], a[1:], len(a)
• Item and slice assignment
• a[0] = 98
• a[1:2] = ["bottles", "of", "beer"]
-> [98, "bottles", "of", "beer", ["on", "the", "wall"]]
• del a[-1] # -> [98, "bottles", "of", "beer"]
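The same list operations, sketched on a made-up list of codons so each step can be run and inspected:

```python
codons = ["ATG", "GCC", "TAA"]

first, last, size = codons[0], codons[-1], len(codons)

# a + b builds a new list; the original is left unchanged
longer = codons + ["GGA"]

# item assignment, then slice assignment replacing one item with two
codons[1] = "GCT"
codons[1:2] = ["GCT", "AAA"]

# del removes an item in place
del codons[-1]

print(first, last, size)
print(longer)
print(codons)
```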
Dictionaries
• Hash tables, "associative arrays"
• d = {"duck": "eend", "water": "water"}
• Lookup:
• d["duck"] -> "eend"
• d["back"] # raises KeyError exception
• Delete, insert, overwrite:
• del d["water"] # {"duck": "eend"}
• d["back"] = "rug" # {"duck": "eend", "back": "rug"}
• d["duck"] = "duik" # {"duck": "duik", "back": "rug"}
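A short sketch of how a missing key can be handled, and of a dictionary used as a counter. The base string is made up for illustration.

```python
d = {"duck": "eend", "water": "water"}

# membership test and a lookup with a default avoid the KeyError
has_back = "back" in d
fallback = d.get("back", "?")

# catching the exception is the other common option
try:
    d["back"]
    raised = False
except KeyError:
    raised = True

# counting with a dictionary: one key per base
base_counts = {}
for base in "ATGCAT":
    base_counts[base] = base_counts.get(base, 0) + 1

print(has_back, fallback, raised)
print(base_counts)
```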
Regex.py
import re

text = 'abbaaabbbbaaaaa'
pattern = 'ab'
# report every non-overlapping match of pattern in text
for match in re.finditer(pattern, text):
    s = match.start()
    e = match.end()
    print('Found "%s" at %d:%d' % (text[s:e], s, e))
Find the answer in ultimate-sequence.txt ?
>ultimate-sequence
ACTCGTTATGATATTTTTTTTGAACGTGAAAATACT
TTTCGTGCTATGGAAGGACTCGTTATCGTGAAGT
TGAACGTTCTGAATGTATGCCTCTTGAAATGGA
AAATACTCATTGTTTATCTGAAATTTGAATGGGA
ATTTTATCTACAATGTTTTATTCTTACAGAACAT
TAAATTGTGTTATGTTTCATTTCACATTTTAGTA
GTTTTTTCAGTGAAAGCTTGAAAACCACCAAGA
AGAAAAGCTGGTATGCGTAGCTATGTATATATA
AAATTAGATTTTCCACAAAAAATGATCTGATAA
ACCTTCTCTGTTGGCTCCAAGTATAAGTACGAAA
AGAAATACGTTCCCAAGAATTAGCTTCATGAGT
AAGAAGAAAAGCTGGTATGCGTAGCTATGTATA
TATAAAATTAGATTTTCCACAAAAAATGATCTG
ATAA
Question 2
AA1 = {
'UUU':'F','UUC':'F','UUA':'L','UUG':'L','UCU':'S','UCC':'S','UCA':'S','UCG':'S',
'UAU':'Y','UAC':'Y','UAA':'*','UAG':'*','UGU':'C','UGC':'C','UGA':'*','UGG':'W',
'CUU':'L','CUC':'L','CUA':'L','CUG':'L','CCU':'P','CCC':'P','CCA':'P','CCG':'P',
'CAU':'H','CAC':'H','CAA':'Q','CAG':'Q','CGU':'R','CGC':'R','CGA':'R','CGG':'R',
'AUU':'I','AUC':'I','AUA':'I','AUG':'M','ACU':'T','ACC':'T','ACA':'T','ACG':'T',
'AAU':'N','AAC':'N','AAA':'K','AAG':'K','AGU':'S','AGC':'S','AGA':'R','AGG':'R',
'GUU':'V','GUC':'V','GUA':'V','GUG':'V','GCU':'A','GCC':'A','GCA':'A','GCG':'A',
'GAU':'D','GAC':'D','GAA':'E','GAG':'E','GGU':'G','GGC':'G','GGA':'G','GGG':'G',
}
Hint: Use Dictionaries
Hint 2: Translations
Python way:
tab = str.maketrans("ACGU","UGCA")
sequence = sequence.translate(tab)[::-1]
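The two hints can be combined roughly as follows. This is only a sketch on a made-up nine-base sequence: `CODONS` is a three-entry subset of the codon dictionary, the function names are invented here, and unknown codons are mapped to "X" as an assumption.

```python
# Small subset of the codon table, enough for this toy sequence only
CODONS = {"AUG": "M", "GCC": "A", "UAA": "*"}


def reverse_complement(rna):
    """Complement every base with a translation table, then reverse."""
    tab = str.maketrans("ACGU", "UGCA")
    return rna.translate(tab)[::-1]


def translate(rna, table):
    """Read the sequence codon by codon and look each codon up in the table."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        protein.append(table.get(rna[i:i + 3], "X"))
    return "".join(protein)


dna = "ATGGCCTAA"
rna = dna.replace("T", "U")  # the codon table is written with U, not T

print(rna)
print(translate(rna, CODONS))
print(reverse_complement(rna))
```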
Reading Files
name = open("filename")
– opens the given file for reading, and returns a file object
name.read() - file's entire contents as a string
name.readline() - next line from file as a string
name.readlines() - file's contents as a list of lines
– the lines from a file object can also be read using a for loop
>>> f = open("hours.txt")
>>> f.read()
'123 Susan 12.5 8.1 7.6 3.2\n456 Brad 4.0 11.6 6.5 2.7 12\n789 Jenn 8.0 8.0 8.0 8.0 7.5\n'
File Input Template
• A template for reading files in Python:
name = open("filename")
for line in name:
    statements
>>> input = open("hours.txt")
>>> for line in input:
...     print(line.strip()) # strip() removes \n
123 Susan 12.5 8.1 7.6 3.2
456 Brad 4.0 11.6 6.5 2.7 12
789 Jenn 8.0 8.0 8.0 8.0 7.5
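The same template can be applied to a FASTA file. The sketch below first writes a two-line fragment to a temporary file so that it runs on its own; the file name and the shortened sequence are made up, and the `with` statement is used so the file is closed automatically.

```python
import os
import tempfile

# Write a tiny FASTA file to read back (illustrative content only)
path = os.path.join(tempfile.mkdtemp(), "example.fasta")
with open(path, "w") as handle:
    handle.write(">ultimate-sequence\nACTCGTTATG\nATATTTTTTT\n")

header = None
parts = []
with open(path) as handle:
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            header = line[1:]      # description line, without the ">"
        elif line:
            parts.append(line)     # sequence lines are concatenated

sequence = "".join(parts)
print(header, len(sequence))
```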
Writing Files
name = open("filename", "w")
name = open("filename", "a")
– opens file for write (deletes previous contents), or
– opens file for append (new data goes after previous data)
name.write(str) - writes the given string to the file
name.close() - saves file once writing is done
>>> out = open("output.txt", "w")
>>> out.write("Hello, world!\n")
>>> out.write("How are you?")
>>> out.close()
>>> open("output.txt").read()
'Hello, world!\nHow are you?'
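A variant of the session above using `with`, which closes the file even if an error occurs, and showing the difference between "w" and "a". The temporary path is an assumption made so the sketch does not overwrite anything.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "output.txt")

# "w" creates the file (or empties an existing one)
with open(path, "w") as out:
    out.write("Hello, world!\n")

# "a" keeps what is there and adds to the end
with open(path, "a") as out:
    out.write("How are you?")

with open(path) as handle:
    content = handle.read()

print(repr(content))
```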
Question 3. Swiss-Knife.py
• Using a database as input! Parse the entire Swiss-Prot collection
– How many entries are there?
– Average protein length (in aa) and molecular weight (MW)
– Relative frequency of amino acids
• Compare to the ones used to construct the PAM scoring matrices from 1978 – 1991
Question 3: Getting the database
Uniprot_sprot.dat.gz – 528 MB
(on GitHub under Files)
Unzipped: 2.92 GB!
http://www.ebi.ac.uk/uniprot/download-center
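The file does not have to be unzipped first: `gzip.open` in text mode reads it line by line. The sketch below counts entries by their `//` terminator lines. It writes two made-up toy records to a temporary .gz file so that it runs on its own; the record contents and the function name `count_entries` are invented for illustration.

```python
import gzip
import os
import tempfile

# Two toy records in the flat-file layout (contents are made up)
toy_records = (
    "ID   TOY1_EXAMPLE\n"
    "SQ   SEQUENCE   10 AA;\n"
    "     MKTAYIAKQR\n"
    "//\n"
    "ID   TOY2_EXAMPLE\n"
    "SQ   SEQUENCE   4 AA;\n"
    "     MKLV\n"
    "//\n"
)
path = os.path.join(tempfile.mkdtemp(), "toy_sprot.dat.gz")
with gzip.open(path, "wt") as handle:
    handle.write(toy_records)


def count_entries(gz_path):
    """Stream a gzipped flat file and count the '//' end-of-record lines."""
    entries = 0
    with gzip.open(gz_path, "rt") as handle:
        for line in handle:
            if line.startswith("//"):
                entries += 1
    return entries


print(count_entries(path))
```

Reading line by line keeps memory use flat, which matters for a file of several gigabytes.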
Amino acid frequencies
Residue 1978 1991
L 0.085 0.091
A 0.087 0.077
G 0.089 0.074
S 0.070 0.069
V 0.065 0.066
E 0.050 0.062
T 0.058 0.059
K 0.081 0.059
I 0.037 0.053
D 0.047 0.052
R 0.041 0.051
P 0.051 0.051
N 0.040 0.043
Q 0.038 0.041
F 0.040 0.040
Y 0.030 0.032
M 0.015 0.024
H 0.034 0.023
C 0.033 0.020
W 0.010 0.014
Second step: Frequencies of Occurrence
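One possible way to tally residues, sketched under the assumption that sequence lines are the ones between an `SQ` line and the closing `//` line. The function name and the toy lines are made up; in practice the lines would come from the database file.

```python
from collections import Counter


def residue_frequencies(lines):
    """Return (relative frequency per residue, list of sequence lengths)."""
    counts = Counter()
    lengths = []
    in_sequence = False
    current = 0
    for line in lines:
        if line.startswith("SQ"):
            in_sequence = True          # sequence block starts on the next line
            current = 0
        elif line.startswith("//"):
            if in_sequence:
                lengths.append(current)
            in_sequence = False         # end of record
        elif in_sequence:
            residues = line.replace(" ", "").strip()
            counts.update(residues)     # counts each letter separately
            current += len(residues)
    total = sum(counts.values())
    freqs = {aa: n / total for aa, n in counts.items()}
    return freqs, lengths


toy_lines = [
    "ID   TOY1_EXAMPLE",
    "SQ   SEQUENCE   8 AA;",
    "     AAAALLGG",
    "//",
    "ID   TOY2_EXAMPLE",
    "SQ   SEQUENCE   2 AA;",
    "     LK",
    "//",
]
freqs, lengths = residue_frequencies(toy_lines)
for aa in sorted(freqs, key=freqs.get, reverse=True):
    print(aa, round(freqs[aa], 3))
print("average length:", sum(lengths) / len(lengths))
```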
Extra Questions
• How many records have a sequence of length 260?
• What are the first 20 residues of 143X_MAIZE?
• What is the identifier for the record with the
shortest sequence? Is there more than one record
with that length?
• What is the identifier for the record with the
longest sequence? Is there more than one record
with that length?
• How many contain the subsequence "ARRA"?
• How many contain the substring "KCIP-1" in the
description?
Question 4
• Program your own prosite parser !
• Download prosite pattern database
(prosite.dat)
• Automatically generate >2000 search
patterns, and search in sequence set
from question 1
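As a starting point, a PROSITE-style pattern can be turned into a regular expression with a few string replacements. This is a minimal sketch, not a full parser: the function name is made up, the two pattern strings and the test sequences are illustrative, and edge cases such as `>` inside square brackets are not handled.

```python
import re


def prosite_to_regex(pattern):
    """Convert PROSITE pattern syntax to Python regex syntax (simple cases)."""
    pattern = pattern.rstrip(".")                          # trailing period ends a pattern
    pattern = pattern.replace("-", "")                     # '-' only separates positions
    pattern = pattern.replace("{", "[^").replace("}", "]")  # {P} = anything but P
    pattern = pattern.replace("(", "{").replace(")", "}")   # x(2,4) = repeat 2 to 4 times
    pattern = pattern.replace("x", ".")                    # x = any residue
    pattern = pattern.replace("<", "^").replace(">", "$")   # sequence start / end anchors
    return pattern


glyco = prosite_to_regex("N-{P}-[ST]-{P}.")
finger = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.")
print(glyco)
print(finger)

# Search two made-up sequences with the generated patterns
hit = re.search(glyco, "AANGSAA")
print(hit.start(), hit.group())
hit2 = re.search(finger, "GGCAACAAALAAAAAAAAHAAAH")
print(hit2.start(), hit2.end())
```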
