This document provides an introduction to Python programming for computational genomics and bioinformatics. It discusses the Python environment, integrated development environments (IDEs) like IPython Notebook and PyCharm, and various Python programming concepts including printing and manipulating text, reading and writing files, lists and loops, and writing your own functions. The document is presented as a tutorial with examples and exercises provided to help attendees learn the basics of Python programming.
This document provides an overview and introduction to using the command line interface and submitting jobs to the NIAID High Performance Computing (HPC) Cluster. The objectives are to learn basic Unix commands, practice file manipulation from the command line, and submit a job to the HPC cluster. The document covers topics such as the anatomy of the terminal, navigating directories, common commands, tips for using the command line more efficiently, accessing and mounting drives on the HPC cluster, and an overview of the cluster queue system.
The document provides an overview of Unix basics and scripting. It defines what an operating system and Unix are, describes the Unix philosophy and directory structure, and covers shells, commands, writing and executing scripts, variables, loops, and file permissions. The key topics covered include the Unix philosophy of small, modular programs; the hierarchical directory structure with / as the root; common shells like bash and commands like ls, grep, sort; and how to write simple shell scripts using variables, conditionals, and loops.
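A minimal script in the spirit of the topics listed above — a variable, a conditional, and a loop — might look like the following sketch (the script name and its contents are illustrative, not taken from the original deck):

```shell
#!/bin/bash
# greet.sh -- illustrative only: one variable, one loop, one conditional.
greeting="Hello"

for name in alice bob carol; do
    if [ "$name" = "bob" ]; then
        echo "$greeting, $name (admin)"
    else
        echo "$greeting, $name"
    fi
done
```

Saved as greet.sh and made executable with chmod +x greet.sh, it runs as ./greet.sh — tying together the scripting, permissions, and execution topics the summary lists.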
1) xv6 is a reimplementation of the Unix Version 6 operating system (V6) in ANSI C. It is used at MIT for teaching operating systems concepts.
2) The document discusses installing xv6 on a system by cloning its source code from GitHub and compiling it. Key steps include installing dependencies, QEMU, and cloning the xv6 source code.
3) An overview of xv6's structure is provided, noting it is a monolithic kernel that provides services to user processes via system calls, allowing processes to alternate between user and kernel space.
This document provides an introduction to Linux and common Linux commands. It discusses key facts about Unix, how Linux is based on Unix, popular Linux distributions like Ubuntu, and common file system layout and commands for manipulating files and directories. The document concludes with an assignment to write a Bash script to analyze and compare British and American English dictionaries.
Linux Administrator - The Linux Course on Eduonix (Paddy Lock)
Daily tasks of a Linux administrator include package management, ensuring system security through regular backups and updating of software and patches, and monitoring system performance and anticipating potential issues. When issues do arise, Linux administrators must be able to effectively use documentation like man pages to troubleshoot problems. Choosing an appropriate Linux distribution depends on factors such as software compatibility, vendor support policies, and patch release schedules.
This presentation provides some technical details on the function of the Galaxy toolshed. It was prepared for a group (Biobix at UGent) during my previous job.
The document contains 9 questions about Linux commands and concepts:
1. The differences between various Linux distributions
2. The differences between the rm and rmdir commands
3. How to modify file timestamps to make it appear a change was made earlier
4. How to print a range of lines from a file
5. The behavior of the cp command when copying to an existing directory
6. The differences between file permissions and who has access
7. Identifying issues with a file copying command
8. The differences between the ps and top commands
9. How to create shortcuts to files and directories from the command line
The document provides an overview of basic and useful UNIX commands organized into categories including essential commands, valuable commands, fun commands, helpful commands, and useful commands. It describes commands for navigating directories, manipulating files, editing text, sending email, connecting to other systems, monitoring system usage, and more. The document is intended to help users get started with common tasks in UNIX.
This document provides an overview of various Unix/Linux commands and concepts. It discusses the introduction to Unix including defining an operating system and its functionalities. It describes the evolution and structure of Unix. It covers usage of simple commands like date, who, ls and file commands like cat, cp, mv etc. It explains the Unix file system hierarchy and concepts like input/output redirection and wildcards. It also discusses environmental variables, file permissions and commands related to pipes and filters like sort and grep. Finally, it talks about editors like vi and shell programming concepts.
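The redirection, wildcard, and pipe-and-filter ideas the summary mentions can be sketched in a few lines (all file and directory names below are invented for the demonstration):

```shell
# Work in a scratch directory so nothing real is touched.
mkdir -p /tmp/unix_demo && cd /tmp/unix_demo

# Output redirection: > creates/truncates a file, >> appends to it.
echo "banana" >  fruit.txt
echo "apple"  >> fruit.txt
echo "cherry" >> fruit.txt

# Wildcards: *.txt matches every .txt file in the directory.
cat *.txt

# Pipes and filters: sort the lines, then count matches with grep.
sort fruit.txt | grep -c "an"   # prints 1 (only "banana" contains "an")
```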
Part 5 of "Introduction to Linux for Bioinformatics": Working the command line (Joachim Jacob)
This is part 5 of the training "Introduction to Linux for Bioinformatics". Here we introduce more advanced use of the command line (piping, redirecting) and provide a selection of GNU text-mining and analysis tools that assist you tremendously in handling your bioinformatics data. Interested in following this training session? Contact me at http://www.jakonix.be/contact.html
Here are some sed commands that demonstrate its capabilities:
◦ sed 's/rain/snow/' easy_sed.txt; cat easy_sed.txt
◦ sed 's/plain/mountains/' easy_sed.txt; cat easy_sed.txt
◦ sed 's/Spain/France/' easy_sed.txt; cat easy_sed.txt
◦ sed 's/^The //' easy_sed.txt; cat easy_sed.txt
◦ sed '/Spain/d' easy_sed.txt; cat easy_sed.txt
These commands demonstrate sed's substitution (s/pattern/replacement/) and deletion (/pattern/d) capabilities, using regular expressions to match patterns in the file. Note that sed writes its result to standard output only; the cat after each command shows that the file itself is unchanged, because sed does not edit in place unless invoked with -i.
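To actually persist a change, GNU sed offers -i for in-place editing. A small sketch (the real contents of easy_sed.txt are not shown above, so a stand-in line suggested by the patterns is created first):

```shell
# Stand-in input; the real easy_sed.txt contents are not shown above.
echo "The rain in Spain stays mainly in the plain" > /tmp/easy_sed.txt

# Without -i: sed prints the edited text, the file keeps the original.
sed 's/rain/snow/' /tmp/easy_sed.txt
cat /tmp/easy_sed.txt

# With -i (GNU sed; BSD sed instead wants -i ''): the file is rewritten.
sed -i 's/rain/snow/' /tmp/easy_sed.txt
cat /tmp/easy_sed.txt   # now begins "The snow in Spain ..."
```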
This document provides a quick guide to the Linux command line. It introduces Linux and the shell, and explains why the command line is useful even with graphical user interfaces. It then covers basic commands for file management, processes, archives, and input/output redirection. Finally, it briefly mentions some simple text editors and hints at using more advanced shell scripting.
This document describes the functions of various Linux commands, including commands for listing files (ls), creating directories (mkdir) and files (touch, cat), copying files (cp), changing directories (cd), moving files (mv), finding file locations (whereis, which), displaying manual pages (man, info), checking disk usage (df, du), viewing running processes (ps), setting aliases (alias), changing user identity (su, sudo), viewing command history (history), setting the system date and time (date), displaying calendars (cal), and clearing the terminal screen (clear). It provides the syntax and examples for using each command.
This document provides summaries of basic, valuable, fun, helpful, and useful UNIX commands organized into categories. It introduces the UNIX operating system and notes that free versions like Linux are gaining popularity. The summaries describe 10 essential commands like ls and cd for navigating directories. Another 10 valuable commands help manage accounts, like grep to search files and chmod to change permissions. Additional categories summarize commands for tasks like printing, emailing, drawing, and monitoring system resources. The document aims to help users get started with common UNIX commands.
This document contains interview questions for a Linux administrator role. It includes questions about shell scripting, system administration tasks, networking, and more. Some example questions are how to take input in a shell script, write a script to convert file path slashes, and explain the differences between UDP and TCP. The document provides technical questions to assess a candidate's Linux knowledge and experience.
CompTIA Linux+ Powered by LPI certifies foundational skills and knowledge of Linux. With Linux being the central operating system for much of the world’s IT infrastructure, Linux+ is an essential credential for individuals working in IT, especially those on the path of a Web and software development career. With CompTIA’s Linux+ Powered by LPI certification, you’ll acquire the fundamental skills and knowledge you need to successfully configure, manage and troubleshoot Linux systems. Recommended experience for this certification includes CompTIA A+, CompTIA Network+ and 12 months of Linux admin experience. No prerequisites required.
Course 102: Lecture 16: Process Management (Part 2) (Ahmed El-Arabawy)
This lecture continues the introduction to processes in Linux, describing both automatic processes and daemon processes.
Check the other Lectures and courses in
http://Linux4EnbeddedSystems.com
or Follow our Facebook Group at
- Facebook: @LinuxforEmbeddedSystems
Lecturer Profile:
- https://www.linkedin.com/in/ahmedelarabawy
This document discusses Linux text stream filters and provides examples of common Unix commands used to process and modify text streams. These commands include cat, head, tail, cut, and split. Cat prints the contents of files, head prints the first few lines, tail prints the last few lines, cut extracts parts of each line, and split divides files into smaller parts. The document also covers input/output redirection and how it can be used with filters to modify command output and send it to files.
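A short sketch of those filters in action — the sample file is generated on the spot, and all paths are illustrative:

```shell
# Generate a ten-line sample file to run the filters on.
seq 1 10 | sed 's/^/line /' > /tmp/stream_demo.txt

head -n 3 /tmp/stream_demo.txt        # first three lines
tail -n 2 /tmp/stream_demo.txt        # last two lines
cut -d ' ' -f 2 /tmp/stream_demo.txt  # second field of each line: 1..10
split -l 5 /tmp/stream_demo.txt /tmp/stream_part_   # two five-line pieces

# Redirection captures a filter's output in a file instead of the screen.
head -n 3 /tmp/stream_demo.txt > /tmp/first_three.txt
```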
The document provides an overview of Linux commands and the command line interface. It discusses:
1. Why the command line interface is useful and how to open the terminal emulator.
2. The different types of shells available in Linux and how to check the current shell or change shells.
3. Common Linux directory structures like /bin and /usr/bin that contain executable programs and commands.
This document provides an introduction and overview of the Unix operating system. It covers topics such as getting help, the file system, the shell, network security, email clients, text editors, input/output redirection, printing, process management, and the X window system. The document is intended to help new Unix users understand basic Unix concepts and commands.
This document provides an overview of GNU/Linux and Bash basics, including their history, file system structure, users and permissions, processes, and Bash functionality. It covers topics such as files and directories, links, file types, locations in the file system, users and groups, process states and signals, the Bash command line interface versus scripts, variables, file streams and pipelines, text processing utilities, program execution and process management, file system management, permissions, and basic network tasks. The document is intended to introduce users to fundamental Linux and Bash concepts.
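The permissions topic mentioned here comes down to three bits (read/write/execute) for each of user, group, and others; a quick sketch with a made-up file name:

```shell
# Permission bits: read/write/execute for user, group, and others.
touch /tmp/perm_demo.sh
chmod 754 /tmp/perm_demo.sh   # 7=rwx (user), 5=r-x (group), 4=r-- (others)
ls -l /tmp/perm_demo.sh       # mode column shows -rwxr-xr--
```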
Course 102: Lecture 27: FileSystems in Linux (Part 2) (Ahmed El-Arabawy)
This lecture goes through the different types of filesystems and some commands that are used with them. It introduces the filesystems ext2/3/4, JFFS2, cramfs, ramfs, tmpfs, and NFS.
Video for this lecture on youtube:
http://www.youtube.com/watch?v=XPtPsc6uaKY
This document provides an introduction to Linux and summarizes key topics including:
1. The history and development of Linux including influences from Multics and Unix as well as contributions from developers like Ken Thompson, Dennis Ritchie, and others.
2. Important related operating systems and distributions like BSD, Debian, Ubuntu, and others that helped shape Linux.
3. Core Linux concepts like the Unix philosophy, shells, files/file systems, users/permissions, and commands.
This document discusses embedded Linux programming. It covers topics such as what Linux is, the layers in a Linux system including the kernel and user programs, how Linux differs from legacy real-time operating systems, and an agenda for a course on embedded Linux driver development that covers the Linux kernel, memory management, interrupts, and networking. It also provides information on basic Linux command line tools and file permissions.
Linux is an operating system similar to Unix. The document lists and describes 27 common Linux commands, including commands for listing files (ls), removing files and directories (rm, rmdir), viewing file contents (cat, more, less), navigating and creating directories (cd, mkdir), moving and copying files (mv, cp), searching files (grep), counting characters (wc), checking the current working directory (pwd), getting command help (man), finding files and programs (whereis, find, locate), editing files (vi, emacs), connecting remotely (telnet, ssh), checking network status (netstat, ifconfig), getting information about internet hosts (whois, nslookup, dig, finger), testing network connectivity
The document provides an overview of the Linux command line interface (CLI), including:
- The CLI does not require graphics and is generally faster for experienced users than a GUI
- The bash shell is the default shell and allows command line completion
- Programs are executed through the shell and can take arguments to alter behavior
- Built-in commands are included with the shell while other programs must be found in the PATH
- Output redirection and piping allows chaining of commands and redirection of streams
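The builtin-versus-PATH distinction and the piping idea in these bullets can be checked directly from the shell (output paths vary by system):

```shell
# Builtins live inside the shell; other programs are found via $PATH.
type cd        # reports that cd is a shell builtin
command -v ls  # prints the path the shell would execute, e.g. /bin/ls

# Piping chains commands: count the directories listed in $PATH.
echo "$PATH" | tr ':' '\n' | wc -l
```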
This lecture discusses the concept of environment variables, their usage, and how processes acquire them. It then goes through the most popular ones.
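The inheritance mechanism the lecture covers can be sketched in a few lines — a variable is local to the shell until exported, after which child processes inherit it (the variable names below are hypothetical):

```shell
# A shell variable is local until exported into the environment;
# only exported variables are inherited by child processes.
MY_LOCAL="not exported"
export MY_EXPORTED="visible to children"

sh -c 'echo "child sees: $MY_EXPORTED"'   # prints the exported value
sh -c 'echo "child sees: $MY_LOCAL"'      # the value is empty here

# Two of the usual suspects such lectures cover:
echo "$HOME"   # the user's home directory
echo "$PATH"   # directories searched for commands
```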
Biopython is a set of freely available Python tools for bioinformatics and molecular biology. It provides features like parsing bioinformatics files into Python structures, a sequence class to store sequences and features, and interfaces to popular bioinformatics programs. Biopython can be used to address common bioinformatics problems like sequence manipulation, searching for primers, and running BLAST searches. The current version is 1.53 from December 2009 and future plans include updating the multiple sequence alignment object and adding a Bio.Phylo module.
Python is a popular programming language created by Guido van Rossum in 1991. It is easy to use, powerful, and versatile, making it suitable for beginners and experts alike. Python code can be written and executed in the browser using Google Colab, which provides a Jupyter notebook environment and access to computing resources like GPUs. The document then discusses installing Python using Anaconda and covers basic Python concepts like indentation, variables, strings, conditionals, and loops.
This document provides an overview of using Python for bioinformatics. It discusses why Python is useful for bioinformatics due to its built-in libraries and wide scientific use. It also covers Python basics like strings, regular expressions, control structures, lists, dictionaries, reading/writing files, and using GitHub for code sharing. Examples are given for many of these topics. Finally, it poses questions about analyzing sequence data and a protein database using Python.
This document provides an introduction to the Python programming language over 30 slides. It covers key Python concepts like variables, data types, conditionals, loops, functions, imports, strings, lists, tuples, sets, dictionaries, classes and input/output. Examples are given for each concept to demonstrate how it works in Python. The document concludes by encouraging the reader to continue learning Python through online documentation and resources.
This document provides an introduction to the Python programming language over 30 minutes. It covers basic Python concepts like variables, data types, conditionals, loops, functions, imports, strings, lists, tuples, sets, dictionaries, and classes. Code examples are provided to demonstrate how to use these features. The document encourages learners to continue learning Python through online documentation and resources.
Python is an easy-to-learn, object-oriented programming language that is widely used for scientific computing. It allows interactive testing of code through its interpreter. A user's first Python program prints "hello, world!" by saving that single line of code in a file with the .py extension and running it with the Python interpreter. Python defines six basic object types: numbers, strings, lists, tuples, dictionaries, and files. Variables store references to objects, and objects can be written directly as literals or accessed via variables. Python packages must be imported before their functions can be used. Command-line arguments provide a way to pass information into programs.
This document discusses training on Python that was conducted over six weeks by Cetpa Infotech Pvt. Ltd. It covers topics like what Python is, the differences between programs and scripting languages, Python's history and uses. It also discusses installing Python IDEs and provides examples of Python code, variables, data types, strings, lists, tuples, and control flow statements. The conclusion is that Python is a good teaching language due to being free, easy to install, and flexible for both procedural and object-oriented programming.
This document provides an introduction to the Python programming language. It begins with an agenda that covers running Python, Python programming concepts like data types and control flows, and hands-on exercises. It then discusses running Python interactively and as programs, Python syntax and basic data types like numbers, strings, lists, dictionaries, and tuples. The document is intended to help users understand the basic structure of Python and write simple Python scripts.
This document provides an introduction to the Python programming language. It discusses why Python is easy to learn, relatively fast, object-oriented, strongly typed, widely used and portable. It then provides instructions on getting started with Python on Mac, including how to start the Python interpreter and run a simple "Hello World" program. It also demonstrates using the Python interpreter interactively to test code. The document explains the basic Python object types of numbers, strings, lists, tuples, dictionaries and files. It introduces the concepts of literals, variables and the import command. It provides examples of using command line arguments in Python programs.
Sphinx autodoc - automated api documentation - PyCon.MY 2015Takayuki Shimizukawa
Takayuki Shimizukawa presented an introduction to using Sphinx and docstrings to generate documentation from Python source code. Key points included setting up Sphinx with the autodoc, autosummary, and doctest extensions to automatically generate API documentation and test code examples from docstrings. Writing informative docstrings with parameter and return type information as well as code examples allows Sphinx to generate detailed, easy to understand documentation from Python modules, functions and methods.
The PyConTW (http://tw.pycon.org) organizer wishes to improve the quality and quantity of the programming cummunities in Taiwan. Though Python is their core tool and methodology, they know it's worth to learn and communicate with wide-ranging communities. Understanding cultures and ecosystem of a language takes me about three to six months. This six-hour course wraps up what I - an experienced Java developer - have learned from Python ecosystem and the agenda of the past PyConTW.
你可以在以下鏈結找到中文內容:
http://www.codedata.com.tw/python/python-tutorial-the-1st-class-1-preface
This document provides an introduction to the Python programming language. It outlines the objectives of learning Python, which are to introduce the language, write code to automate tasks, and retrieve and use hydrologic data. The schedule for the week is also presented. Python is described as easy to learn, freely available, cross-platform, and widely used. The document then demonstrates basic Python concepts like arithmetic, strings, variables, lists, dictionaries, conditionals, loops, functions, and modules. It concludes by assigning a coding challenge for students to download USGS streamflow data and extract information from it using Python.
The document provides a tutorial on Orange, an open source data mining package. It discusses Orange's features, how to install Orange Canvas on Windows and Ubuntu, and provides Python scripting code examples for using Orange, including calculating association rule support and confidence, naive Bayes classification, regression, and k-means clustering. The Python scripts demonstrate how to load and analyze data using Orange's Python API.
This document provides an overview of the Python programming language. It begins by explaining what Python is - a general purpose, interpreted programming language that can be used as both a programming and scripting language. It then discusses the differences between programs and scripting languages. The history and creator of Python, Guido van Rossum, are outlined. The document explores the scope of Python and what tasks it can be used for. Popular companies and industries that use Python today are listed. Reasons why people use Python, such as it being free, powerful, and portable, are provided. Instructions for installing Python and running Python code are included. The document covers Python code execution and introduces basic Python concepts like variables, strings, data types, lists
Sphinx autodoc - automated API documentation (EuroPython 2015 in Bilbao)Takayuki Shimizukawa
Takayuki Shimizukawa discusses how to generate documentation from Python source code using Sphinx. He introduces Sphinx and its extensions for automating documentation generation from docstrings. He demonstrates setting up a Sphinx project and configuring extensions like autodoc, autosummary, and doctest to generate API documentation and test code examples. The presentation emphasizes best practices for writing informative docstrings and code examples to fully document modules and functions.
This document provides an overview of the Python programming language. It begins by explaining what Python is, noting that it is a general purpose programming language that is often used for scripting. The key differences between program and scripting languages are then outlined. The history and creation of Python by Guido van Rossum are summarized, along with Python's scope in fields like science, system administration, and web development. Various uses of Python are listed, followed by who commonly uses Python today such as Google and YouTube. Reasons for Python's popularity include being free, powerful, and portable. The document concludes by covering installing Python, running and executing Python code, and some basic Python concepts like strings, variables, data types, and loops/
This document provides an introduction and overview of common methods for processing and analyzing next generation sequencing (NGS) data, including mapping NGS reads and de novo assembly of NGS reads. It discusses various NGS applications such as RNA-Seq, epigenetics, structural variation detection, and metagenomics. Key steps in read alignment such as choosing an alignment program and viewing alignments are outlined. Considerations for choosing an alignment program based on library type, read type, and platform are also reviewed. Popular alignment programs including Bowtie, BWA, TopHat, and Novoalign are mentioned.
slides1-introduction to python-programming.pptxAkhdanMumtaz
This document provides an introduction to the Python programming language. It discusses what Python is, how it works, how to get Python installed, and examples of simple Python programs. Key points covered include that Python is an interpreted, high-level programming language created in 1991, it has an easy to use syntax, and it can be used for a wide range of tasks. The document also demonstrates running Python programs and discusses common programming errors like syntax, runtime, and logic errors.
Similar to Intro to Python programming and iPython (20)
Here are the steps to visualize a potential indel region after realignment:
1. Run GATK IndelRealigner on the target list:
java -jar $EBROOTGATK/GenomeAnalysisTK.jar -T IndelRealigner -R ../human_g1k_v37.fasta -I sample.dedup.bam -targetIntervals sample.intervals -o sample.realigned.bam
2. Index the realigned BAM:
samtools index sample.realigned.bam
3. Load the realigned BAM into IGV and navigate to a region of interest from the target list (sample.intervals).
4. In I
This document discusses phylogenetic analysis and tree building. It introduces the Bioinformatics and Computational Biology Branch (BCBB) group and their work analyzing biological sequences and constructing phylogenetic trees. The document explains why biological sequences are important to analyze and compares sequences to understand relatedness and evolution. It also covers multiple sequence alignment, substitution models, and algorithms for building trees, including neighbor-joining.
The webinar covered new features and updates to the Nephele 2.0 bioinformatics analysis platform. Key updates included a new website interface, improved performance through a new infrastructure framework, the ability to resubmit jobs by ID, and interactive mapping file submission. New pipelines for 16S analysis using DADA2 and quality control preprocessing were introduced, and the existing 16S mothur pipeline was updated. The quality control pipeline provides tools to assess data quality before running microbiome analyses through FastQC, primer/adapter trimming with cutadapt, and additional quality filtering options. The webinar emphasized the importance of data quality checks and highlighted troubleshooting tips such as examining the log file for error messages when jobs fail.
1) METAGENOTE is a new web-based tool for annotating genomic samples and submitting metadata and sequencing files to the Sequence Read Archive (SRA) at the National Center for Biotechnology Information (NCBI).
2) It provides templates and controlled vocabularies to streamline sample metadata annotation using existing ontologies and standards. This allows for easier cross-study comparisons.
3) The demonstration showed how to use METAGENOTE's interface to annotate a mouse ear skin sample with terms from relevant ontologies, import additional annotations in batch, and submit the metadata and files to NCBI SRA through a 5-step wizard.
This document provides an introduction to homology modeling using computational tools like I-TASSER and Phyre2. It discusses how homology modeling can be used to generate 3D structural models of proteins when an experimental structure is not available. The document addresses common questions from users and outlines the I-TASSER modeling pipeline. Hands-on exercises are provided to allow users to run homology modeling tools and examine the resulting models.
This document summarizes different computational methods for protein structure prediction, including homology modeling, fold recognition, threading, and ab initio modeling. Homology modeling relies on identifying proteins with similar sequences and known structures. Fold recognition and threading can be used when there are no homologs, to identify proteins with the same overall fold but different sequences. Ab initio modeling uses physics-based modeling and protein fragments to predict structure from sequence alone, and has challenges due to the vast number of possible conformations.
Homology modeling is a computational technique for predicting the structure of a protein target based on its sequence similarity to proteins with known structures, and it involves finding a suitable template, aligning the target and template sequences, building a 3D model of the target, and evaluating the model quality. While experimental methods like X-ray crystallography and NMR can determine protein structures, they have limitations in terms of which proteins can be studied, so computational methods like homology modeling are needed to predict structures for the many proteins whose structures remain unknown.
The document discusses function prediction for unknown proteins. It begins with an overview of common methods for function prediction, including sequence and structure similarity, domains and motifs, gene expression, and interactions. It then uses a protein called Msa as a case study, analyzing it with various tools and finding evidence it may function as a signal transducer in bacterial response to environment. Finally, it briefly discusses another protein M46 and challenges in evaluating prediction accuracy.
This presentation discusses protein structure prediction using Rosetta. It begins with an overview of the Critical Assessment of Protein Structure Prediction (CASP) experiments and notes that Rosetta is one of the top performing free-modeling servers. The presentation then describes the basic ab initio protocol used by Rosetta, which involves fragment insertion, scoring, and refinement. It also discusses limitations and success rates. Key aspects of the Rosetta energy functions and sampling algorithms are presented. Examples of specific Rosetta applications including low-resolution modeling and refinement are provided.
This document provides an outline for a presentation on biological networks, including introducing biological networks, describing their basic components and types, methods for predicting and building networks, sources of interaction data, tools for network visualization and analysis, and a demonstration of building, visualizing and analyzing biological networks using Cytoscape. The presentation covers topics like nodes and edges in networks, features used to analyze networks, methods for predicting networks from sequences and omics data, integrated databases for interaction data, and popular tools for searching, visualizing and performing network analysis.
This document provides an overview of statistical analyses that can be performed in PRISM. It discusses how to perform common statistical tests like t-tests, ANOVA, linear regression, and summarizes the appropriate tests to address different research questions. Examples are given of how to analyze pre-post treatment data using paired t-tests and compare groups using independent t-tests or ANOVA. Guidance is also provided on interpreting results and checking assumptions.
1) JMP is statistical software that allows for easy import, organization, and analysis of data. It features spreadsheet-like data tables, powerful statistical modeling capabilities, and customizable graphics.
2) The document reviews various features of JMP including importing data, organizing data tables, performing statistical analyses through platforms like distribution and fit model, and creating graphs and reports.
3) Assistance is available for using JMP through free training, support contacts, and detailed help menus within the software. JMP allows for both simple and advanced statistical analysis of data.
This document discusses methods for analyzing categorical data and response variables, including contingency tables, chi-square tests, Fisher's exact test, odds ratios, logistic regression, and generalized linear models. Contingency tables are used to display relationships between categorical variables and tests of independence. Fisher's exact test and chi-square tests determine if a relationship is statistically significant. Odds ratios and relative risk indicate the magnitude of relationships. Logistic regression models relationships between continuous predictors and categorical responses. Generalized linear models extend these methods.
This document provides a training manual on better graphics in R. It begins with an overview of R and BioConductor and reviews basic R functions. It then covers creating simple and customized graphics, multi-step graphics with legends, and multi-panel layouts. The manual aims to help researchers learn visualization techniques to improve the communication of their data and results.
This document describes two web tools that were created using R to automate biostatistics workflows: HDX NAME and DRAP. HDX NAME analyzes hydrogen-deuterium exchange mass spectrometry data to estimate protein flexibility. It computes protection factors, compares groups, and maps results to protein structures. DRAP fits logistic dose-response curves to drug screening data from multiple plates. It automates curve fitting, compares results, and exports summaries. Both tools were created with R on the backend for analysis and web interfaces for usability. This allows researchers to perform complex analyses without programming expertise.
This document discusses several common problems with data handling and quality including building and testing models with the same data, confusion between biological and technical replicates, and identification and handling of outliers. It provides examples and explanations of key concepts such as experimental and sampling units, pseudo-replication, outliers versus high influence points, and leverage plots. The importance of proper data handling techniques like dividing data into training, test, and confirmation sets and using cross-validation is emphasized to avoid overfitting models and generating spurious findings.
The document provides an overview of statistical testing, including:
- When to use parametric vs. nonparametric tests
- When large sample tests or exact tests are needed
- When adjustments for multiple testing are required
It discusses key concepts like null and alternative hypotheses, test statistics, p-values, and type I and II errors. Examples of the Student's t-test and Wilcoxon rank sum test are provided.
This document summarizes a presentation on curve fitting using GraphPad Prism. It discusses nonlinear regression techniques for analyzing dose-response and binding curve data commonly used by biologists. Specific nonlinear regression models like sigmoidal dose-response curves are described. The document provides guidance on choosing and fitting appropriate models, evaluating model fit, and improving model fit if needed.
This document provides solutions to sample problems using various datasets. It demonstrates how to use R functions like bargraph.CI(), boxplot(), hist(), and table() to analyze and visualize data. For example, it shows how to create bar charts comparing mean BMI by gender and mean AFP difference by drug concentration using the bargraph.CI() function from the sciplot package. It also provides solutions for manipulating datasets, such as recoding a variable or sorting and subsetting data.
More from Bioinformatics and Computational Biosciences Branch (20)
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
Analysis insight about a Flyball dog competition team's performanceroli9797
Insight of my analysis about a Flyball dog competition team's last year performance. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
The Ipsos - AI - Monitor 2024 Report.pdfSocial Samosa
According to Ipsos AI Monitor's 2024 report, 65% Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
Global Situational Awareness of A.I. and where its headedvikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be un-leashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Learn SQL from basic queries to Advance queriesmanishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
1. 3/30/14
1
R. Burke Squires
Computational Genomics Specialist
Bioinformatics and Computational Biosciences Branch (BCBB)
Bioinformatics & Computational Biosciences Branch (BCBB)

Why Python?
Source: http://xkcd.com/353/
Topics
§ iPython & Integrated Development Environments
§ Printing and manipulating text
§ Reading and writing files
§ Lists and Loops
§ Writing your own functions
§ Conditional tests
§ Regular Expressions
§ Dictionaries
§ Files, programs and user input
Resource:
Bioinformatics Programming Using Python
Goals
§ Introduce you to the basics of the Python programming language
§ Introduce you to the iPython environment and integrated development environments (IDEs)
§ Enable you to write or assemble scripts of your own, or modify existing scripts for your own purposes
§ Prepare you for the next session, "Introduction to Biopython Programming"
Programming… is it Magic?
§ No… BUT it can seem like it at times!
§ Working with text files
Python Scripts vs. Programs
§ Script: each function is interpreted and executed as it runs (slower)
§ Program: code is compiled once; executed as machine code (fastest)
In Your Toolbelt…
§ Python environment
§ Integrated Development Environment (IDE)
– Continuum Analytics Anaconda
§ http://continuum.io/downloads.html
– Enthought Canopy Express
§ https://www.enthought.com/products/epd/free/
– iPython Notebook
§ http://ipython.org
– PyCharm CE (Community Edition)
§ http://www.jetbrains.com/pycharm/
Python Environment
§ Open Terminal
§ Type "python" and hit return
• You should see ">>>"
§ Enter print('hello world') and hit return
§ Congratulations! You have just written your first Python script!
§ You could also put code in a text file and execute it:
• $ python script_name.py
iPython
iPython
§ IPython provides an architecture for interactive computing:
• Powerful interactive shells (terminal and Qt-based).
• A browser-based notebook with support for code,
text, mathematical expressions, inline plots and
other rich media.
• Support for interactive data visualization and use of
GUI toolkits.
• Easy to use, high performance tools for parallel
computing.
Source: ipython.org
iPython
§ Already installed
• Source: Continuum Analytics Anaconda
– http://continuum.io/downloads.html
§ Double-click on the icon on the desktop:
§ Launch the ipython-notebook
iPython – Home Screen
iPython – New Notebook
iPython
§ Add text using Cell -> Markdown
• Type "# Intro to Python"
• Type "This is my first iPython notebook."
• (To edit, change to raw text)
§ Add a code cell
§ Type print("Hello world")
§ Click the play or run button (or Cell -> Run)
Source: ipython.org
iPython Notebook
iPython Notebook Help
iPython Notebook Help
§ Add images to your notebook:
• !["DNA"](files/DNA_chain.jpg)
• (image file in the same folder as the notebook)
§ Add YouTube videos to your notebook:
• from IPython.display import YouTubeVideo
• YouTubeVideo('iwVvqwLDsJo')
Additional Tools
Canopy PyCharm
Advantages of IDEs
§ PyCharm features:
• Intelligent editor: code completion, on-the-fly error highlighting, auto-fixes, etc.
• Automated code refactorings and rich navigation capabilities
• Integrated debugger and unit testing support
• Native version control system (VCS) integrations
• Customizable UI and key bindings, with VIM emulation available
Source: http://www.jetbrains.com/pycharm/
Lastly: Python IDEs in the Cloud
§ Python Anywhere
• http://www.pythonanywhere.com
§ Python Fiddle: Python Cloud IDE
• http://pythonfiddle.com
§ Koding: Free Programming Virtual Machine
• http://koding.com
Printing and manipulating text:
"Hello World"
§ While in iPython:
• Type print("Hello world")
• "Run" the program
§ The whole thing is a statement; print is a function
§ Comments:
• # This is a comment!
Printing and manipulating text:
Storing Strings in Variables
# store a short DNA sequence in the variable my_dna
my_dna = "ATGCGTA"

# now print the DNA sequence
print(my_dna)
Source: Python for Biologists, Dr. Martin Jones
Printing and manipulating text:
Concatenation
my_dna = "AATT" + "GGCC"
print(my_dna)

upstream = "AAA"
my_dna = upstream + "ATGC"
# my_dna is now "AAAATGC"
Source: Python for Biologists, Dr. Martin Jones
Printing and manipulating text:
Finding the Length of a String
# store the DNA sequence in a variable
my_dna = "ATGCGAGT"

# calculate the length of the sequence and store it in a variable
dna_length = len(my_dna)

# print a message telling us the DNA sequence length
print("The length of the DNA sequence is " + dna_length)
# the line above raises a TypeError: dna_length is an int and must be
# converted with str() before concatenation, as below

my_dna = "ATGCGAGT"
dna_length = len(my_dna)
print("The length of the DNA sequence is " + str(dna_length))
Source: Python for Biologists, Dr. Martin Jones
Printing and manipulating text:
Replacement
protein = "vlspadktnv"

# replace valine with tyrosine
print(protein.replace("v", "y"))

# we can replace more than one character
print(protein.replace("vls", "ymt"))

# the original variable is not affected
print(protein)
Source: Python for Biologists, Dr. Martin Jones
Printing and manipulating text:
Slicing
protein = "vlspadktnv"

# print positions three to five
print(protein[3:5])

# positions start at zero, not one
print(protein[0:6])

# if we use a stop position beyond the end, it's the same as using the end
print(protein[0:60])
Source: Python for Biologists, Dr. Martin Jones
Printing and manipulating text:
Counting
protein = "vlspadktnv"

# count amino acid residues
valine_count = protein.count('v')
lsp_count = protein.count('lsp')
tryptophan_count = protein.count('w')

# now print the counts
print("valines: " + str(valine_count))
print("lsp: " + str(lsp_count))
print("tryptophans: " + str(tryptophan_count))
Source: Python for Biologists, Dr. Martin Jones
Printing and manipulating text:
Homework
Calculating AT content
Here's a short DNA sequence:

ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT

Write a program that will print out the AT content of this DNA sequence. Hint: you can use normal mathematical symbols like add (+), subtract (-), multiply (*), divide (/) and parentheses to carry out calculations on numbers in Python.

Reminder: if you're using Python 2 rather than Python 3, include this line at the top of your program:
from __future__ import division
Source: Python for Biologists, Dr. Martin Jones
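One way to approach this exercise (a sketch, not the official answer key) is to combine the count() method from the earlier slides with the arithmetic operators from the hint:

```python
# AT content = (number of A bases + number of T bases) / total length
dna = "ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT"

a_count = dna.count("A")
t_count = dna.count("T")
length = len(dna)

at_content = (a_count + t_count) / length
print("AT content is " + str(at_content))
```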
Reading and Writing Files:
Reading a File
my_file = open("dna.txt")
file_contents = my_file.read()
print(file_contents)

my_file = open("dna.txt")
my_file_contents = my_file.read()

# remove the newline from the end of the file contents
my_dna = my_file_contents.rstrip("\n")
dna_length = len(my_dna)
print("sequence is " + my_dna + " and length is " + str(dna_length))
Source: Python for Biologists, Dr. Martin Jones
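The examples above leave the file handle open; a common modern idiom not shown on the slide is the with statement, which closes the file automatically when the block ends. A small sketch, using a file name (example.txt) chosen here purely for illustration:

```python
# write a small file, then read it back; "with" closes each handle for us
with open("example.txt", "w") as out_file:
    out_file.write("ATGCGAGT\n")

with open("example.txt") as in_file:
    my_dna = in_file.read().rstrip("\n")

print("sequence is " + my_dna + " and length is " + str(len(my_dna)))
```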
Reading and Writing Files:
Writing to a File
my_file = open("out.txt", "w")
my_file.write("Hello world")

# remember to close the file
my_file.close()

# a full path can also be given when opening a file
my_file = open("/Users/martin/Desktop/myfolder/myfile.txt")
Source: Python for Biologists, Dr. Martin Jones
Reading and Writing Files:
Homework
Writing a FASTA file
FASTA is a commonly used DNA and protein sequence file format. A single sequence in FASTA format looks like this:
>sequence_name
ATCGACTGATCGATCGTACGAT
Write a program that will create a FASTA file for the following three sequences – make sure that all sequences are in upper case and only contain the bases A, T, G and C.

Sequence header   DNA sequence
ABC123            ATCGTACGATCGATCGATCGCTAGACGTATCG
DEF456            actgatcgacgatcgatcgatcacgact
HIJ789            ACTGAC-ACTGT--ACTGTA----CATGTG
Source: Python for Biologists, Dr. Martin Jones
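A possible sketch of a solution (one of several valid approaches; the file name out.fasta is arbitrary): upper-case each sequence and strip the gap dashes before writing. In these three sequences the only non-ATGC characters are dashes, so removing them satisfies the exercise:

```python
# write three cleaned sequences in FASTA format
sequences = {
    "ABC123": "ATCGTACGATCGATCGATCGCTAGACGTATCG",
    "DEF456": "actgatcgacgatcgatcgatcacgact",
    "HIJ789": "ACTGAC-ACTGT--ACTGTA----CATGTG",
}

fasta = open("out.fasta", "w")
for header, dna in sequences.items():
    cleaned = dna.upper().replace("-", "")  # upper case, drop gap characters
    fasta.write(">" + header + "\n")
    fasta.write(cleaned + "\n")
fasta.close()
```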
Lists and Loops:
Lists
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
conserved_sites = [24, 56, 132]
print(apes[0])
first_site = conserved_sites[2]

chimp_index = apes.index("Pan troglodytes")
# chimp_index is now 1

nucleotides = ["T", "C", "A", "G"]
last_ape = apes[-1]
>>> 'MNKMDLVADVAEKTDLSKAKATEVIDAVFA'[-1]
'A'
>>> 'MNKMDLVADVAEKTDLSKAKATEVIDAVFA'[-5]
'D'
>>> 'MNKMDLVADVAEKTDLSKAKATEVIDAVFA'[7 // 2]
'M'
>>> 'MNKMDLVADVAEKTDLSKAKATEVIDAVFA'[50]
Traceback (most recent call last):
  File "<pyshell#14>", line 1, in <module>
    'MNKMDLVADVAEKTDLSKAKATEVIDAVFA'[50]
IndexError: string index out of range
Slicing: [m:n]
Source: Python for Biologists, Dr. Martin Jones & O'Reilly Bioinformatics Programming Using Python
Lists and Loops:
Slicing & Appending Lists
ranks = ["kingdom", "phylum", "class", "order", "family"]
lower_ranks = ranks[2:5]
# lower ranks are class, order and family

apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
print("There are " + str(len(apes)) + " apes")
apes.append("Pan paniscus")
print("Now there are " + str(len(apes)) + " apes")
Lists and Loops:
Concatenating, Reversing & Sorting Lists
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
monkeys = ["Papio ursinus", "Macaca mulatta"]
primates = apes + monkeys
print(str(len(apes)) + " apes")
print(str(len(monkeys)) + " monkeys")
print(str(len(primates)) + " primates")

ranks = ["kingdom", "phylum", "class", "order", "family"]
print("at the start : " + str(ranks))
ranks.reverse()
print("after reversing : " + str(ranks))
ranks.sort()
print("after sorting : " + str(ranks))
Lists and Loops:
Looping through Lists
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
for ape in apes:
    print(ape + " is an ape")

apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
for ape in apes:
    name_length = len(ape)
    first_letter = ape[0]
    print(ape + " is an ape. Its name starts with " + first_letter)
    print("Its name has " + str(name_length) + " letters")
Python:
Indentation
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
for ape in apes:
    name_length = len(ape)
    first_letter = ape[0]
    print(ape + " is an ape. Its name starts with " + first_letter)
    print("Its name has " + str(name_length) + " letters")

Indentation errors

Use tabs or spaces but not both
Lists and Loops:
Using Strings as Lists, Splitting
name = "martin"
for character in name:
    print("one character is " + character)

names = "melanogaster,simulans,yakuba,ananassae"
species = names.split(",")
print(str(species))
Lists and Loops:
Looping through File, Line by Line
file = open("some_input.txt")
for line in file:
    # do something with the line
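A concrete version of this skeleton (the input file is created first here only so the example is self-contained):

```python
# create a small stand-in for some_input.txt
with open("some_input.txt", "w") as f:
    f.write("ATCG\nATCGATCG\n")

lengths = []
file = open("some_input.txt")
for line in file:
    # rstrip("\n") removes the trailing newline before measuring
    lengths.append(len(line.rstrip("\n")))
file.close()
print(lengths)
```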
Lists and Loops:
Looping with Ranges
protein = "vlspadktnv"
vls
vlsp
vlspa…

stop_positions = [3,4,5,6,7,8,9,10]
for stop in stop_positions:
    substring = protein[0:stop]
    print(substring)

for number in range(3, 8):
    print(number)

for number in range(6):
    print(number)
Lists and Loops:
Homework
§ Processing DNA in a file
• The file input.txt contains a number of DNA
sequences, one per line. Each sequence starts with
the same 14 base pair fragment – a sequencing
adapter that should have been removed. Write a
program that will (a) trim this adapter and write the
cleaned sequences to a new file and (b) print the
length of each sequence to the screen.
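A sketch of one possible solution (the two toy sequences and the output filename trimmed.txt are illustrative; the real input.txt comes with the exercise):

```python
# toy stand-in for input.txt: each line starts with the same
# 14 bp adapter (ATTCGATTATAAGC)
with open("input.txt", "w") as f:
    f.write("ATTCGATTATAAGCACTGATCGATCG\n")
    f.write("ATTCGATTATAAGCTTGCAT\n")

output = open("trimmed.txt", "w")
for line in open("input.txt"):
    sequence = line.rstrip("\n")
    trimmed = sequence[14:]        # (a) trim the 14 bp adapter
    output.write(trimmed + "\n")   # ... and write the cleaned sequence
    print(len(trimmed))            # (b) print the sequence length
output.close()
```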
Writing Your Own Functions:
Convert Code to Function
my_dna = "ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT"
length = len(my_dna)
a_count = my_dna.count('A')
t_count = my_dna.count('T')
at_content = (a_count + t_count) / length
print("AT content is " + str(at_content))
==============================================
from __future__ import division  # if using python 2

def get_at_content(dna):
    length = len(dna)
    a_count = dna.count('A')
    t_count = dna.count('T')
    at_content = (a_count + t_count) / length
    return at_content
==============================================
print("AT content is " + str(get_at_content("ATGACTGGACCA")))
Writing Your Own Functions:
Improving our Function
def get_at_content(dna, sig_figs):
    length = len(dna)
    a_count = dna.upper().count('A')
    t_count = dna.upper().count('T')
    at_content = (a_count + t_count) / length
    return round(at_content, sig_figs)

test_dna = "ATGCATGCAACTGTAGC"
print(get_at_content(test_dna, 1))
print(get_at_content(test_dna, 2))
print(get_at_content(test_dna, 3))
Writing Your Own Functions:
Improving our Function
§ Functions do not always have to take parameters
§ Functions do not always have to return a value

def get_at_content():
    test_dna = "ATGCATGCAACTGTAGC"
    length = len(test_dna)
    a_count = test_dna.upper().count('A')
    t_count = test_dna.upper().count('T')
    at_content = (a_count + t_count) / length
    print(round(at_content, 2))

§ What are the disadvantages of doing these things?
Writing Your Own Functions:
Defaults & Named Arguments
§ Function arguments can be named
§ Order then does not matter

get_at_content(dna="ATCGTGACTCG", sig_figs=2)
get_at_content(sig_figs=2, dna="ATCGTGACTCG")

§ Functions can have default values
§ Default values do not need to be provided unless a
different value is desired

def get_at_content(dna, sig_figs=2):
    (function code)
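Putting the pieces together, the earlier function body with a default value might look like this (a sketch based on the definition from the previous slides):

```python
from __future__ import division  # only needed on Python 2

def get_at_content(dna, sig_figs=2):
    length = len(dna)
    a_count = dna.upper().count('A')
    t_count = dna.upper().count('T')
    at_content = (a_count + t_count) / length
    return round(at_content, sig_figs)

print(get_at_content("ATCGTGACTCG"))              # default: 2 sig figs
print(get_at_content("ATCGTGACTCG", sig_figs=4))  # override the default
```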
Writing Your Own Functions:
Testing Functions
§ Functions should be tested with known good values
§ Functions should be tested with known bad values

assert get_at_content("ATGC") == 0.5
assert get_at_content("A") == 1
assert get_at_content("G") == 0
assert get_at_content("ATGC") == 0.5
assert get_at_content("AGG") == 0.33
assert get_at_content("AGG", 1) == 0.3
assert get_at_content("AGG", 5) == 0.33333
Conditional Tests:
True, False, If…else…elif…then
§ Python has built-in values True and False
§ Conditional statements evaluate to True or False
§ If statements use conditional statements
expression_level = 125
if expression_level > 100:
    print("gene is highly expressed")

expression_level = 125
if expression_level > 100:
    print("gene is highly expressed")
else:
    print("gene is lowly expressed")
Conditional Tests:
True, False, If…else…elif…then
file1 = open("one.txt", "w")
file2 = open("two.txt", "w")
file3 = open("three.txt", "w")
accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']
for accession in accs:
    if accession.startswith('a'):
        file1.write(accession + "\n")
    elif accession.startswith('b'):
        file2.write(accession + "\n")
    else:
        file3.write(accession + "\n")
Conditional Tests:
While loops
§ While loops repeat as long as their condition remains true
count = 0
while count < 10:
    print(count)
    count = count + 1
Conditional Tests:
Building Complex Conditions
§ Use "and", "or", and "not" to build complex
conditions
accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']
for accession in accs:
    if accession.startswith('a') and accession.endswith('3'):
        print(accession)
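The same list filtered with "or" and "not" as well (an illustrative extension of the example above):

```python
accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']

matching = []
for accession in accs:
    # starts with 'a' or 'b', but does not end with '6'
    if (accession.startswith('a') or accession.startswith('b')) \
            and not accession.endswith('6'):
        matching.append(accession)
print(matching)
```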
Regular Expressions:
Patterns in Biology
§ There are a lot of patterns in biology:
– protein domains
– DNA transcription factor binding motifs
– restriction enzyme cut sites
– runs of mononucleotides
§ Patterns in strings inside text:
– read mapping locations
– geographical sample coordinates
– taxonomic names
– gene names
– gene accession numbers
– BLAST searches
Regular Expressions:
Patterns in Biology
§ Many problems that we want to solve require more
flexible patterns:
– Given a DNA sequence, what's the length of the poly-A tail?
– Given a gene accession name, extract the part between the
third character and the underscore
– Given a protein sequence, determine if it contains this
highly-redundant domain motif
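For instance, the first problem can be sketched with the re module introduced on the next slide (the function name poly_a_length is an illustrative choice):

```python
import re

def poly_a_length(dna):
    # match one or more A's anchored to the very end of the sequence
    m = re.search(r"A+$", dna)
    if m:
        return len(m.group())
    return 0

print(poly_a_length("ATCGATCGAAAAAA"))
print(poly_a_length("ATCGATCG"))
```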
Regular Expressions:
Modules in Python
§ To search for these patterns, we use the regular expression
module “re”
import re

re.search(pattern, string)

dna = "ATCGCGAATTCAC"
if re.search(r"GAATTC", dna):
    print("restriction site found!")

if re.search(r"GC(A|T|G|C)GC", dna):
    print("restriction site found!")

if re.search(r"GC[ATGC]GC", dna):
    print("restriction site found!")
Regular Expressions:
Get String and Position of Match
§ Get the string that matched
dna = "ATGACGTACGTACGACTG"
# store the match object in the variable m
m = re.search(r"GA([ATGC]{3})AC([ATGC]{2})AC", dna)
print("entire match: " + m.group())
print("first bit: " + m.group(1))
print("second bit: " + m.group(2))

§ Get the positions of the match

print("start: " + str(m.start()))
print("end: " + str(m.end()))
Dictionaries:
Storing Paired Data
enzymes = {}
enzymes['EcoRI'] = r'GAATTC'
enzymes['AvaII'] = r'GG(A|T)CC'
enzymes['BisI'] = r'GC[ATGC]GC'

# remove the EcoRI enzyme from the dict
enzymes.pop('EcoRI')

dna = "AATGATCGATCGTACGCTGA"
counts = {}
for base1 in ['A', 'T', 'G', 'C']:
    for base2 in ['A', 'T', 'G', 'C']:
        for base3 in ['A', 'T', 'G', 'C']:
            trinucleotide = base1 + base2 + base3
            count = dna.count(trinucleotide)
            counts[trinucleotide] = count
print(counts)
Dictionaries:
Storing Paired Data
{'ACC': 0, 'ATG': 1, 'AAG': 0, 'AAA': 0, 'ATC': 2, 'AAC': 0,
'ATA': 0, 'AGG': 0, 'CCT': 0, 'CTC': 0, 'AGC': 0, 'ACA': 0,
'AGA': 0, 'CAT': 0, 'AAT': 1, 'ATT': 0, 'CTG': 1, 'CTA': 0,
'ACT': 0, 'CAC': 0, 'ACG': 1, 'CAA': 0, 'AGT': 0, 'CAG': 0,
'CCG': 0, 'CCC': 0, 'CTT': 0, 'TAT': 0, 'GGT': 0, 'TGT': 0,
'CGA': 1, 'CCA': 0, 'TCT': 0, 'GAT': 2, 'CGG': 0, 'TTT': 0,
'TGC': 0, 'GGG': 0, 'TAG': 0, 'GGA': 0, 'TAA': 0, 'GGC': 0,
'TAC': 1, 'TTC': 0, 'TCG': 2, 'TTA': 0, 'TTG': 0, 'TCC': 0,
'GAA': 0, 'TGG': 0, 'GCA': 0, 'GTA': 1, 'GCC': 0, 'GTC': 0,
'GCG': 0, 'GTG': 0, 'GAG': 0, 'GTT': 0, 'GCT': 1, 'TGA': 2,
'GAC': 0, 'CGT': 1, 'TCA': 0, 'CGC': 1}

print(counts['TGA'])
Dictionaries:
Storing Paired Data
if 'AAA' in counts:
    print(counts['AAA'])

for trinucleotide in counts.keys():
    if counts.get(trinucleotide) == 2:
        print(trinucleotide)

for trinucleotide in sorted(counts.keys()):
    if counts.get(trinucleotide) == 2:
        print(trinucleotide)

for trinucleotide, count in counts.items():
    if count == 2:
        print(trinucleotide)
Files, Programs, & User Input:
Basic File Manipulation
§ Rename a file

import os
os.rename("old.txt", "new.txt")

§ Rename a folder

os.rename("/home/martin/old_folder", "/home/martin/new_folder")

§ Check to see if a file exists

if os.path.exists("/home/martin/email.txt"):
    print("You have mail!")
Files, Programs, & User Input:
Basic File Manipulation
§ Remove a file
os.remove("/home/martin/unwanted_file.txt")
§ Remove empty folder
os.rmdir("/home/martin/empty")
§ To delete a folder and all the files in it, use
shutil.rmtree
import shutil
shutil.rmtree("/home/martin/full")
Files, Programs, & User Input:
Running External Programs
§ Run an external program
import subprocess
subprocess.call("/bin/date")

§ Run an external program with options

subprocess.call("/bin/date +%B", shell=True)

§ Saving program output

current_month = subprocess.check_output("/bin/date +%B", shell=True)
Files, Programs, & User Input:
User Input
§ Interactive user input
accession = input("Enter the accession name")
# do something with the accession variable
§ Capture command line arguments

import sys
print(sys.argv)
# python myprogram.py one two three
# sys.argv[0] is the script name; sys.argv[1:] are the arguments
Goals
§ Introduce you to the basics of the Python
programming language
§ Introduce you to the IPython environment
§ Prepare you for the next session "Introduction to
Biopython for Scientists"
§ Enable you to write or assemble scripts of your own or
modify existing scripts for your own purposes
Resources: Website
§ Websites
• http://pythonforbiologists.com
• http://www.pythonforbeginners.com
• http://www.pythontutor.com/visualize.html#mode=display
§ Free eBook in HTML / PDF
• http://pythonforbiologists.com
• http://greenteapress.com/thinkpython/
• http://openbookproject.net/books/bpp4awd/index.html
§ Python Regular Expressions (pattern matching)
• http://www.pythonregex.com
§ Python Style Guide
• http://www.python.org/dev/peps/pep-0008/
Additional Seminars
§ Introduction to BioPython for Scientists
§ Introduction to Data Analysis with Python
• Utilizing NumPy and pandas modules
Collaborations welcome
One-on-one training available for those on NIH campus and related
agencies
ScienceApps at niaid.nih.gov