Day3


Day 3 of a Python intro course for biologists.
Theme: how to work with files

Published in: Education, Technology

Transcript

  • 1. File handling
      Karin Lagesen
      karin.lagesen@bio.uio.no
  • 2. Homework
      ● ATCurve.py
          ● take an input string from the user
          ● check if the sequence only contains DNA
              – if not, prompt for a new sequence
          ● calculate a running average of AT content along the sequence. Window size should be 3, and the step size should be 1. Print one value per line.
      ● Note: you need to include several runtime examples to show that all parts of the code work.
  • 3. ATCurve.py - thinking
      ● Take input from user:
          ● raw_input
      ● Check for the presence of characters other than A, T, C and G
          ● use sets – very easy
      ● Calculate AT – window = 3, step = 1
          ● iterate over string in slices of three
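The two ideas on this slide, checking with sets and sliding a window along the string, can be sketched as small functions. This is a minimal sketch written for Python 3 (the slides use Python 2); the function names `is_dna` and `at_windows` are made up for illustration and are not part of the course material.

```python
def is_dna(sequence):
    """True if the sequence contains only A, T, G and C (any case)."""
    # Set difference leaves only the characters that are not DNA letters.
    return len(set(sequence.upper()) - set("ATGC")) == 0


def at_windows(sequence, window=3, step=1):
    """AT fraction of each window, moving `step` positions at a time."""
    sequence = sequence.upper()
    fractions = []
    # The last full window starts at len(sequence) - window.
    for i in range(0, len(sequence) - window + 1, step):
        piece = sequence[i:i + window]
        at_count = piece.count("A") + piece.count("T")
        fractions.append(at_count / float(window))
    return fractions


print(is_dna("ATGCX"))
print(at_windows("ATGCA"))
```

Slicing out each window first and counting inside the slice avoids having to pass start and end positions to `count()`.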
  • 4. ATCurve.py
      # variable valid is used to see if the string is ok or not.
      valid = False
      while not valid:
          # prompt user for input using raw_input() and store in string,
          # convert all characters into uppercase
          test_string = raw_input("Enter string: ")
          upper_string = test_string.upper()
          # Figure out if anything other than ATGC is present
          dnaset = set(list("ATGC"))
          upper_string_set = set(list(upper_string))
          if len(upper_string_set - dnaset) > 0:
              print "Non-DNA present in your string, try again"
          else:
              valid = True
      if valid:
          # one window of three per start position; the last window starts at len - 3
          for i in range(0, len(upper_string)-3+1, 1):
              at_sum = 0.0
              # the end index of count() is exclusive, so i+3 covers three characters
              at_sum += upper_string.count("A",i,i+3)
              at_sum += upper_string.count("T",i,i+3)
              # one value per line, as the homework asks
              print at_sum/3
  • 5. Homework
      ● CodonFrequency.py
          ● take an input string from the user
          ● if the sequence only contains DNA
              – find a start codon in your string
              – if a start codon is present
                  ● count the occurrences of each three-mer from the start codon onwards
                  ● print the results
  • 6. CodonFrequency.py - thinking
      ● First part – same as earlier
      ● Find start codon: locate index of AUG
          ● Note, can simplify and find ATG
      ● If start codon is found:
          ● create dictionary
          ● for slice of three in input[StartCodon:]:
              – get codon
              – if codon is in dict:
                  ● add to count
              – if not:
                  ● create key-value pair in dict
  • 7. CodonFrequency.py
      input = raw_input("Type a piece of DNA here: ")
      if len(set(input) - set(list("ATGC"))) > 0:
          print "Not a valid DNA sequence"
      else:
          atg = input.find("ATG")
          if atg == -1:
              print "Start codon not found"
          else:
              codondict = {}
              # NOTE: the stop value len(input)-3 leaves out a codon that ends
              # exactly at the last character; len(input)-2 would include it.
              for i in xrange(atg,len(input)-3,3):
                  codon = input[i:i+3]
                  if codon not in codondict:
                      codondict[codon] = 1
                  else:
                      codondict[codon] +=1
              for codon in codondict:
                  print codon, codondict[codon]
  • 8. CodonFrequency.py w/ stopcodon
      input = raw_input("Type a piece of DNA here: ")
      if len(set(input) - set(list("ATGC"))) > 0:
          print "Not a valid DNA sequence"
      else:
          atg = input.find("ATG")
          if atg == -1:
              print "Start codon not found"
          else:
              codondict = {}
              for i in xrange(atg,len(input) -3,3):
                  codon = input[i:i+3]
                  # NOTE: these are the RNA spellings of the stop codons. The input
                  # is DNA (T, not U), so this test never matches; compare against
                  # "TAG", "TAA", "TGA" to stop at a DNA stop codon.
                  if codon in ["UAG", "UAA", "UGA"]:
                      break
                  elif codon not in codondict:
                      codondict[codon] = 1
                  else:
                      codondict[codon] +=1
              for codon in codondict:
                  print codon, codondict[codon]
  • 9. Results
      [karinlag@freebee]/projects/temporary/cees-python-course/Karin% python CodonFrequency2.py
      Type a piece of DNA here: ATGATTATTTAAATG
      ATG 1
      ATT 2
      TAA 1
      [karinlag@freebee]/projects/temporary/cees-python-course/Karin% python CodonFrequency2.py
      Type a piece of DNA here: ATGATTATTTAAATGT
      ATG 2
      ATT 2
      TAA 1
      [karinlag@freebee]/projects/temporary/cees-python-course/Karin%
  • 10. Working with files
      ● Reading – get info into your program
      ● Parsing – processing file contents
      ● Writing – get info out of your program
  • 11. Reading and writing
      ● Three-step process
          ● Open file
              – create file handle – reference to file
          ● Read or write to file
          ● Close file
              – the file is closed automatically when the program ends, but it is bad form not to close it yourself
  • 12. Opening files
      ● Opening modes:
          ● "r" - read file
          ● "w" - write file
          ● "a" - append to end of file
      ● fh = open("filename", "mode")
      ● fh = filehandle, reference to a file, NOT the file itself
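The three modes can be tried out on a throwaway file. The sketch below is written for Python 3 (the slides use Python 2); the file name `modes_demo.txt` and the temporary directory are assumptions made for the demonstration, not course files.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory (assumption for this demo).
path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

# "w" creates the file, or empties it if it already exists.
fh = open(path, "w")
fh.write("first line\n")
fh.close()

# "a" keeps the existing contents and adds to the end.
fh = open(path, "a")
fh.write("second line\n")
fh.close()

# "r" opens the file for reading only.
fh = open(path, "r")
contents = fh.read()
fh.close()

print(contents)
```

Opening the same file with "w" a second time would have thrown away the first line, which is why the second write uses "a".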
  • 13. Reading a file
      ● Three ways to read
          ● read([n]) - n = bytes to read, default is all
          ● readline() - read one line, incl. newline
          ● readlines() - read file into a list, one element per line, including newline
  • 14. Reading example
      ● Log on to freebee, and go to your area
      ● do cp ../Karin/fastafile.fsa .
      ● open python
          >>> fh = open("fastafile.fsa", "r")
          >>> fh
      ● Q: what does the response mean?
  • 15. Read example
      ● Use all three methods to read the file. Print the results.
          ● read
          ● readlines
          ● readline
      ● Q: What happens after you have read the file?
      ● Q: What is the difference between the three?
  • 16. Read example
      >>> fh = open("fastafile.fsa", "r")
      >>> withread = fh.read()
      >>> withread
      '>This is the description line\nATGCGCTTAGGATCGATAGCGATTTAGA\nTTAGCGGA\n'
      >>> withreadlines = fh.readlines()
      >>> withreadlines
      []
      >>> fh = open("fastafile.fsa", "r")
      >>> withreadlines = fh.readlines()
      >>> withreadlines
      ['>This is the description line\n', 'ATGCGCTTAGGATCGATAGCGATTTAGA\n', 'TTAGCGGA\n']
      >>> fh = open("fastafile.fsa", "r")
      >>> withreadline = fh.readline()
      >>> withreadline
      '>This is the description line\n'
      >>>
  • 17. Parsing
      ● Getting information out of a file
      ● Commonly used string methods
          ● split([character]) – default is whitespace
          ● replace("in string", "put into instead")
          ● "string character".join(list)
              – joins all elements in the list with string character as a separator
              – common construction: "".join(list)
          ● slicing
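The string methods listed above can be tried on a single line of text. This is a small sketch in Python 3 (the slides use Python 2); the description line is made up and does not come from the course files.

```python
line = ">seq1 Homo sapiens\n"   # made-up FASTA description line

# Slicing: drop the leading ">" and the trailing newline.
description = line[1:-1]

# split() with no argument splits on whitespace.
words = description.split()

# replace() returns a new string; the original is left unchanged.
no_newline = line.replace("\n", "")

# join() glues list elements together, using the string it is called on
# as the separator; an empty string means no separator at all.
tabbed = "\t".join(words)
glued = "".join(["ATG", "CGC", "TTA"])

print(description)
print(words)
print(glued)
```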
  • 18. Type conversions
      ● Everything that comes from the command line or from a file is a string
      ● Conversions:
          ● int(X)
              – string cannot have decimals
              – floats are truncated toward zero (the decimals are dropped)
          ● float(X)
          ● str(X)
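A short sketch of the three conversions, written for Python 3 (the slides use Python 2). The values are made up; the variable names are only illustrative.

```python
window_text = "3"        # e.g. a window size given on the command line
fraction_text = "0.75"   # e.g. a value read from a file

window = int(window_text)
fraction = float(fraction_text)

# int() on a float drops the decimals, for negative numbers too.
positive = int(2.9)
negative = int(-2.9)

# int() refuses a string that contains a decimal point.
try:
    int(fraction_text)
    int_failed = False
except ValueError:
    int_failed = True

# str() is needed before a number can be joined onto other text.
label = "window=" + str(window)

print(window, fraction, positive, negative, int_failed, label)
```

A string such as "0.75" can still be turned into a whole number by going through float first: `int(float("0.75"))`.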
  • 19. Parsing example
      ● Continue using fastafile.fsa
      ● Print only the description line to screen
      ● Print the whole DNA string
          >>> fh = open("fastafile.fsa", "r")
          >>> firstline = fh.readline()
          >>> print firstline[1:-1]
          This is the description line
          >>> sequence = ""
          >>> for line in fh:
          ...     sequence += line.replace("\n", "")
          ...
          >>> print sequence
          ATGCGCTTAGGATCGATAGCGATTTAGATTAGCGGA
          >>>
  • 20. Accepting input from command line
      ● Need to be able to specify file name on command line
      ● Command line parameters are stored in a list called sys.argv
          – the program name is at index 0
      ● Usage:
          ● python pythonscript.py arg1 arg2 arg3....
      ● In script:
          ● at the top of the file, write import sys
          ● arg1 = sys.argv[1]
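Picking arguments out of an argv-style list might look like the sketch below (Python 3; the slides use Python 2). The helper name `parse_args` and the script name `atcurve.py` are hypothetical. A hand-made list stands in for `sys.argv` so the example runs without any command-line arguments.

```python
def parse_args(argv):
    """Return (filename, windowsize) from an argv-style list.

    argv[0] is the script name, so the real arguments start at index 1.
    """
    if len(argv) < 3:
        raise ValueError("usage: python script.py <fasta file> <window size>")
    filename = argv[1]
    windowsize = int(argv[2])  # arguments always arrive as strings
    return filename, windowsize


# In a real script: import sys, then call parse_args(sys.argv).
filename, windowsize = parse_args(["atcurve.py", "fastafile.fsa", "3"])
print(filename, windowsize)
```

Checking the length first gives a readable message instead of an IndexError when an argument is forgotten.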
  • 21. Batch example
      ● Read fastafile.fsa with all three methods
      ● Per method, print method, name and sequence
      ● Remember to close the file at the end!
  • 22. Batch example
      import sys
      filename = sys.argv[1]
      # using readline
      fh = open(filename, "r")
      firstline = fh.readline()
      name = firstline[1:-1]
      sequence = ""
      for line in fh:
          sequence += line.replace("\n", "")
      fh.close()
      print "Readline", name, sequence
      # using readlines()
      fh = open(filename, "r")
      inputlines = fh.readlines()
      fh.close()
      name = inputlines[0][1:-1]
      sequence = ""
      for line in inputlines[1:]:
          sequence += line.replace("\n", "")
      print "Readlines", name, sequence
      # using read
      fh = open(filename, "r")
      inputlines = fh.read()
      fh.close()
      name = inputlines.split("\n")[0][1:]
      sequence = "".join(inputlines.split("\n")[1:])
      print "Read", name, sequence
  • 23. Classroom exercise
      ● Modify ATCurve.py script so that it accepts the following input on the command line:
          ● fasta filename
          ● window size
      ● Let the user input an alternate filename if the file contains characters other than A, T, G and C
      ● Print results to screen
  • 24. ATCurve2.py
      import sys
      # Define filename
      filename = sys.argv[1]
      windowsize = int(sys.argv[2])
      # variable valid is used to see if the string is ok or not.
      valid = False
      while not valid:
          fh = open(filename, "r")
          inputlines = fh.readlines()
          fh.close()
          name = inputlines[0][1:-1]
          sequence = ""
          for line in inputlines[1:]:
              sequence += line.replace("\n", "")
          upper_string = sequence.upper()
          # Figure out if anything other than ATGC is present
          dnaset = set(list("ATGC"))
          upper_string_set = set(list(upper_string))
          if len(upper_string_set - dnaset) > 0:
              print "Non-DNA present in your file, try again"
              filename = raw_input("Type in filename: ")
          else:
              valid = True
      if valid:
          for i in range(0, len(upper_string)-windowsize + 1, 1):
              at_sum = 0.0
              at_sum += upper_string.count("A",i,i+windowsize)
              at_sum += upper_string.count("T",i,i+windowsize)
              print i + 1, at_sum/windowsize
  • 25. Writing to files
      ● Similar procedure as for read
          ● Open file, mode is "w" or "a"
          ● fh.write(string)
              – Note: one single string
              – No newlines are added
          ● fh.close()
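The two notes on this slide (one single string, no newline added) are easiest to see in a short example. This is a sketch in Python 3 (the slides use Python 2); the output file name `at_content.txt`, the temporary directory and the values are made up for the demonstration.

```python
import os
import tempfile

# Hypothetical output file in a temporary directory (assumption for this demo).
path = os.path.join(tempfile.mkdtemp(), "at_content.txt")

values = [0.33, 0.67, 1.0]   # made-up AT fractions

fh = open(path, "w")
for position, value in enumerate(values, 1):
    # write() takes exactly one string and adds no newline of its own,
    # so the numbers go through str() and "\n" is added by hand.
    fh.write(str(position) + " " + str(value) + "\n")
fh.close()

# Read the file back to check what was actually written.
fh = open(path, "r")
written = fh.read()
fh.close()

print(written)
```

Leaving out the "\n" would put all three records on one line of the file.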
  • 26. ATCurve3.py
      ● Modify previous script so that you have the following on the command line
          ● fasta filename for input file
          ● window size
          ● output file
      ● Output should be in the format
          ● number, AT content
          ● number is the 1-based position of the first nucleotide in the window
  • 27. ATCurve3.py
      import sys
      # Define filename
      filename = sys.argv[1]
      windowsize = int(sys.argv[2])
      outputfile = sys.argv[3]
      # ... reading and validating the input is not shown on this slide ...
      if valid:
          fh = open(outputfile, "w")
          for i in range(0, len(upper_string)-windowsize + 1, 1):
              at_sum = 0.0
              at_sum += upper_string.count("A",i,i+windowsize)
              at_sum += upper_string.count("T",i,i+windowsize)
              fh.write(str(i + 1) + " " + str(at_sum/windowsize) + "\n")
          fh.close()
  • 28. Homework: TranslateProtein.py
      ● Input files are in /projects/temporary/cees-python-course/Karin
          ● translationtable.txt - tab separated
          ● dna31.fsa
      ● Script should:
          ● Open the translationtable.txt file and read it into a dictionary
          ● Open the dna31.fsa file and read the contents
          ● Translate the DNA into protein using the dictionary
          ● Print the translation in fasta format to the file TranslateProtein.fsa. Each protein line should be 60 characters long.
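Two building blocks for this homework can be sketched without giving away the whole solution: turning tab-separated lines into a dictionary, and cutting a long string into lines of 60. The sketch is in Python 3 (the slides use Python 2). The column order of translationtable.txt is not given on the slide, so the "codon, then amino acid" layout is an assumption, and the names `parse_table` and `wrap` are made up.

```python
def parse_table(lines):
    """Build a dict from tab-separated lines.

    ASSUMPTION: each line is "<codon>\t<amino acid>"; the real
    translationtable.txt may use a different column order.
    """
    table = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:          # skip blank or incomplete lines
            table[fields[0]] = fields[1]
    return table


def wrap(text, width=60):
    """Cut text into pieces of at most `width` characters."""
    return [text[i:i + width] for i in range(0, len(text), width)]


table = parse_table(["ATG\tM\n", "TTT\tF\n"])   # made-up two-line table
pieces = wrap("M" * 130)

print(table)
print([len(piece) for piece in pieces])
```

In the real script the list passed to `parse_table` would come from `readlines()` on the opened table file, and each piece from `wrap` would be written out followed by "\n".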
