Bioinformatica 27-10-2011-p4-files


Published in: Education, Technology


  1. 2. FBW 29-10-2008 Wim Van Criekinge
  2. 3. Programming <ul><li>Variables </li></ul><ul><li>Flow control (if, regex …) </li></ul><ul><li>Loops </li></ul><ul><li>input/output </li></ul><ul><li>Subroutines/object </li></ul>
  3. 4. Three Basic Data Types <ul><li>Scalars - $ </li></ul><ul><li>Arrays of scalars - @ </li></ul><ul><li>Associative arrays of scalars, or hashes - % </li></ul>
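A minimal sketch of the three data types side by side. The variable names and values are chosen for illustration only (the gene name is borrowed from the file name used later in these slides).

```perl
use strict;
use warnings;

# Scalar: a single value (number or string).
my $gene = 'CEACAM3';

# Array: an ordered list of scalars, indexed from 0.
my @bases = ('A', 'C', 'G', 'T');

# Hash: unordered key/value pairs, looked up by key.
my %complement = (A => 'T', C => 'G', G => 'C', T => 'A');

print "Gene: $gene\n";
print "First base: $bases[0], number of bases: ", scalar(@bases), "\n";
print "Complement of A: $complement{A}\n";
```

Note that a single element of an array or hash is itself a scalar, which is why it is written with `$` (`$bases[0]`, `$complement{A}`).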
  4. 5. <ul><li>[m]/PATTERN/[g][i][o] </li></ul><ul><li>s/PATTERN/REPLACEMENT/[g][i][e][o] </li></ul><ul><li>tr/SEARCHLIST/REPLACEMENTLIST/[c][d][s] </li></ul>
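The three operators above could be used on a short DNA string as follows. This is only a sketch: the sequence is made up, and the variable names are hypothetical.

```perl
use strict;
use warnings;

my $dna = 'ATGGAATTCTAA';

# m//: does the sequence contain an EcoRI site (GAATTC)? /i ignores case.
my $has_site = ($dna =~ m/gaattc/i) ? 1 : 0;

# s///: replace every T with U to get the RNA version; /g replaces all matches.
(my $rna = $dna) =~ s/T/U/g;

# tr///: translate characters one-for-one.
(my $complement = $dna) =~ tr/ACGT/TGCA/;

# tr/// also returns the number of characters it matched; with an empty
# replacement list (and no /d) the string itself is left unchanged.
my $gc_count = ($dna =~ tr/GC//);

print "site found: $has_site\n";
print "RNA:        $rna\n";
print "complement: $complement\n";
print "G+C count:  $gc_count\n";
```

The `(my $copy = $original) =~ ...` idiom copies the string first, so the original sequence is not modified.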
  5. 6. The ‘structure’ of a Hash <ul><li>An array looks something like this: </li></ul><ul><ul><li>@array = (figure: table with the columns Index and Value) </li></ul></ul><ul><li>A hash looks something like this: </li></ul><ul><ul><li>%phone = (figure: table with the columns Key (name) and Value) </li></ul></ul>
  6. 7. Printing a hash (continued) <ul><li>First, create a list of keys. Fortunately, there is a function for that: </li></ul><ul><ul><li>keys %hash (returns a list of keys) </li></ul></ul><ul><li>Next, visit each key and print its associated value: </li></ul><ul><ul><li>foreach (keys %hash) { </li></ul></ul><ul><ul><ul><li>print "The key $_ has the value $hash{$_}\n"; </li></ul></ul></ul><ul><ul><li>} </li></ul></ul><ul><li>One complication: hashes do not maintain any order. If you put key/value pairs into a hash in a particular order, you will not necessarily get them back in that order. Use sort keys %hash when a predictable order matters. </li></ul>
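Because hash order is not predictable, a common way to print a hash is to sort its keys first. The sketch below assumes a small %phone hash with invented names and numbers.

```perl
use strict;
use warnings;

# Invented example data.
my %phone = (alice => '555-0101', bob => '555-0102', carol => '555-0103');

# sort returns the keys in alphabetical order, so the output is the
# same on every run regardless of the hash's internal order.
my @lines;
foreach my $name (sort keys %phone) {
    push @lines, "The key $name has the value $phone{$name}\n";
}
print @lines;
```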
  7. 8. <ul><li>my %AA1 = ( </li></ul><ul><li>'UUU','F', 'UUC','F', 'UUA','L', 'UUG','L', </li></ul><ul><li>'UCU','S', 'UCC','S', 'UCA','S', 'UCG','S', </li></ul><ul><li>'UAU','Y', 'UAC','Y', 'UAA','*', 'UAG','*', </li></ul><ul><li>'UGU','C', 'UGC','C', 'UGA','*', 'UGG','W', </li></ul><ul><li>'CUU','L', 'CUC','L', 'CUA','L', 'CUG','L', </li></ul><ul><li>'CCU','P', 'CCC','P', 'CCA','P', 'CCG','P', </li></ul><ul><li>'CAU','H', 'CAC','H', 'CAA','Q', 'CAG','Q', </li></ul><ul><li>'CGU','R', 'CGC','R', 'CGA','R', 'CGG','R', </li></ul><ul><li>'AUU','I', 'AUC','I', 'AUA','I', 'AUG','M', </li></ul><ul><li>'ACU','T', 'ACC','T', 'ACA','T', 'ACG','T', </li></ul><ul><li>'AAU','N', 'AAC','N', 'AAA','K', 'AAG','K', </li></ul><ul><li>'AGU','S', 'AGC','S', 'AGA','R', 'AGG','R', </li></ul><ul><li>'GUU','V', 'GUC','V', 'GUA','V', 'GUG','V', </li></ul><ul><li>'GCU','A', 'GCC','A', 'GCA','A', 'GCG','A', </li></ul><ul><li>'GAU','D', 'GAC','D', 'GAA','E', 'GAG','E', </li></ul><ul><li>'GGU','G', 'GGC','G', 'GGA','G', 'GGG','G' </li></ul><ul><li>); </li></ul>
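One way such a codon hash might be used is to translate an RNA string three bases at a time. In this sketch the hash is built programmatically from the standard genetic code (bases in the order U, C, A, G) simply to keep the example short; the `translate` helper and the use of 'X' for unknown codons are assumptions, not part of the slides.

```perl
use strict;
use warnings;

# Build the same 64-entry codon table as %AA1, compactly.
my @bases   = qw(U C A G);
my $aa_code = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %AA1;
my $i = 0;
for my $b1 (@bases) {
    for my $b2 (@bases) {
        for my $b3 (@bases) {
            $AA1{"$b1$b2$b3"} = substr($aa_code, $i++, 1);
        }
    }
}

# Hypothetical helper: translate an RNA string codon by codon.
# Incomplete trailing codons are ignored; unknown codons become 'X'.
sub translate {
    my ($rna) = @_;
    my $protein = '';
    for (my $pos = 0; $pos + 3 <= length($rna); $pos += 3) {
        my $codon = substr($rna, $pos, 3);
        $protein .= exists $AA1{$codon} ? $AA1{$codon} : 'X';
    }
    return $protein;
}

print translate('AUGGCUUAA'), "\n";
```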
  8. 9. Programming in general and Perl in particular <ul><li>There is more than one right way to do it. Unfortunately, there are also many wrong ways. </li></ul><ul><ul><li>1. Always check that the output is correct and logical. </li></ul></ul><ul><ul><ul><li>Consider what errors might occur, and take steps to account for them. </li></ul></ul></ul><ul><ul><li>2. Check that you are using every variable you declare. </li></ul></ul><ul><ul><ul><li>use strict! </li></ul></ul></ul><ul><ul><li>3. Always go back to a script once it is working and see if you can eliminate unnecessary steps. </li></ul></ul><ul><ul><ul><li>Concise code is good code. </li></ul></ul></ul><ul><ul><ul><li>You will learn more if you optimize your code. </li></ul></ul></ul><ul><ul><ul><li>Concise does not mean comment-free. Use as many comments as you think are necessary. </li></ul></ul></ul><ul><ul><ul><li>Sometimes it is better to leave easy-to-understand code in than to replace it with short but hard-to-understand tricks. Use your judgment. </li></ul></ul></ul><ul><ul><ul><li>Remember that in the future you may wish to use or alter the code you wrote today. If you don’t understand it today, you won’t tomorrow. </li></ul></ul></ul>
  9. 10. Programming in general and Perl in particular <ul><li>Develop your program in stages. Once part of it works, save the working version to another file (or use a source code control system like RCS) before continuing to improve it. </li></ul><ul><li>When running interactively, show the user signs of activity. There is no need to dump everything to the screen (unless requested), but a few words or a changing number every few minutes will show that your program is doing something. </li></ul><ul><li>Comment your script. Any information on what it is doing, or why, might be useful to you a few months later. </li></ul><ul><li>Decide on a coding convention and stick to it. For example: </li></ul><ul><ul><li>for variable names, begin globals with a capital letter and privates (my) with a lower-case letter </li></ul></ul><ul><ul><li>indent new control structures with (say) 2 spaces </li></ul></ul><ul><ul><li>line up closing braces, as in: if (....) { ... ... } </li></ul></ul><ul><ul><li>add blank lines between sections to improve readability </li></ul></ul>
  11. 12. File input / output <ul><li>Opening a filehandle </li></ul><ul><li>To use a filehandle other than STDIN, STDOUT and STDERR, the filehandle must first be opened. The open function opens a file or device and associates it with a filehandle. </li></ul><ul><li>It returns a true (nonzero) value upon success and undef otherwise. </li></ul><ul><li>Examples </li></ul><ul><li># open a filehandle for reading: </li></ul><ul><li>open (SOURCE_FILE, "filename"); </li></ul><ul><li># or </li></ul><ul><li>open (SOURCE_FILE, "<filename"); </li></ul><ul><li># open a filehandle for writing: </li></ul><ul><li>open (RESULT_FILE, ">filename"); </li></ul><ul><li># open a filehandle for appending: </li></ul><ul><li>open (LOGFILE, ">>filename"); </li></ul>
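The same three modes can also be written with the three-argument form of open and a lexical filehandle (`my $fh`), which keeps the mode separate from the file name. This is an alternative style, not the one used in the slides; the temporary file merely stands in for "filename" so that the sketch can run anywhere.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A temporary file stands in for "filename".
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
close($tmp_fh);

# Write: mode '>' creates or truncates the file.
open(my $out, '>', $filename) or die "cannot open $filename for writing: $!";
print $out "first line\n";
close($out);

# Append: mode '>>' adds to the end of the file.
open(my $log, '>>', $filename) or die "cannot open $filename for appending: $!";
print $log "second line\n";
close($log);

# Read: mode '<' opens an existing file for input.
open(my $in, '<', $filename) or die "cannot open $filename for reading: $!";
my @lines = <$in>;
close($in);

print "Read ", scalar(@lines), " lines\n";
```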
  12. 13. File input / output <ul><li>Closing a filehandle </li></ul><ul><li>When you are finished with a filehandle, you may close it with the close function. The close function closes the file or device associated with the filehandle. </li></ul><ul><li>Example: </li></ul><ul><li>close (MY_FILE_HANDLE); </li></ul><ul><li>Filehandles are automatically closed when the program exits, or when the filehandle is reopened. </li></ul>
  13. 14. File input / output <ul><li>The die function </li></ul><ul><li>Sometimes the open function fails. For example, opening a file for input might fail because the file does not exist, and opening a file for output might fail because you do not have write permission. A Perl program will nevertheless go on to use the filehandle, and unless warnings are enabled it will not tell you that all input and output on that filehandle is meaningless. </li></ul><ul><li>Therefore, it is recommended to check the result of open explicitly and, if it fails, to print an error message and exit the program. </li></ul><ul><li>This is easily done using the die function. </li></ul><ul><li>Example: </li></ul><ul><li>my $k = open (FILEHANDLE, "filename"); </li></ul><ul><li>unless ($k) { die ("cannot open file filename: $!"); } </li></ul><ul><li># in case file "filename" cannot be opened, </li></ul><ul><li># the argument of die will be printed on </li></ul><ul><li># the screen and the program will exit. </li></ul><ul><li># $! is a special variable that contains the respective </li></ul><ul><li># error message sent by the operating system. </li></ul><ul><li>A shorthand: </li></ul><ul><li>open (FILEHANDLE, "filename") || die "cannot open file filename: $!"; </li></ul>
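To see what a failed open looks like without ending the script, the die can be caught with eval. The sketch below assumes a hypothetical helper `first_line` and a file name that does not exist on the machine running it.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first line of a file, or dies with a message.
sub first_line {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "cannot open file $path: $!\n";
    my $line = <$fh>;
    close($fh);
    return $line;
}

# eval catches the die so the script can report the problem and carry on.
# The file name is assumed not to exist.
my $result = eval { first_line('no_such_file_for_this_demo.txt') };
my $error  = $@;
if (!defined $result) {
    print "Caught: $error";
}
```

The sketch uses `or` rather than `||`: `or` has very low precedence, so the check keeps working even when the parentheses around the arguments of open are left out, whereas `||` would then bind to the last argument instead of to the result of open.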
  14. 15. Using filehandles for writing <ul><li>Example: </li></ul><ul><li>#!/usr/local/bin/perl </li></ul><ul><li>use strict; </li></ul><ul><li>use warnings; </li></ul><ul><li>open (OUTF, ">out_file") || die "cannot open out_file: $!"; </li></ul><ul><li>open (LOGF, ">>log_file") || die "cannot open log_file: $!"; </li></ul><ul><li>print OUTF "Here is my program output\n"; </li></ul><ul><li>print LOGF "First task of my program completed\n"; </li></ul><ul><li>print "Nice, isn't it?\n"; # will be printed on the screen </li></ul><ul><li>close (OUTF); </li></ul><ul><li>close (LOGF); </li></ul>
  15. 16. Using filehandles for reading (2/3) <ul><li>When <FILEHANDLE> is assigned to an array variable, all lines up to the end of the file are read at once. Each line becomes a separate element of the array. </li></ul><ul><li>#!/usr/local/bin/perl </li></ul><ul><li>use strict; </li></ul><ul><li>use warnings; </li></ul><ul><li>my $infile = "CEACAM3.txt"; </li></ul><ul><li>open (FH, $infile) || die "cannot open \"$infile\": $!"; </li></ul><ul><li>my @lines = <FH>; </li></ul><ul><li>chomp (@lines); # chomp each element of @lines </li></ul><ul><li>close (FH); </li></ul><ul><li># to process the lines you might wish to iterate </li></ul><ul><li># over the @lines array with a foreach loop: </li></ul><ul><li>my $line; </li></ul><ul><li>foreach $line (@lines) { </li></ul><ul><li># process $line. here we just print it. </li></ul><ul><li>print "$line\n"; </li></ul><ul><li>} </li></ul>
  16. 17. Using filehandles for reading (1/3) <ul><li>#!/usr/local/bin/perl </li></ul><ul><li>use strict; </li></ul><ul><li>use warnings; </li></ul><ul><li>my $infile = "CEACAM3.txt"; </li></ul><ul><li>my ($line1, $line2, $line3); </li></ul><ul><li>open (FH, $infile) || die "cannot open \"$infile\": $!"; </li></ul><ul><li>$line1 = <FH>; # read first line </li></ul><ul><li>print $line1; # process line (here we only print it) </li></ul><ul><li>$line2 = <FH>; # read next line </li></ul><ul><li>print $line2; # process line (here we only print it) </li></ul><ul><li>$line3 = <FH>; # read next line </li></ul><ul><li>print $line3; # process line (here we only print it) </li></ul><ul><li>close (FH); </li></ul>
  17. 18. Using filehandles for reading (3/3) <ul><li>Using a while loop, read one line at a time and assign it to a scalar variable for as long as a line can be read. At end-of-file <FH> returns undef, which ends the loop. </li></ul><ul><li>Note that a blank line read from the file does not end the loop, since it still contains the terminating \n. </li></ul><ul><li>#!/usr/local/bin/perl </li></ul><ul><li>use strict; </li></ul><ul><li>use warnings; </li></ul><ul><li>my $infile = "CEACAM3.txt"; </li></ul><ul><li>open (FH, $infile) || die "cannot open \"$infile\": $!"; </li></ul><ul><li>my $line; # or, in one line: </li></ul><ul><li>while ($line = <FH>) { # while (my $line = <FH>) { </li></ul><ul><li>chomp ($line); </li></ul><ul><li>print "$line\n"; # process line. here we just print it. </li></ul><ul><li>} </li></ul><ul><li>close (FH); </li></ul>
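The behaviour of the while loop on blank lines can be checked with a small experiment. In this sketch an in-memory filehandle (opened on a reference to a string) stands in for CEACAM3.txt, and the three lines of data are made up.

```perl
use strict;
use warnings;

# An in-memory filehandle stands in for CEACAM3.txt; the data is made up.
my $data = "line one\n\nline three\n";
open(my $fh, '<', \$data) or die "cannot open in-memory file: $!";

my ($total, $blank) = (0, 0);
while (my $line = <$fh>) {      # Perl treats this as: while (defined(my $line = <$fh>))
    chomp($line);
    $total++;
    $blank++ if $line eq '';    # a blank line is still a line, not end-of-file
}
close($fh);

print "lines: $total, blank: $blank\n";
```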
  18. 19. <ul><li>Demo: Prosite Parser </li></ul>
  19. 20. 1. <ul><li>Database </li></ul><ul><ul><li>How many entries are there? </li></ul></ul><ul><ul><li>Average protein length (in aa and MW) </li></ul></ul><ul><ul><li>Relative frequency of amino acids </li></ul></ul><ul><ul><ul><li>Compare to the ones used to construct the PAM scoring matrices from 1978 and 1991 </li></ul></ul></ul>
  20. 21. Amino acid frequencies. Second step: Frequencies of Occurrence <ul><li>Amino acid 1978 1991 </li></ul><ul><li>L 0.085 0.091 </li></ul><ul><li>A 0.087 0.077 </li></ul><ul><li>G 0.089 0.074 </li></ul><ul><li>S 0.070 0.069 </li></ul><ul><li>V 0.065 0.066 </li></ul><ul><li>E 0.050 0.062 </li></ul><ul><li>T 0.058 0.059 </li></ul><ul><li>K 0.081 0.059 </li></ul><ul><li>I 0.037 0.053 </li></ul><ul><li>D 0.047 0.052 </li></ul><ul><li>R 0.041 0.051 </li></ul><ul><li>P 0.051 0.051 </li></ul><ul><li>N 0.040 0.043 </li></ul><ul><li>Q 0.038 0.041 </li></ul><ul><li>F 0.040 0.040 </li></ul><ul><li>Y 0.030 0.032 </li></ul><ul><li>M 0.015 0.024 </li></ul><ul><li>H 0.034 0.023 </li></ul><ul><li>C 0.033 0.020 </li></ul><ul><li>W 0.010 0.014 </li></ul>
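Relative frequencies like those in the table can be computed by counting residues in a hash and dividing by the total. The helper name `aa_frequencies` and the two toy sequences below are assumptions for illustration; a real run would feed in the sequences parsed from the database.

```perl
use strict;
use warnings;

# Hypothetical helper: relative frequency of each residue over a list of sequences.
sub aa_frequencies {
    my @seqs = @_;
    my %count;
    my $total = 0;
    for my $seq (@seqs) {
        for my $residue (split //, uc $seq) {
            $count{$residue}++;
            $total++;
        }
    }
    my %freq;
    $freq{$_} = $count{$_} / $total for keys %count;   # no keys, no division
    return \%freq;
}

my $freq = aa_frequencies('MKLA', 'AALL');   # made-up toy sequences

# Print from most to least frequent, ties in alphabetical order.
for my $residue (sort { $freq->{$b} <=> $freq->{$a} || $a cmp $b } keys %$freq) {
    printf "%s %.3f\n", $residue, $freq->{$residue};
}
```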
  21. 22. <ul><li>#!C:\Perl\bin\perl.exe -w </li></ul><ul><li># (Do not forget to adapt the path to perl.exe above to its location on your own computer) </li></ul><ul><li># Example of using substrings and files </li></ul><ul><li># in a parser for sequence information records </li></ul><ul><li>use strict; </li></ul><ul><li>use warnings; </li></ul><ul><li>my ($sp_file,$line,$id,$ac,$de); </li></ul><ul><li>$sp_file = "sp.txt"; </li></ul><ul><li>open (SP,$sp_file) || die "cannot open \"$sp_file\": $!"; </li></ul><ul><li>while ($line=<SP>){ </li></ul><ul><li>chomp($line); </li></ul><ul><li>my $field = substr ($line,0,2); # two-letter line code </li></ul><ul><li>my $value = substr ($line,5); # everything from column 6 onwards </li></ul><ul><li>if ($field eq "ID"){ </li></ul><ul><li>$id = $value; </li></ul><ul><li>} </li></ul><ul><li>if ($field eq "AC"){ </li></ul><ul><li>$ac = $value; </li></ul><ul><li>} </li></ul><ul><li>if ($field eq "DE"){ </li></ul><ul><li>$de = $value; </li></ul><ul><li>} </li></ul><ul><li>} </li></ul><ul><li>close (SP); </li></ul><ul><li>print "Identification: $id\n"; </li></ul><ul><li>print "Accession No.: $ac\n"; </li></ul><ul><li>print "Description: $de\n"; </li></ul>
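The parser above keeps only the last ID, AC and DE values it sees. A possible extension for a file with several entries is sketched below; it relies on the fact that each Swiss-Prot record ends with a line starting with `//`. The two records are made up and heavily shortened, and an in-memory filehandle stands in for sp.txt.

```perl
use strict;
use warnings;

# Two made-up, heavily shortened records in Swiss-Prot flat-file layout.
my $data = <<'END';
ID   TEST1_HUMAN
AC   X00001;
DE   Hypothetical test protein 1.
//
ID   TEST2_HUMAN
AC   X00002;
DE   Hypothetical test protein 2.
//
END

open(my $sp, '<', \$data) or die "cannot open in-memory data: $!";

my (@entries, %entry);
while (my $line = <$sp>) {
    chomp($line);
    if ($line =~ m{^//}) {              # "//" terminates a record
        push @entries, { %entry };      # store a copy of the finished record
        %entry = ();
        next;
    }
    next if length($line) < 6;          # too short to hold a code and a value
    my $field = substr($line, 0, 2);
    my $value = substr($line, 5);
    # Keep only the first line of each field of interest.
    $entry{$field} = $value
        if $field =~ /^(?:ID|AC|DE)$/ && !exists $entry{$field};
}
close($sp);

print "Entries: ", scalar(@entries), "\n";
print "$_->{ID}\t$_->{AC}\t$_->{DE}\n" for @entries;
```

Counting the elements of @entries then answers the "how many entries" question directly.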
  22. 23. 2. <ul><ul><li>Check transition matrix with and without randomizing the rows of evolutions </li></ul></ul><ul><ul><li>Adapt the program to simulate evolving DNA </li></ul></ul><ul><ul><li>Adapt the program so it generates random proteins taking into account the relative frequencies found in step 1 </li></ul></ul><ul><ul><li>Write the output to a multi-FASTA file </li></ul></ul><ul><ul><ul><li>>PAM1 </li></ul></ul></ul><ul><ul><ul><li>AHFALKJHFDLKFJHALSKJFH </li></ul></ul></ul><ul><ul><ul><li>>PAM2 </li></ul></ul></ul><ul><ul><ul><li>AHGALKJHFDLKFJHALSKJFH </li></ul></ul></ul><ul><ul><ul><li>>PAM3 </li></ul></ul></ul><ul><ul><ul><li>AHGALKJHFDLKFJHALSKJFH </li></ul></ul></ul><ul><ul><ul><li>… </li></ul></ul></ul>
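A minimal sketch of producing multi-FASTA output of the kind shown above. The helper `fasta_record` and its default line width of 60 are assumptions; the two sequences are the first two examples from the slide.

```perl
use strict;
use warnings;

# Hypothetical helper: format one FASTA record, wrapping the sequence at $width.
sub fasta_record {
    my ($name, $seq, $width) = @_;
    $width ||= 60;
    my $record = ">$name\n";
    for (my $pos = 0; $pos < length($seq); $pos += $width) {
        $record .= substr($seq, $pos, $width) . "\n";
    }
    return $record;
}

# One sequence per simulated step; these two come from the slide.
my @generations = ('AHFALKJHFDLKFJHALSKJFH', 'AHGALKJHFDLKFJHALSKJFH');

my $fasta = '';
for my $i (0 .. $#generations) {
    $fasta .= fasta_record('PAM' . ($i + 1), $generations[$i]);
}
print $fasta;
```

To write a file instead of printing to the screen, the same string could be printed to a filehandle opened for writing.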
  23. 24. <ul><li>Initialize: </li></ul><ul><ul><li>Generate a random protein (1000 aa) </li></ul></ul><ul><li>Simulate evolution (e.g. 250 rounds for PAM250) </li></ul><ul><ul><li>Apply the PAM1 transition matrix to each amino acid </li></ul></ul><ul><ul><li>Use weighted random selection </li></ul></ul><ul><li>Iterate </li></ul><ul><ul><li>Measure the difference to the original protein </li></ul></ul><ul><ul><li>Experiment: </li></ul></ul>
  24. 25. Dayhoff’s PAM1 mutation probability matrix (Transition Matrix)
  25. 26. Weighted Random Selection <ul><li>Ala => Xxx (%) </li></ul>
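Weighted random selection can be sketched as follows: draw a random number between 0 and the sum of the weights, then walk through the options, subtracting each weight until the number falls inside one of them. The helper name `weighted_choice` is hypothetical, and the weights below are toy values, not Dayhoff's PAM1 probabilities.

```perl
use strict;
use warnings;

# Hypothetical helper: pick one key from a hash of weights, with
# probability proportional to its weight (weights need not sum to 1).
sub weighted_choice {
    my ($weights) = @_;
    my $total = 0;
    $total += $_ for values %$weights;
    my $r = rand($total);                  # uniform in [0, $total)
    my @keys = sort keys %$weights;
    for my $key (@keys) {
        return $key if $r < $weights->{$key};
        $r -= $weights->{$key};
    }
    return $keys[-1];                      # guard against rounding error
}

# Toy weights only: these are NOT Dayhoff's PAM1 values.
my %ala_to = (A => 98, S => 1, G => 1);

# Draw many times and count the outcomes to check the proportions.
my %seen;
$seen{ weighted_choice(\%ala_to) }++ for 1 .. 10_000;
printf "%s: %d\n", $_, $seen{$_} for sort keys %seen;
```

In a PAM simulation, one such hash per amino acid would hold the corresponding row of the transition matrix, and every position of the protein would be redrawn once per round.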
  26. 27. PAM-Simulator
  27. 28. <ul><li>3. Palindromes </li></ul><ul><li>What is the longest palindrome in palin.fasta? </li></ul><ul><li>Why are restriction sites palindromic? </li></ul><ul><li>How long is the longest palindrome in the genome? </li></ul><ul><li>Hints: </li></ul>
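For DNA, "palindromic" usually means that a site reads the same as its own reverse complement, rather than the same backwards letter for letter. A small sketch of that check, using the EcoRI site GAATTC as an example; both helper names are hypothetical, and upper-case A, C, G, T input is assumed.

```perl
use strict;
use warnings;

# Reverse complement of a DNA string (upper-case A, C, G, T assumed).
sub reverse_complement {
    my ($dna) = @_;
    my $rc = reverse $dna;      # scalar context: reverses the string
    $rc =~ tr/ACGT/TGCA/;
    return $rc;
}

# A site is palindromic in the DNA sense if it equals its own reverse complement.
sub is_dna_palindrome {
    my ($dna) = @_;
    return $dna eq reverse_complement($dna) ? 1 : 0;
}

for my $site ('GAATTC', 'GATTACA') {
    printf "%s: %s\n", $site, is_dna_palindrome($site) ? 'palindromic' : 'not palindromic';
}
```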
  29. 30. <ul><li>#!E:\perl\bin\perl -w </li></ul><ul><li>$line_input = "edellede parterretrap trap op sirenes en er is popart test"; </li></ul><ul><li>$line_input =~ s/\s//g; # remove all whitespace </li></ul><ul><li>$l = length($line_input); </li></ul><ul><li>for ($m = 0;$m<=$l-1;$m++) </li></ul><ul><li>{ </li></ul><ul><li>$line = substr($line_input,$m); # search from every start position </li></ul><ul><li>print "length=$m:$l\t".$line."\n"; </li></ul><ul><li>for $n (8..25) </li></ul><ul><li>{ </li></ul><ul><li>$re = qr/[a-z]{$n}/; # one pattern per candidate length </li></ul><ul><li>print "pattern ($n) = $re\n"; </li></ul><ul><li>$regexes[$n-8] = $re; </li></ul><ul><li>} </li></ul><ul><li>foreach (@regexes) </li></ul><ul><li>{ </li></ul><ul><li>while ($line =~ m/$_/g) </li></ul><ul><li>{ </li></ul><ul><li>$endline = $'; </li></ul><ul><li>$match = $&; </li></ul><ul><li>$all = $match.$endline; </li></ul><ul><li>$revmatch = reverse($match); </li></ul><ul><li>if ($all =~ /^($revmatch)/) # the match equals its own reverse </li></ul><ul><li> { </li></ul><ul><li> $palindrome = $revmatch . "*" . $1 ; </li></ul><ul><li> $palhash{$palindrome}++; </li></ul><ul><li> } </li></ul><ul><li>} </li></ul><ul><li>} </li></ul><ul><li>} </li></ul><ul><li>print "Set of palindromes\n"; </li></ul><ul><li>while(($key, $value) = each (%palhash)) </li></ul><ul><li>{ </li></ul><ul><li>print "$key => $value\n"; </li></ul><ul><li>} </li></ul>
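As an alternative to the regular-expression approach above, the longest text palindrome can be found by expanding outwards from every possible centre. This is a different technique from the one on the slide, offered only as a sketch; the helper name `longest_palindrome` is hypothetical. It looks for letter-for-letter palindromes, not reverse-complement ones.

```perl
use strict;
use warnings;

# Hypothetical helper: longest palindromic substring, found by expanding
# outwards from every possible centre (on a letter or between two letters).
sub longest_palindrome {
    my ($text) = @_;
    my @chars = split //, $text;
    my ($best_start, $best_len) = (0, 0);
    for my $centre (0 .. 2 * scalar(@chars) - 2) {
        my $left  = int($centre / 2);
        my $right = $left + $centre % 2;
        while ($left >= 0 && $right < @chars && $chars[$left] eq $chars[$right]) {
            $left--;
            $right++;
        }
        my $len = $right - $left - 1;      # length of the palindrome just found
        ($best_start, $best_len) = ($left + 1, $len) if $len > $best_len;
    }
    return substr($text, $best_start, $best_len);
}

# The test sentence from the slide, with whitespace removed as in the original script.
(my $input = "edellede parterretrap trap op sirenes en er is popart test") =~ s/\s//g;
my $longest = longest_palindrome($input);
print "longest palindrome: $longest (", length($longest), " letters)\n";
```

Unlike the fixed range of 8 to 25 letters in the script above, this version has no upper limit on the palindrome length, which matters for the genome-scale question.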