Part III


Published in: Technology

  1. Computing Concepts for Bioinformatics
     • To web or not to web is the question
     • Command line/shell review
     • Command line exercise
     • Introduction to EMBOSS
     • EMBOSS input/output
     • Programming Process
     • Perl concepts
     • Your first Perl program
  2. Which tools to employ and when!
     Criteria            Web                Local
     User Interface      Simple/Easy        Not Intuitive
     Availability        Almost Reliable*   Reliable
     Restrictions        Many               Few
     Speed               Good to Bad        Fair
     Amount (# of seq)   Limited Number     High Throughput
     Storage             Limited to NONE    Excellent
     Update (Database)   Good               Fair*
     Update (Programs)   Good               Fair*
     Maintenance         Up to Provider     Good
     Cost                NONE               $$$
     Control             NONE               Excellent
  3. Things to consider for web tools
     • Web browsers have a copy/paste limit; use the file upload option for large sequences.
     • If the dataset to analyze is large, use an e-mail service or look for local resources [supercomputer, Grid, BCF].
     • Reproducibility: the underlying database or program may get updated while you are in the middle of analyzing your dataset.
     • Extensibility: if your analysis has multiple steps, rely on local resources unless you have customized web tools.
  4. Things to consider for web tools
     • Availability and stability
       • Authors move and institutes take down websites [legal and political reasons]
       • Especially for “bouquet analysis”, try to get the program and its source from the author
     • Security: many private sector grants require that you DO NOT use public resources
     • Output formats: make sure the tool uses a standard or documented format [XML, GFF]; otherwise you cannot extend or import your analysis into other programs
     • Do not abuse web resources (by writing scripts to hog them)
  5. Common Tasks
     • Count files, space used/available
     • Copy/move many files [directories]
     • Combine files
     • Split files
     • Maximize disk space
     • Transfer files [between machines/accounts]
  6. Common Tasks: Space
     • Each account is assigned a specific amount of disk space (quota)
     • To check how much disk space you have, type: quota -v
     • Output is in kilobytes [about 1000 KB = 1 MB]
     • Always check the available space before you start a large/intensive analysis
     • Running out of space leads to incomplete analyses, corrupt reports, etc.
  7. Common Tasks: Files
     • ls lists files [like dir in DOS]
     • ls -ltr [long listing, sorted by time stamp, reverse order]
     • Check the manual [man ls] for many other options
  8. Common Tasks: File and Space
     • du reports disk usage
     • Per file/directory
     • -k gives output in kilobytes
     • -s gives the total sum only [no individual entries]
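A minimal sketch of checking space before a large analysis, combining du with df. The helper names dir_kb and free_kb and the throwaway demo directory are made up for illustration; quota -v is site-specific and is not used here.

```shell
#!/bin/sh
# Hypothetical helpers (not from the slides): report sizes in kilobytes.

# Total size of a directory in KB: -s = summary only, -k = kilobyte units.
dir_kb() {
    du -sk "$1" | awk '{print $1}'
}

# Free space in KB on the file system that holds the given path.
# df -P forces the portable one-line-per-file-system format;
# column 4 of that format is the available space.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 {print $4}'
}

# Demo on a throwaway directory so nothing real is touched.
demo_dir=$(mktemp -d)
echo "ACGTACGTACGT" > "$demo_dir/seq.txt"

echo "Directory uses $(dir_kb "$demo_dir") KB; $(free_kb "$demo_dir") KB free"
```

Comparing the two numbers before starting a job is one way to follow the "always check space first" advice.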
  9. Common Tasks: File manip.
     • Merge files using concatenate (cat)
         cat file1.txt file2.txt               (prints both files to the screen)
         cat seq1.tfa seq2.tfa > big.tfa       (> creates or overwrites big.tfa)
         cat seq3.tfa seq4.tfa >> big.tfa      (>> appends to big.tfa)
     • Splitting file content can be done using the split or csplit command; we will do most of this using BioPerl
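The difference between > and >>, and a simple split, can be sketched as follows. The four tiny sequence files are invented for the demo, and everything happens in a temporary directory.

```shell
#!/bin/sh
# Work in a throwaway directory; the file names mirror the slide.
work=$(mktemp -d)

printf '>s1\nACGT\n' > "$work/seq1.tfa"
printf '>s2\nGGCC\n' > "$work/seq2.tfa"
printf '>s3\nTTAA\n' > "$work/seq3.tfa"
printf '>s4\nCCGG\n' > "$work/seq4.tfa"

# > creates big.tfa (or overwrites it if it already exists).
cat "$work/seq1.tfa" "$work/seq2.tfa" >  "$work/big.tfa"
# >> appends, keeping what is already in big.tfa.
cat "$work/seq3.tfa" "$work/seq4.tfa" >> "$work/big.tfa"

merged_lines=$(wc -l < "$work/big.tfa" | tr -d ' ')
echo "big.tfa has $merged_lines lines"

# split cuts by line count: 4 lines per piece gives part_aa and part_ab.
split -l 4 "$work/big.tfa" "$work/part_"
pieces=$(ls "$work"/part_* | wc -l | tr -d ' ')
echo "split produced $pieces pieces"
```

Note that split only counts lines, so on real FASTA files with multi-line sequences it can cut a record in the middle; that is one reason to prefer a sequence-aware tool for splitting.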
  10. Common Tasks: Compression
     • Sequence data is plain text and can be compressed by 60% to 75%
     • Compressed files are easier to handle [less time to transfer/move]
     • gzip filename [creates filename.gz]
     • gunzip filename.gz expands it again
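A small round trip with gzip and gunzip, as a sketch. The demo file is generated on the spot and is highly repetitive, so it shrinks far more than real sequence data would; the 60% to 75% figure above is the one to expect in practice.

```shell
#!/bin/sh
work=$(mktemp -d)

# Build a repetitive sequence file (made-up data, 2000 lines).
i=0
while [ "$i" -lt 2000 ]; do
    echo "ACGTTGCAACGTTGCAACGTTGCAACGTTGCA"
    i=$((i + 1))
done > "$work/demo.tfa"

cp "$work/demo.tfa" "$work/original.tfa"    # keep a copy for comparison
before=$(wc -c < "$work/demo.tfa" | tr -d ' ')

gzip "$work/demo.tfa"                       # replaces demo.tfa with demo.tfa.gz
after=$(wc -c < "$work/demo.tfa.gz" | tr -d ' ')
echo "before: $before bytes, after: $after bytes"

gunzip "$work/demo.tfa.gz"                  # restores demo.tfa
if cmp -s "$work/demo.tfa" "$work/original.tfa"; then
    roundtrip=ok
else
    roundtrip=failed
fi
echo "round trip: $roundtrip"
```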
  11. Common Tasks: File transfer
     • scp: secure copy between machines
         scp file.txt user@machine:
     • For directories (recursive), from amadeus to my u.arizona account:
         scp -r eeb/class2files/
     • To transfer files between users
  12. Common Tasks: File transfer
     • Windows to amadeus:
       • Download and install WinSCP
       • SSH from
       • Use Windows Explorer and mount your folder (more later)
     • Macintosh (OS X) to amadeus:
       • Download and install Fugu (OS X)
       • Use Finder and mount your folder cumentation/howto/html/osxsmb.html#using-finder
  13. Review
     • Open a BioDesk session
     • Open Xterminal
     • Open Editor (nedit or your favorite)
     • Command line (type into Xterminal)
     • Remember to put a space between options
         cp /home/student/samples/ ./
         cd test          (what does cd without arguments do?)
         ls -al *.pl      (what is a wild card?)
         pwd              (print working directory: where are you now? Use cd if you are lost!)
     • What are . and ..?
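The review questions about cd, . and .. can be tried out safely with a sketch like the one below. It points HOME at a temporary sandbox directory for the duration of the script, which is an assumption made only so the demo does not depend on anyone's real account.

```shell
#!/bin/sh
# Hypothetical sandbox: a throwaway "home" with one sub-directory called test.
# pwd -P resolves symbolic links so the comparisons below are stable.
HOME=$(cd "$(mktemp -d)" && pwd -P)
export HOME
mkdir "$HOME/test"

cd "$HOME/test" || exit 1
here=$(pwd -P)            # the test directory

cd .                      # . is the current directory: nothing changes
same=$(pwd -P)

cd ..                     # .. is the parent directory: one level up
parent=$(pwd -P)

cd test || exit 1
cd                        # cd without arguments returns to $HOME
home_again=$(pwd -P)

echo "here=$here"
echo "parent=$parent"
echo "home=$home_again"
```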
  14. Tips
     • When typing on the command line, use the up and down arrow keys to navigate between previous commands
     • Use the right and left arrow keys to move along the command line (to modify it)
     • When typing a command, use the “tab” key to autofill names
     • E.g. cd pu<TAB> should fill in the rest
     • If it does not, provide a few more characters (you may have 2 directories starting with pu: public_html and put_results)
  15. Exercise 2: Working with seq.
     • I have approx ???? seq in ?? files
     • Located in the directory /home/student/2005/eeb/exercise-2/
     • Check your quota (quota -v)
     1. Copy them to your account (use the tab key for autofill and remember the space between options)
          cp -r /home/student/2005/eeb/exercise-2 .
     2. Count how many files we have (hint: use ls and wc; use ls and cd to get to the correct directory)
     3. Check your quota and disk space (du -k)
     4. Count how many sequences we have
          cat *.tfa | grep ">" | wc -l
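Steps 2 and 4 of the exercise can be sketched as two small functions. The names count_files and count_seqs and the two demo files are invented here; grep -c '^>' is used as an equivalent of grep ">" | wc -l that only counts lines starting with ">".

```shell
#!/bin/sh
# Hypothetical helpers; the names are not from the slides.

# Number of .tfa files in a directory.
count_files() {
    ls "$1"/*.tfa | wc -l | tr -d ' '
}

# Number of sequences: every FASTA record starts with ">" in column 1,
# so count the lines that begin with it. The ">" must be quoted,
# otherwise the shell treats it as output redirection.
count_seqs() {
    cat "$1"/*.tfa | grep -c '^>'
}

demo=$(mktemp -d)
printf '>a\nACGT\n>b\nGGCC\n' > "$demo/1-bacteria.tfa"
printf '>c\nTTAA\n'           > "$demo/2-bacteria.tfa"

echo "files: $(count_files "$demo")  sequences: $(count_seqs "$demo")"
```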
  16. Exercise 2: Cont
     1. Gzip all files (gzip -v *.tfa)
     2. -v provides progress/report
     3. Check disk usage
     4. Use ls | more to see files
     5. Use gzcat filename | more
     6. Count the number of sequences in 1-bacteria.tfa.gz
     7. Gunzip all files (gunzip *.gz)
     8. Combine 1-bacteria.tfa and 2-bacteria.tfa into new.tfa (hint: use cat and >)
     9. Count the number of sequences in new.tfa
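One possible approach to step 6, counting sequences without expanding the file on disk. The function name count_seqs_gz and the demo data are assumptions; gzip -dc is used in place of gzcat because not every system provides gzcat.

```shell
#!/bin/sh
# Hypothetical helper: count sequences in a gzipped FASTA file.
# gzip -dc writes the expanded text to standard output and leaves
# the .gz file on disk untouched.
count_seqs_gz() {
    gzip -dc "$1" | grep -c '^>'
}

demo=$(mktemp -d)
printf '>a\nACGT\n>b\nGGCC\n>c\nTTAA\n' > "$demo/1-bacteria.tfa"
gzip "$demo/1-bacteria.tfa"

echo "sequences: $(count_seqs_gz "$demo/1-bacteria.tfa.gz")"
ls "$demo"    # still shows only 1-bacteria.tfa.gz
```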
  17. Using idfetch
     • This program gets sequences from GenBank based on the sequence ID
     • The options can be viewed by typing idfetch -
     • We will be using -t 5 (save as FASTA format)
     • -G to read the list from a file
     • -o to save the output to a file (you can also use >)
  18. Exercise with idfetch
     • Copy the file id.list (it contains GenBank ids):
         cp /home/student/exercise/1/id.list .
     • Use idfetch to get the sequences:
         idfetch -t 5 -G id.list -o seq.out
     • Use more to look at the output:
         more seq.out
       • Space moves down 1 page
       • Enter moves down 1 line
       • /word jumps to the next occurrence of word
       • n repeats the word search
       • q or Control-C exits more
     • Experiment with different values for the -i option (hint: remove -o seq.out and use idfetch - to see all options)
  19. Setting up your Editor (Nedit)
     • Next step is already done for you
     • Set your preferences (syntax highlight, line number)
     • Save default
     • Exit
     • Restart
  20. Nedit (File dialog box)
     • Ignore entries beginning with . (hidden files)
     • Double click on a directory, or select it with the mouse and use the “enter” key
     • What are . and ..?
     • Use the filter if you have many files ( *.pl )
     • Select the file to edit/open with the mouse (it should have a black background), then click on OK
     • Save (Control-s) and Save As
  21. EMBOSS
     • European Molecular Biology Open Software Suite (now hosted at SourceForge)
     • Free open source software analysis package, specially developed for the needs of the molecular biology community
     • Provides a comprehensive set of sequence analysis programs (100+)
     • Current version 3.0, July 2005
  22. EMBOSS (programs)
     • Integrates other publicly available packages
     • Can be accessed through BioPerl modules (easy automation)
     • Sequence alignment
     • Rapid database searching with sequence patterns
     • Protein motif identification, including domain analysis
     • Nucleotide sequence pattern analysis, for example to identify CpG islands or repeats
     • Codon usage analysis for small genomes
     • Rapid identification of sequence patterns in large scale sequence sets
     • Database creation/indexing
  23. Interacting with EMBOSS
     • EMBOSS programs are run by typing their names at the UNIX prompt (in your Xterminal), with or without parameters/options
     • EMBOSS command syntax follows normal UNIX command conventions
     • A program will prompt you for any required parameters not provided when invoking it
     • When in doubt use:
         program_name -help     (e.g. seqret -help)
         tfm program_name       (e.g. tfm seqret)
     • Use wossname to search for a program by keyword
  24. Sequence Formats
     • Sequence formats:
       • FASTA
       • GenBank
       • EMBL
       • SwissProt
       • PIR
     • FASTA format:
         >Seq_Name description and some other comment
         ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcagctctttgtcgcgcccagg
         agctacacac
     • IDs and accessions
       • The ID was human readable, and the name suggested function etc.
       • Accession numbers are database assigned (nowadays they are the same as the ID)
       • ID 'hsfau' is the 'Homo sapiens FAU pseudogene'; its accession # is X65923 (sometimes Accession.1 for the version)
     • Multiple sequences per file
     • No connection between file name and ID
     • GFF and reports (covered later on)
  25. EMBOSS & USA
     • USA (Uniform Sequence Address)
       • "format::file"
       • "format::file:entry"
       • "dbname:entry" (we don't have this configured)
       • "@listfile" (a file of file names; ls *.seq > mylist )
     • The format is not required when reading in a sequence; EMBOSS will guess the sequence format by trying all known formats until one succeeds
     • When writing out a sequence, EMBOSS uses FASTA format by default. You can specify another format to use:
         gcg::myresults.seq
         embl::myresults.out
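Building a list file for the "@listfile" form can be sketched like this. Only the list-building step is shown; no EMBOSS program is run, and the two .seq files are invented for the demo.

```shell
#!/bin/sh
# Only the list-file step is sketched; no EMBOSS program is run here.
work=$(mktemp -d)
printf '>s1\nACGT\n' > "$work/a.seq"
printf '>s2\nGGCC\n' > "$work/b.seq"

# One file name per line, as in the slide: ls *.seq > mylist
# The list has no .seq ending, so it never matches *.seq itself.
ls "$work"/*.seq > "$work/mylist"

entries=$(wc -l < "$work/mylist" | tr -d ' ')
echo "mylist holds $entries file names:"
cat "$work/mylist"
# An EMBOSS program would then be given the list as @mylist.
```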
  26. Additional Help With UNIX and Perl
  27. Programming Process
     • When asked to develop something, look around before you reinvent the wheel
     • Requirement analysis: what input, output, formats, source of data, frequency of update, etc.
     • Design phase (how and what to use)
     • Flow charts (for logic and data), UML, use cases
     • Pseudocode
         get filename
         open file and read sequences
         for each sequence
             if length is greater than 100Kb
                 print error msg
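The pseudocode above can be turned into a short working sketch. This version uses the shell with awk rather than the Perl the course goes on to teach, and the function name check_lengths, the message wording, and the demo file are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print an error line for each FASTA sequence
# longer than MAX_LEN bases. Usage: check_lengths FILE MAX_LEN
check_lengths() {
    awk -v max="$2" '
        # Report the sequence collected so far if it is too long.
        function report() {
            if (name != "" && len > max)
                printf "ERROR: %s is %d bases (limit %d)\n", name, len, max
        }
        # Header line: finish the previous record, start a new one.
        /^>/ { report(); name = substr($1, 2); len = 0; next }
        # Sequence line: add its length to the running total.
             { len += length($0) }
        # End of file: the last record still needs checking.
        END  { report() }
    ' "$1"
}

demo=$(mktemp)
printf '>short\nACGT\n>long\nACGTACGT\nACGTACGT\n' > "$demo"
check_lengths "$demo" 10
```

For the 100Kb limit in the pseudocode the call would be check_lengths file.tfa 100000. Lengths are summed across lines because a FASTA sequence usually spans several lines.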
  28. Programming …
     • Now start coding
     • Always comment your code
     • Use version control: filename.1 etc. for small projects, or CVS
     • Code has to be human readable but machine parseable!
     • Test and debug code using different scenarios for input
     • Don't feel shy about using paper and pencil; it's easier at times
  29. Introduction to Perl
     • Invoking Perl
     • Basic input/output: STDIN, STDOUT, print, and writing to files, sockets
     • Variables: scalar data
       • Numbers: 12, 12e5, -12.534
       • Strings: "who likes Austin Powers?"
       • Operators: +, -, <, >, =
     • Flow control: if, while, for, foreach
     • Arrays
  30. Invoking Perl
     • First line of a Perl program: #! /usr/local/bin/perl
     • # by itself starts a comment, i.e. the rest of the line is not interpreted
     • It is important to comment your code!!!
         #!/usr/local/bin/perl
         # Program by Baha Men (Sept 09,2004)
         print "Who let the dogs out?\n";
         # The above line outputs to screen
         # the (only) famous song by the group
  31. Variables
     • A variable is something that stores values while your program is running
     • You can set initial values of variables and modify these values as the program executes
     • No need to predefine
     • Automatically get global scope *
     • You can store numbers and text in variables
         $a = 1;
         $z = $a + 3.1412653505;
         $b = "I put the cat out";     # Note the " " for text
         $gene_name = "C127899.1";
  32. Arithmetic Operators
         $a = 1 + 2;      # Add 1 and 2 and store in $a
         $a = 3 - 4;      # Subtract 4 from 3 and store in $a
         $a = 5 * 6;      # Multiply 5 and 6
         $a = 7 / 8;      # Divide 7 by 8 to give 0.875
         $a = 9 ** 10;    # Nine to the power of 10
         $a = 5 % 2;      # Remainder of 5 divided by 2
         $a++;            # Increment $a by 1
         $a--;            # Decrement $a by 1
         if ($a <= 2)     # Less than or equal
  33. String Operator
         $b = "Hello"; $c = "World";
         $a = $b . $c;    # Concatenate $b and $c
         print $a;        # This is HelloWorld
     • We can do something similar by placing variables inside the string:
         print "$b $c from me\n";
         # This will print Hello World from me
         # Followed by a newline
     • Difference between ' and " covered later
     • \n is newline; it puts a line break between 2 lines
     • \t is the tab character, e.g. "Hello\tWorld"
  34. Testing Values
         $a == $b    # Is $a numerically equal to $b?
                     # Beware: Don't use the = operator.
         $a != $b    # Is $a numerically not equal to $b?
         $a eq $b    # Is $a string-equal to $b?
         $a ne $b    # Is $a string not equal to $b?
     • Use == for numbers and eq for strings
  35. Flow Control
     • for
         for (initialize; test; increment) {
             first_action;
             second_action;
             etc
         }

         for ($count = 0; $count < 10; $count++) {
             print "Count is = $count\n";
         }
     • while
         while (condition) {
             first_action;
             second_action;
             etc
         }

         while ($president ne "Nader") {
             print "Try again\n";       # Ask again
             $president = <STDIN>;      # Get input
             chomp $president;          # Chop off newline
         }
     • if
         if (condition) {
             first_action;
             second_action
         }
  36. Arrays
     • An array variable is a list of scalars (i.e. numbers and/or strings)
     • Same naming format as scalars except that arrays start with @, i.e. @names is an array while $names is a scalar
         @names = ("Hillary", "Dick", "Ralph");
         @party = ("Democrat", "Republican", "Green");
     • Array data can be referenced using the index number, which starts from 0: $names[0] is Hillary and $party[1] is Republican
     • You can set values using $names[3] = "Pat";
  37. I get the picture ..just get on with it!
     • Your first program
     • Create the directory prog1 and save files there
     • Print hello world (that's too easy)
     • Ask the user for a name
     • Greet the user
     • Ask the user for a password
     • If it matches the password yahoo then greet, else boot
     • You can run it with perl followed by the file name, or chmod u+x the file and run it with ./ (remember to cd prog1)
  38. Your program
         #!/usr/local/bin/perl
         # Customary first line
         print "Please enter your name: ";       # Prompts the user to type and hit enter
         $name = <STDIN>; chomp($name);           # Read from keyboard and remove newline
         print "Hello $name please give me secret password:";
         $password = <STDIN>; chomp($password);   # Remove newline, or the comparison never matches
         # Now compare it to hidden password
         if ($password eq "yahoo") {
             print "Welcome buddy\n";
         } else {
             print "Bite Me: $password is invalid\n";
         }
  39. Next Class
     • Bring ideas/questions related to your final project
     • NCBI databases
     • NCBI e-utils
     • More Perl and BioPerl