Part III
Published in: Technology
Transcript

  • 1. Computing Concepts for Bioinformatics
      • To web or not to web is the question
      • Command line/shell review
      • Command line exercise
      • Introduction to EMBOSS
      • EMBOSS input/output
      • Programming process
      • Perl concepts
      • Your first Perl program
      http://amadeus.biosci.arizona.edu/~nirav
  • 2. Which tools to employ and when!
      Criteria             Web                Local
      User interface       Simple/Easy        Not intuitive
      Availability         Almost reliable*   Reliable
      Restrictions         Many               Few
      Speed                Good to bad        Fair
      Amount (# of seq)    Limited number     High throughput
      Storage              Limited to NONE    Excellent
      Update (database)    Good               Fair*
      Update (programs)    Good               Fair*
      Maintenance          Up to provider     Good
      Cost                 NONE               $$$
      Control              NONE               Excellent
  • 3. Things to consider for web tools
      • Web browsers have a copy/paste limit; use the file upload option for large sequences.
      • If the dataset to analyze is large, use an e-mail service or look for local resources [supercomputer, Grid, BCF].
      • Reproducibility: the underlying database or program may get updated while you are in the middle of analyzing your dataset.
      • Extensibility: if your analysis has multiple steps, rely on local resources unless you have customized web tools.
  • 4. Things to consider for web tools
      • Availability and stability
          • Authors move, and institutes take down websites [legal and political reasons].
          • Especially for "bouquet analysis", try to get the program and its source from the author.
      • Security: many private-sector grants require that you DO NOT use public resources.
      • Output formats: make sure the tool uses a standard or documented format [XML, GFF]; otherwise you cannot extend or import your analysis into other programs.
      • Do not abuse web resources (by writing scripts that hog them).
  • 5. Common Tasks
      • Count files, space used/available
      • Copy/move many files [directories]
      • Combine files
      • Split files
      • Maximize disk space
      • Transfer files [between machines/accounts]
  • 6. Common Tasks: Space
      • Each account is assigned a specific amount of disk space (quota).
      • To check how much disk space you have, type: quota -v
      • Output is in KB [roughly 1000 KB = 1 MB].
      • Always check that space is available before you start a large/intensive analysis.
      • Running out of space will lead to incomplete analyses, corrupt reports, etc.
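The "check before you start" advice can be scripted. The sketch below is only an illustration: it uses `df` as a stand-in for `quota -v` (which needs quotas to be configured on the machine), and `needed_kb` is a hypothetical threshold, not a value from the slides.

```shell
#!/bin/sh
# Sketch: compare free space in the current directory against a threshold.
# needed_kb is a made-up figure for illustration.
needed_kb=1024    # about 1 MB, expressed in KB

# df -Pk prints one POSIX-format line per file system;
# on the data line, column 4 is the available space in KB.
avail_kb=$(df -Pk . | awk 'NR == 2 { print $4 }')

if [ "$avail_kb" -lt "$needed_kb" ]; then
    echo "Only ${avail_kb} KB free; need ${needed_kb} KB" >&2
else
    echo "OK: ${avail_kb} KB free"
fi
```

On a system with per-user quotas, the quota limit rather than the file system's free space is the number that matters, so treat the `df` figure as an upper bound.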
  • 7. Common Tasks: Files
      • ls lists files [like dir in DOS]
      • -ltr [long format, sorted by time stamp, reverse order]
      • Check the manual [man ls] for many other options
  • 8. Common Tasks: File and Space
      • du reports disk usage
      • Per file/directory
      • -k gives output in KB
      • -s gives the total sum only [no individual entries]
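A small sketch of the difference between `du -k` and `du -sk`, run against a throwaway directory. The directory and file names are invented for the example.

```shell
#!/bin/sh
# Sketch: du -k lists every directory, du -sk prints one summary line.
workdir=$(mktemp -d)
mkdir "$workdir/seqs"
printf '>seq1\nACGT\n' > "$workdir/seqs/a.tfa"
printf '>seq2\nGGCC\n' > "$workdir/seqs/b.tfa"

du -k "$workdir"     # one line for seqs/, one for the top directory
du -sk "$workdir"    # a single total

lines_all=$(du -k "$workdir" | wc -l | tr -d ' ')
lines_sum=$(du -sk "$workdir" | wc -l | tr -d ' ')
echo "du -k printed $lines_all lines, du -sk printed $lines_sum"

rm -rf "$workdir"
```

Adding `-a` to `du -k` would also list the individual files rather than only the directories.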
  • 9. Common Tasks: File manip.
      • Merge files using concatenate (cat)
          cat file1.txt file2.txt             (prints both files to the screen)
          cat seq1.tfa seq2.tfa > big.tfa     (> creates or overwrites big.tfa)
          cat seq3.tfa seq4.tfa >> big.tfa    (>> appends to big.tfa)
      • Splitting file content can be done using the split or csplit command ... we will do most of this using BioPerl
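The `>` and `>>` redirections can be checked with four toy FASTA files. This is a minimal sketch; the sequences are invented and only the file names follow the slide.

```shell
#!/bin/sh
# Sketch: build big.tfa in two steps and confirm nothing was lost.
workdir=$(mktemp -d)
for i in 1 2 3 4; do
    printf '>s%s\nACGT\n' "$i" > "$workdir/seq$i.tfa"
done

cat "$workdir/seq1.tfa" "$workdir/seq2.tfa" >  "$workdir/big.tfa"   # create/overwrite
cat "$workdir/seq3.tfa" "$workdir/seq4.tfa" >> "$workdir/big.tfa"   # append

# Each FASTA record starts with ">", so counting those lines counts sequences.
n_records=$(grep -c '^>' "$workdir/big.tfa")
echo "big.tfa holds $n_records sequences"

rm -rf "$workdir"
```

Using `>` for the second command instead of `>>` would silently replace the first two sequences, which is the usual way this goes wrong.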
  • 10. Common Tasks: Compression
      • Sequence data is plain text and can be compressed by 60% to 75%.
      • Compressed files are easier to handle [less time to transfer/move].
      • gzip filename [creates filename.gz]
      • gunzip expands it again
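To see the effect of `gzip` on sequence-like text, the sketch below compresses a toy file and compares byte counts. The file is artificially repetitive, so it shrinks far more than real data would; the 60% to 75% figure on the slide is the one to expect in practice.

```shell
#!/bin/sh
# Sketch: measure a file before and after gzip, then restore it.
workdir=$(mktemp -d)

# Build a toy sequence file by repeating one line 200 times.
i=0
while [ "$i" -lt 200 ]; do
    echo "ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgcc"
    i=$((i + 1))
done > "$workdir/toy.tfa"

before=$(wc -c < "$workdir/toy.tfa" | tr -d ' ')
gzip "$workdir/toy.tfa"                       # replaces toy.tfa with toy.tfa.gz
after=$(wc -c < "$workdir/toy.tfa.gz" | tr -d ' ')
echo "before: $before bytes, after: $after bytes"

gunzip "$workdir/toy.tfa.gz"                  # brings toy.tfa back
restored=$(wc -c < "$workdir/toy.tfa" | tr -d ' ')

rm -rf "$workdir"
```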
  • 11. Common Tasks: File transfer
      • scp: secure copy between machines
          scp file.txt user@machine:
          scp -r eeb/class2files/ nirav@u.arizona.edu:
        The -r option is for directories (recursive); this copies from amadeus to my u.arizona account.
      • To transfer files between users
  • 12. Common Tasks: File transfer
      • Windows to amadeus:
          • Download and install WinSCP http://winscp.sourceforge.net/eng/download.php
          • SSH from http://sitelicense.arizona.edu
          • Use Windows Explorer and mount your folder (more later)
      • Macintosh (OS X) to amadeus:
          • Download and install Fugu (OS X) http://rsug.itd.umich.edu/software/fugu/
          • Use Finder and mount your folder http://www.opensource.apple.com/projects/documentation/howto/html/osxsmb.html#using-finder
  • 13. Review
      • Open a BioDesk session
      • Open Xterminal
      • Open an editor (nedit or your favorite)
      • Command line (type into Xterminal)
      • Remember to put a space between options
          cp /home/student/samples/runme.pl ./
          cd test       (what does cd without arguments do?)
          ls -al *.pl   (what is a wild card?)
          pwd           (print working directory ... where are you now? Use cd if you are lost!)
      • What are . and .. ?
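The review questions can be explored safely in a scratch directory. In this sketch the directory layout and the extra file names (`hello.pl`, `notes.txt`) are invented; only `runme.pl` and `test` come from the slide.

```shell
#!/bin/sh
# Sketch: wild cards, ".", and ".." in a throwaway directory.
workdir=$(mktemp -d)
mkdir "$workdir/test"
touch "$workdir/test/runme.pl" "$workdir/test/hello.pl" "$workdir/test/notes.txt"

cd "$workdir/test" || exit 1
pwd                                    # shows where you are

# The shell expands *.pl to every name ending in .pl before ls runs.
n_pl=$(ls *.pl | wc -l | tr -d ' ')
echo "$n_pl Perl files"

cd ..                                  # .. is the parent directory
cd ./test || exit 1                    # .  is the current directory
here=$(basename "$(pwd)")
echo "back in: $here"

# cd with no arguments would take you to your home directory.
cd / && rm -rf "$workdir"
```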
  • 14. Tips
      • When typing on the command line, use the up and down arrow keys to move between previous commands.
      • Use the right and left arrow keys to move along the command line (to modify it).
      • When typing a command, use the "tab" key to autofill names,
        i.e. cd pu<TAB> should fill in the rest.
      • If it does not, provide a few more characters (you may have 2 directories starting with pu: public_html and put_results).
  • 15. Exercise 2: Working with seq.
      • I have approx ???? seq in ?? files
      • Located in the directory /home/student/2005/eeb/exercise-2/
      • Check your quota (quota -v)
      1. Copy them to your account
           cp -r /home/student/2005/eeb/exercise-2 .
         (Use the tab key for autofill and remember the space between options)
      2. Count how many files we have (hint: use ls and wc; use ls and cd to get to the correct directory)
      3. Check your quota and disk space (du -k)
      4. Count how many sequences we have
           cat *.tfa | grep ">" | wc -l
  • 16. Exercise 2: Cont
      1. Gzip all files (gzip -v *.tfa); -v provides a progress report
      2. Check disk usage
      3. Use ls | more to see the files
      4. Use gzcat filename | more
      5. Count the number of sequences in 1-bacteria.tfa.gz
      6. Gunzip all files (gunzip *.gz)
      7. Combine 1-bacteria.tfa and 2-bacteria.tfa into new.tfa (hint: use cat and >)
      8. Count the number of sequences in new.tfa
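Counting sequences works the same way on plain and compressed files. The sketch below uses a toy `1-bacteria.tfa` with three invented records, and `gzip -dc` in place of `gzcat`, since `gzcat` is not installed under that name on every system.

```shell
#!/bin/sh
# Sketch: count FASTA records before and after compression.
workdir=$(mktemp -d)
printf '>b1 first\nACGT\n>b2 second\nGGCC\nTTAA\n>b3 third\nAT\n' \
    > "$workdir/1-bacteria.tfa"

# Quote the pattern: an unquoted > would be treated as a redirection.
# The ^ anchor only counts lines that start with ">".
plain=$(grep -c '^>' "$workdir/1-bacteria.tfa")

gzip "$workdir/1-bacteria.tfa"
# gzip -dc writes the decompressed text to stdout and leaves the .gz untouched.
packed=$(gzip -dc "$workdir/1-bacteria.tfa.gz" | grep -c '^>')

echo "plain: $plain, compressed: $packed"
rm -rf "$workdir"
```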
  • 17. Using idfetch
      • This program gets sequences from GenBank based on the sequence ID
      • The options can be viewed by typing idfetch -
      • We will be using -t 5 (save as FASTA format)
      • -G to read the list from a file
      • -o to save the output to a file (you can also use >)
  • 18. Exercise with idfetch
      • Copy the file id.list (it contains GenBank IDs):
          cp /home/student/exercise/1/id.list .
      • Use idfetch to get the sequences:
          idfetch -t 5 -G id.list -o seq.out
      • Use more to look at the output:
          more seq.out
          Space       moves down 1 page
          Enter       moves down 1 line
          /word       jumps to the next occurrence of word
          n           repeats the word search
          Q or Control-C exits more
      • Experiment with different values for the -i option (hint: remove -o seq.out and use idfetch - to see all options)
  • 19. Setting up your Editor (Nedit)
      • www.nedit.org
      • Next step is already done for you
      • Set your preferences (syntax highlighting, line numbers)
      • Save defaults
      • Exit
      • Restart
  • 20. Nedit (File dialog box)
      • Ignore everything with .
      • Double-click on a directory, or select it with the mouse and use the "enter" key
      • What are . and .. ?
      • Use the filter if you have many files ( *.pl )
      • Select the file to edit/open with the mouse (it should have a black background), then click OK
      • Save (Control-s) and Save As
  • 21. EMBOSS
      • www.emboss.org
      • European Molecular Biology Open Software Suite (now hosted at SourceForge)
      • Free, open-source software analysis package specially developed for the needs of the molecular biology community
      • Provides a comprehensive set of sequence analysis programs (approximately 100+)
      • Current version 3.0, July 2005
  • 22. EMBOSS (programs)
      • Integrates other publicly available packages
      • Can be accessed through BioPerl modules (easy automation)
      • Sequence alignment
      • Rapid database searching with sequence patterns
      • Protein motif identification, including domain analysis
      • Nucleotide sequence pattern analysis, for example to identify CpG islands or repeats
      • Codon usage analysis for small genomes
      • Rapid identification of sequence patterns in large-scale sequence sets
      • Database creation/indexing
  • 23. Interacting with EMBOSS
      • EMBOSS programs are run by typing their names at the UNIX prompt (in your Xterminal), with or without parameters/options
      • EMBOSS command syntax follows normal UNIX command conventions
      • A program will prompt you for any parameters not provided when invoking it
      • When in doubt, use:
          program_name -help   (seqret -help)
          tfm program_name     (tfm seqret)
      • Use wossname to search for a program by keyword
  • 24. Sequence Formats
      • Sequence formats: FASTA, GenBank, EMBL, SwissProt, PIR
      • FASTA format:
          >Seq_Name description and some other comment
          ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcagctctttgtcgcgcccagg
          agctacacac
      • IDs and accessions
          • The ID was human readable, and the name suggested function etc.
          • Accession numbers are database assigned (nowadays they are the same as the ID)
          • ID 'hsfau' is the 'Homo Sapiens FAU pseudogene'; its accession # is X65923 (sometimes Accession.1 for the version)
      • Multiple sequences per file
      • No connection between file name and ID
      • GFF and reports (covered later on)
  • 25. EMBOSS & USA
      • USA (Uniform Sequence Address)
          • "format::file"
          • "format::file:entry"
          • "dbname:entry" (we don't have this configured)
          • "@listfile" (a file of file names; ls *.seq > mylist)
      • The format is not required when reading in a sequence; EMBOSS will guess the sequence format by trying all known formats until one succeeds
      • When writing out a sequence, EMBOSS uses FASTA format by default. You can specify another format to use:
          gcg::myresults.seq
          embl::myresults.out
  • 26. Additional Help With UNIX and Perl
      • http://uacbt.arizona.edu
  • 27. Programming Process
      • When asked to develop something, look around before you reinvent the wheel
      • Requirement analysis: what input, output, formats, source for data, frequency of update, etc.
      • Design phase (how and what to use)
      • Flow charts (for logic and data), UML, use cases http://www.uml.org/
      • Pseudocode
          get filename
          open file and read sequences
          for each sequence
              if length is greater than 100 Kb
                  print error msg
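The pseudocode above can be turned into a short shell script that leans on awk. This is one possible sketch, not the course's solution: the input file is a toy example, and `max_len` is set to a hypothetical 10 bases so the check fires on small data (the slide's limit is 100 Kb).

```shell
#!/bin/sh
# Sketch: report every FASTA sequence longer than a limit.
workdir=$(mktemp -d)
fasta="$workdir/demo.tfa"                 # "get filename"
printf '>short\nACGT\nACGT\n>long\nACGTACGTAC\nGT\n' > "$fasta"
max_len=10                                # illustrative limit, not 100 Kb

report=$(awk -v max="$max_len" '
    # Print an error for the record just finished, if it is too long.
    function check() {
        if (name != "" && len > max)
            printf "ERROR: %s is %d bases (limit %d)\n", name, len, max
    }
    /^>/ { check(); name = substr($1, 2); len = 0; next }   # new record
         { len += length($0) }                              # sequence line
    END  { check() }                                        # last record
' "$fasta")

echo "$report"
rm -rf "$workdir"
```

Sequence lines are summed per record because a single FASTA sequence is normally wrapped over several lines, so checking line length alone would miss long sequences.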
  • 28. Programming …
      • Now start coding
      • Always comment your code
      • Use version control: filename.1 etc. for small projects, CVS http://www.cvshome.org/
      • Code has to be human readable but machine parseable!
      • Test and debug code using different input scenarios
      • Don't feel shy about using paper and pencil ... it's easier at times
      • http://www.eecs.wsu.edu/c/programm.htm
  • 29. Introduction to Perl
      • Invoking Perl
      • Basic input/output: STDIN, STDOUT, print and writing to files, sockets
      • Variables: scalar data
          • Numbers: 12, 12e5, -12.534
          • Strings: "who likes Austin Powers?"
          • Operators: +, -, <, > =
      • Flow control: if, while, for, foreach
      • Arrays
  • 30. Invoking Perl
      • First line of a Perl program: #!/usr/local/bin/perl
      • # by itself starts a comment, i.e. the rest of the line is not interpreted.
      • It is important to comment your code!!!
          #!/usr/local/bin/perl
          # Program by Baha Men (Sept 09, 2004)
          print "Who let the dogs out?\n";
          # The above line outputs to screen
          # the (only) famous song by the group
  • 31. Variables
      • A variable is something that stores values while your program is running
      • You can set initial values of variables and modify these values as the program executes
      • No need to predefine
      • Automatically get global scope *
      • You can store numbers and text in variables (note the " " for text)
          $a = 1;
          $z = $a + 3.1412653505;
          $b = "I put the cat out";
          $gene_name = "C127899.1";
  • 32. Arithmetic Operators
          $a = 1 + 2;    # Add 1 and 2 and store in $a
          $a = 3 - 4;    # Subtract 4 from 3 and store in $a
          $a = 5 * 6;    # Multiply 5 and 6
          $a = 7 / 8;    # Divide 7 by 8 to give 0.875
          $a = 9 ** 10;  # Nine to the power of 10
          $a = 5 % 2;    # Remainder of 5 divided by 2
          $a++;          # Increment $a by 1
          $a--;          # Decrement $a by 1
          if ($a <= 2)   # Less than or equal
  • 33. String Operator
          $b = "Hello"; $c = "World";
          $a = $b . $c;   # Concatenate $b and $c
          print $a;       # This is HelloWorld
      • We can do the same using
          print "$b $c from me\n";
          # This will print Hello World from me
          # followed by a newline
      • Difference between ' and " covered later
      • \n is newline; it starts a new line
      • \t is the tab character, i.e. Hello	World
  • 34. Testing Values
          $a == $b   # Is $a numerically equal to $b?
                     # Beware: don't use the = operator.
          $a != $b   # Is $a numerically not equal to $b?
          $a eq $b   # Is $a string-equal to $b?
          $a ne $b   # Is $a string not equal to $b?
      • Use == for numbers and eq for strings
  • 35. Flow Control
      • for
          for (initialize; test; increment) { first_action; second_action; etc }
        Example:
          for ($count = 0; $count < 10; $count++) {
              print "Count is = $count\n";
          }
      • while
          while (condition) { first_action; second_action; etc }
        Example:
          while ($president ne "Nader") {
              print "Try again\n";      # Ask again
              $president = <STDIN>;     # Get input
              chomp $president;         # Chop off newline
          }
      • if
          if (condition) { first_action; second_action }
  • 36. Arrays
      • An array variable is a list of scalars (i.e. numbers and/or strings).
      • Same format as a scalar except that the name starts with @, i.e. @names is an array while $names is a scalar
          @names = ("Hillary", "Dick", "Ralph");
          @party = ("Democrat", "Republican", "Green");
      • Array data can be referenced using the index number, which starts from 0:
        $names[0] is Hillary and $party[1] is Republican
      • You can set values using $names[3] = "Pat";
  • 37. I get the picture ... just get on with it!
      • Your first program: hello.pl
      • Create directory prog1 and save files there
      • Print hello world (that's too easy)
      • Ask the user for a name
      • Greet the user
      • Ask the user for a password
      • If it matches the password yahoo then greet, else boot
      • You can type perl hello.pl, or chmod u+x hello.pl and run it as ./hello.pl (remember to cd prog1)
  • 38. Your program
          #!/usr/local/bin/perl
          # Customary first line
          print "Please enter your name: ";
          # Prompts the user to type and hit enter
          $name = <STDIN>; chomp($name);   # read from keyboard and remove newline
          print "Hello $name please give me secret password: ";
          $password = <STDIN>;
          chomp($password);                # without this the trailing newline makes eq fail
          # Now compare it to hidden password
          if ($password eq "yahoo") {
              print "Welcome buddy\n";
          }
          else {
              print "Bite Me: $password is invalid\n";
          }
  • 39. Next Class
      • Bring ideas/questions related to your final project
      • NCBI databases
      • NCBI e-utils
      • More Perl and BioPerl
