BITS: Introduction to Linux - Text manipulation tools for bioinformatics



This slide is part of the BITS training session: "Introduction to linux for life sciences."


Published in: Education, Technology


  1. 1. Linux for Bioinformatics <ul><li>Navigator </li></ul><ul><li>The Shell </li></ul><ul><li>I/O redirection & pipes </li></ul><ul><li>Text, text & text </li></ul><ul><li>Misc </li></ul>BITS/VIB Bioinformatics Training – version 2 – Joachim Jacob Oct 2011 – Luc Ducazu
  2. 2. Schedule <ul><li>Today we will only work with the command line. You won't be able to remember all of the tools we will see today, therefore a quick reference nearby is indispensable! </li></ul><ul><li>TIP : have some kind of notes on your computer to quickly store and find commands ! </li></ul>
  3. 3. GOAL <ul><li>The main goal: </li></ul><ul><ul><li>To help you easily use command line tools </li></ul></ul><ul><ul><li>To help you easily automate repetitive tasks </li></ul></ul><ul><ul><li>To help you easily parse/summarize outputs, which is mainly text </li></ul></ul>
  4. 4. Connecting to Linux <ul><li>Start up your machine and log in </li></ul><ul><li>OR </li></ul><ul><li>Remote connection (e.g. on a departmental server) </li></ul><ul><ul><li>$ ssh [email_address] </li></ul></ul><ul><ul><li>-> The sysadmin should have made you an account </li></ul></ul><ul><ul><li>-> you are prompted for your password </li></ul></ul><ul><li>On Windows, install PuTTY to connect </li></ul>
  5. 5. And there we are... username Machine name location
  6. 6. Navigation <ul><li>When you open a terminal or log in to a server, the default current working directory is your ' home directory ': /home/james </li></ul><ul><li>The prompt usually shows your current working directory, but this is not always the case. To show your current working directory, use pwd ( print working directory ): $ pwd /home/james </li></ul>
  7. 7. The File System Tree / bin boot dev etc home media root sbin tmp usr var james bin sbin share bin sbin share local lib log mail run spool tmp Aka ~ (for james) The variable PATH contains the paths of the bin folders. See command env.
  8. 8. UNIX philosophy <ul><li>'Everything is a file' : </li></ul><ul><ul><li>Commands and scripts (stored in directories named bin) </li></ul></ul><ul><ul><li>Configuration files in plain text (most in folder /etc) </li></ul></ul><ul><ul><li>Devices ( /dev ) (here USB disks, webcam, etc.) </li></ul></ul><ul><ul><li>Interaction with the Linux kernel ( /proc ) (here you can set/read system settings) </li></ul></ul>
  9. 9. Names of files & directories <ul><li>In UNIX, names of files and directories are case sensitive </li></ul><ul><li>Some characters have a special meaning to the shell: spaces, |, <, >, *, ?, [, ], /, \, ... You can use some of them in names, but they have to be hidden (escaped) from the shell </li></ul><ul><li>In UNIX there is no such thing as file extensions : </li></ul><ul><ul><li>commands are marked executable via permissions </li></ul></ul><ul><ul><li>files are recognized based on content </li></ul></ul><ul><li>Files and directories share the same name space </li></ul>
  10. 10. Navigation in the shell <ul><li>You change the current working directory using cd ( change directory ): $ cd dir </li></ul><ul><li>Absolute paths start with /, relative paths don't ( relative to the current working directory) $ cd /home $ pwd /home $ cd james $ pwd /home/james </li></ul>
  11. 11. Navigation <ul><li>Shortcuts: </li></ul><ul><ul><li>navigate to your home directory $ cd $ cd ~ </li></ul></ul><ul><ul><li>navigate up the file system tree $ cd .. </li></ul></ul><ul><ul><li>navigate to the previous current working directory $ cd - </li></ul></ul><ul><li>Example: go two directories up: </li></ul><ul><ul><li>$ cd ../.. </li></ul></ul>
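The shortcuts on this slide can be tried safely in a throwaway directory. The sketch below is only an illustration: the `sandbox` variable and the `project/data/raw` tree are hypothetical names made up for the demo, not part of the course material.

```shell
# Hypothetical sandbox tree, created only for this demo.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/project/data/raw"

cd "$sandbox/project/data/raw"
cd ../..                      # two levels up: now in .../project
two_up=$(basename "$PWD")     # basename strips the leading path

cd data                       # relative path, resolved against the current directory
cd - > /dev/null              # back to the previous directory (cd - prints it, so silence that)
back=$(basename "$PWD")

echo "$two_up $back"
```

Both variables end up holding `project`: `cd ../..` climbs two levels, and `cd -` undoes the last `cd`.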
  12. 12. Navigation <ul><li>TIP </li></ul><ul><ul><li>- store the current directory where you are: </li></ul></ul><ul><ul><li>$ pushd . </li></ul></ul><ul><ul><li>View the current and the stored directories </li></ul></ul><ul><ul><li>$ dirs </li></ul></ul><ul><ul><li>Go back to the stored directory </li></ul></ul><ul><ul><li>$ popd </li></ul></ul><ul><ul><li>Example: </li></ul></ul><ul><ul><li>joachim@joalap:~$ pushd . </li></ul></ul><ul><ul><li>~ ~ </li></ul></ul><ul><ul><li>joachim@joalap:~$ cd /opt </li></ul></ul><ul><ul><li>joachim@joalap:/opt$ dirs </li></ul></ul><ul><ul><li>/opt ~ </li></ul></ul><ul><ul><li>joachim@joalap:/opt$ popd </li></ul></ul><ul><ul><li>~ </li></ul></ul><ul><ul><li>joachim@joalap:~$ pwd </li></ul></ul><ul><ul><li>/home/joachim </li></ul></ul>
  13. 13. Exercise <ul><li>Suppose you are logged in as user james . Navigate to the following waypoints, in the given order, via the shortest route /usr/local /usr/local/bin /usr/local/share/man /usr/local/bin /home/james/Documents /root </li></ul><ul><li>/usr/local/bin </li></ul>
  14. 14. Exploration <ul><li>To show the content of a given directory: $ ls </li></ul><ul><li>$ ls -l (long listing), often available as the alias </li></ul><ul><li>$ ll </li></ul><ul><li>The file type is the first character in the output of ls -l : </li></ul><ul><li>$ ls -l </li></ul><ul><li>-rwxr-xr--. 1 james users 357 Sep 5 21:36 clusterit.gz </li></ul><ul><ul><li>- : ordinary file </li></ul></ul><ul><ul><li>d : directory </li></ul></ul><ul><ul><li>l : symbolic link </li></ul></ul>
  15. 15. Exploration <ul><li>The tool you use to identify files, based on their content: $ file file(s) </li></ul><ul><li>Example: $ ls unix-history zless $ file * unix-history: PNG image data, 1000 x 636, 8-bit/color RGBA, non-interlaced zless: POSIX shell script text executable </li></ul>
  16. 16. Exploration <ul><li>Tools for viewing text files: </li></ul><ul><ul><li>cat (show content in terminal, at once) </li></ul></ul><ul><ul><li>less , more (page by page) </li></ul></ul><ul><ul><li>head , tail (view only first/last lines) </li></ul></ul><ul><li>Tools for viewing binary files: </li></ul><ul><ul><li>strings , hexdump </li></ul></ul><ul><ul><li>file specific viewers (images, PDF) </li></ul></ul>
  17. 17. Exploring text files <ul><li>To show the content of one or more text files: $ cat file(s) </li></ul><ul><li>To show the content page by page: $ more file(s) $ less file(s) more is not as flexible as less , but it is a universal UNIX utility </li></ul><ul><li>To show the first / last nn lines of a text file: $ head -n nn file $ tail -n nn file Default number of lines is 10 </li></ul>
  18. 18. Conquest of the file system <ul><li>Working with directories: </li></ul><ul><ul><li>Navigation: cd , pwd </li></ul></ul><ul><ul><li>Manipulation: mkdir , rmdir </li></ul></ul><ul><li>Working with files: </li></ul><ul><ul><li>Creating files: touch , nano </li></ul></ul><ul><ul><li>Removing files: rm </li></ul></ul><ul><ul><li>Copying files: cp </li></ul></ul><ul><ul><li>Moving / renaming files: mv </li></ul></ul>
  19. 19. Creating directories <ul><li>To create one or more directories: $ mkdir options dir(s) </li></ul><ul><li>Interesting options: </li></ul><ul><ul><li>-m : mode – permissions of the new directory </li></ul></ul><ul><ul><li>-p : parents – create missing parent directories as needed </li></ul></ul>
  20. 20. Creating directories <ul><li>Suppose you want to create directory /tmp/lvl1/lvl2 : $ mkdir /tmp/lvl1/lvl2 mkdir: cannot create directory ... </li></ul><ul><li>One possible solution: $ mkdir /tmp/lvl1 /tmp/lvl1/lvl2 </li></ul><ul><li>A more elegant solution: $ mkdir -p /tmp/lvl1/lvl2 </li></ul><ul><li>Example 2: </li></ul><ul><li>$ mkdir -p cgi/{data,src,ref,local/{bin/cgatools,share/cgatools-1.4.0/doc}} </li></ul><ul><li>$ tree cgi </li></ul>
  21. 21. Removing directories <ul><li>To remove one or more directories: $ rmdir options dir(s) </li></ul><ul><li>rmdir removes empty directories only (are there any hidden files left ?) </li></ul><ul><li>You can remove complete subtrees using: $ rm -rf dir ! Pay attention when you execute this command as root – UNIX is not particularly merciful. </li></ul>
  22. 22. Creating a file <ul><li>To create one or more files: $ touch file(s) </li></ul><ul><li>When file does not yet exist, it is created: </li></ul><ul><ul><li>ordinary file </li></ul></ul><ul><ul><li>owner is the user that enters the command </li></ul></ul><ul><ul><li>group is the primary group of the owner </li></ul></ul><ul><ul><li>permissions depend on umask </li></ul></ul><ul><ul><li>0 bytes file size </li></ul></ul><ul><li>When the file already exists, its access , modification and change times are updated. </li></ul>
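A small sketch of both behaviours of touch, run in a temporary directory. The file name `notes.txt` and the variable names are hypothetical, chosen only for this illustration.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"

touch notes.txt                       # file does not exist yet: created empty
if [ ! -s notes.txt ]; then           # -s is true only for non-empty files
    state="empty"
fi
ls -l notes.txt                       # shows owner, group, permissions and size 0

echo "hello" > notes.txt
touch notes.txt                       # file exists: only the timestamps change
content=$(cat notes.txt)              # the content is left untouched
echo "$state / $content"
```

The second touch does not truncate the file; it only refreshes the time stamps.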
  23. 23. Creating a file <ul><li>You can use a text editor as an alternative way of creating files </li></ul><ul><li>Terminal: </li></ul><ul><ul><li>vi , emacs </li></ul></ul><ul><ul><li>pico , nano </li></ul></ul><ul><ul><li>joe </li></ul></ul><ul><li>GUI </li></ul><ul><ul><li>gvim , xemacs </li></ul></ul><ul><ul><li>gedit </li></ul></ul><ul><ul><li>geany </li></ul></ul>
  24. 24. Copying files <ul><li>To copy file(s) from src to dst : $ cp options src dst ! Both arguments, src and dst , are mandatory </li></ul><ul><li>When src is a single file , dst can be either a directory or the name of a (possibly existing) file : $ cp /etc/passwd . $ cp /etc/passwd /tmp/userdb </li></ul><ul><li>When src is a collection of files, dst must be a directory: $ cp * /tmp </li></ul>
  25. 25. Copying files <ul><li>Some interesting options: </li></ul><ul><ul><li>-v: be verbose </li></ul></ul><ul><ul><li>-i: interactive – asks whether existing files may be overwritten or not </li></ul></ul><ul><ul><li>-f: force – no questions asked </li></ul></ul><ul><ul><li>-r/-R: recursion – copy directories and their subdirectories as well </li></ul></ul><ul><ul><li>-p: preserve permissions </li></ul></ul><ul><ul><li>-a: archive – combines a few options like -p and -r </li></ul></ul><ul><li>Example: $ cp -r cgi /opt </li></ul>
  26. 26. Moving / renaming files <ul><li>To move file(s) from src to dst : $ mv options src dst ! Both arguments, src and dst , are mandatory </li></ul><ul><li>When src is a single file , dst can be either a directory or the name of a (possibly existing) file: $ mv /tmp/download/clusterit.tgz . $ mv clusterit2.3.tar.gz clusterit.tgz </li></ul><ul><li>When src is a collection of files, dst must be a directory: $ mv * /tmp </li></ul>
  27. 27. Removing files <ul><li>To remove file(s) : $ rm options file(s) </li></ul><ul><li>Interesting options: </li></ul><ul><ul><li>v: be verbose </li></ul></ul><ul><ul><li>i: interactive – asks whether or not you are really, really sure about this command </li></ul></ul><ul><ul><li>f: force – no questions asked </li></ul></ul><ul><ul><li>r: recursive - delete subdirectories as well </li></ul></ul><ul><li>Complete subtrees can be removed like this: $ rm -rf dir </li></ul>
  28. 28. Remotely copying files <ul><li>If you are logged in on another machine, e.g. </li></ul><ul><ul><li>$ ssh [email_address] </li></ul></ul><ul><li>and you have been working there (creating folders and files, running programs), you can copy the results to your own machine using scp </li></ul><ul><ul><li>$ scp [email_address] :/home/bits/sample.sam . </li></ul></ul>Command to copy username machine Path to the remote folder / file you want to copy To the local current dir
  29. 29. Exercise <ul><li>Get the TAIR9_mRNA.bed file from as in previous example. Copy also the recut file from there. Your account: bits , password: b!t$fortraining </li></ul><ul><li>The file sample.sam is available on our website, under following link: </li></ul><ul><li>Search a command to download directly in the terminal (hint: use apropos). </li></ul><ul><li>Create a folder bioinfo/data and bioinfo/bin and move recut to bin and bed/sam files to data </li></ul>
  30. 30. Solutions <ul><ul><li>$ scp [email_address] :/home/bits/TAIR9_mRNA.bed . </li></ul></ul><ul><ul><li>$ scp [email_address] :/home/bits/recut . </li></ul></ul><ul><ul><li>$ apropos download </li></ul></ul><ul><ul><li>$ wget </li></ul></ul><ul><ul><li>$ mkdir -p bioinfo/{data,bin} </li></ul></ul><ul><ul><li>$ mv *.sam bioinfo/data </li></ul></ul><ul><ul><li>$ mv *.bed bioinfo/data </li></ul></ul><ul><ul><li>$ mv recut bioinfo/bin </li></ul></ul>
  31. 31. Exercise <ul><li>Create the file /tmp/me using an editor (eg nano ) with the following content: </li></ul><ul><li>#!/bin/sh echo No-one messes with $USER ! </li></ul><ul><li>What kind of file is this? What could you do with this file? </li></ul><ul><li>Create directory ex in your home directory </li></ul><ul><li>Copy the file /tmp/me into this directory </li></ul><ul><li>Verify its content </li></ul><ul><li>Remove the file /tmp/me </li></ul><ul><li>Remove directory ~/ex </li></ul>
  32. 32. Executables <ul><li>Come in two flavours: </li></ul><ul><ul><li>Scripts </li></ul></ul><ul><ul><li>Binaries </li></ul></ul><ul><li>Execute permissions must be set; in most cases 755 will do. </li></ul><ul><li>Scripts usually start with the shebang line, telling the shell which interpreter to use. E.g. </li></ul><ul><ul><li>#!/usr/bin/perl </li></ul></ul><ul><li>Running executables (with prg the file name) </li></ul><ul><ul><li>$ ./ prg </li></ul></ul><ul><ul><li>$ bash prg </li></ul></ul><ul><ul><li>$ perl prg .pl </li></ul></ul>
  33. 33. Exercise <ul><li>The file recut you have downloaded is a program. Look at the first lines of the program. </li></ul><ul><li>Check with the file command what kind of file it is. </li></ul><ul><li>Set the permissions so the file is executable: you can do this graphically, or by typing: </li></ul><ul><li>$ chmod a+x recut </li></ul><ul><li>The above syntax translated: 'change mode ( chmod ) so that everybody ( a ) gets ( + ) execute permissions ( x ) on the file ( recut )' </li></ul><ul><li>Execute the program. </li></ul><ul><li>Can you check the file /bin/ls with the file command? </li></ul>
  34. 34. Advanced <ul><li>When you log in on another Linux box, all processes (programs) you start are terminated when the connection is closed (… possibly by accident). </li></ul><ul><li>To avoid this, you can use 'no hang up' or screen: </li></ul><ul><li>$ nohup prg -options argument & </li></ul><ul><li>or </li></ul><ul><li>$ screen </li></ul><ul><li>$ prg -options argument </li></ul><ul><li>Howto for screen: </li></ul><ul><li> </li></ul>
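The steps of this exercise (inspect the first line, make the file executable, run it) can be rehearsed on a tiny script of your own. This is a sketch under assumptions: `hello.sh` and its content are hypothetical stand-ins, since the recut program itself is not reproduced here.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"

# Write a two-line script; the first line is the shebang.
printf '%s\n' '#!/bin/sh' 'echo "Hello from a script"' > hello.sh

shebang=$(head -n 1 hello.sh)   # look at the first line, as the exercise asks
chmod a+x hello.sh              # everybody (a) gets (+) execute permission (x)
output=$(./hello.sh)            # ./ is needed because the current dir is not in PATH

echo "$shebang"
echo "$output"
```

Without the chmod step, `./hello.sh` would fail with a "Permission denied" error, although `sh hello.sh` would still work because then sh is the program being executed.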
  35. 35. Which programs are running
  36. 36. Which programs are running
  37. 37. Which programs are running
  38. 38. Which programs are running
  39. 39. Which programs are running
  40. 40. I/O redirection of terminal programs <ul><li>When a program is launched, 3 channels are opened: </li></ul><ul><ul><li>stdin : an input channel (default keyboard) </li></ul></ul><ul><ul><li>stdout : channel used for functional output (*) (screen) </li></ul></ul><ul><ul><li>stderr : channel used for error reporting (*) (screen) </li></ul></ul><ul><li>In UNIX, open files have an identification number called a file descriptor </li></ul><ul><ul><li>0 -> stdin </li></ul></ul><ul><ul><li>1 -> stdout </li></ul></ul><ul><ul><li>2 -> stderr </li></ul></ul><ul><li>(*) by convention </li></ul><ul><li> </li></ul>
  41. 41. I/O redirection <ul><li>Under default circumstances, these channels are connected to the terminal on which the program was launched </li></ul><ul><li>The shell offers a possibility to redirect any of these channels to: </li></ul><ul><ul><li>a file </li></ul></ul><ul><ul><li>a device </li></ul></ul><ul><ul><li>another program (pipe) </li></ul></ul>
  42. 42. I/O redirection <ul><li>When cat is launched without any arguments, the program reads from stdin (keyboard) and writes to stdout (terminal) </li></ul><ul><li>Example: $ cat type: DNA: National Dyslexia Association ↵ result: DNA: National Dyslexia Association You can stop the program using the ' End Of Input ' character CTRL-D </li></ul>
  43. 43. Input redirection <ul><li>A program's ' stdin ' input can be connected to a file (or device), by doing: $ cat 0< file or short: $ cat < file </li></ul><ul><li>or even shorter: $ cat file </li></ul><ul><li>Example: </li></ul><ul><ul><li>$ grep '@' < sample.sam </li></ul></ul><ul><ul><li>$ mail -s Goodbye < C4.txt </li></ul></ul>Emailing tool Option to set the subject recipient Content is read from file
  44. 44. Output redirection <ul><li>The ' stdout ' output of a program can be saved to a file (or device): $ cat 1> file or short: $ cat > file </li></ul><ul><li>Example: # ls -lR / > /tmp/ls-lR </li></ul><ul><li># less /tmp/ls-lR </li></ul>
  45. 45. Output redirection <ul><li>IMPORTANT: if you redirect to an existing file, its contents are replaced by the output. </li></ul><ul><li>To append to file , you use: $ cat 1>> file or short $ cat >> file </li></ul>
  46. 46. Error redirection <ul><li>The ' stderr ' output of a program can be saved to a file (or device): Create or truncate file : $ cat 2> file Append to file : $ cat 2>> file </li></ul>
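The three redirection forms (truncate, append, stderr) can be seen side by side in a short sketch. The file names `exists.txt`, `missing.txt`, `out.log` and `err.log` are hypothetical and only serve the demo.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
touch exists.txt

# ls lists exists.txt on stdout and complains about missing.txt on stderr;
# each channel goes to its own file. '|| true' ignores the non-zero exit code.
ls exists.txt missing.txt > out.log 2> err.log || true

echo "second run" >> out.log       # >> appends instead of truncating
out_lines=$(wc -l < out.log)       # stdin redirection: wc prints only the number

ls missing.txt 2> /dev/null || true   # errors discarded entirely

echo "out.log has $out_lines lines"
cat err.log
```

After the run, `out.log` holds the listing plus the appended line, while the error message sits in `err.log` and never reaches the terminal.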
  47. 47. Special devices <ul><li>For input: </li></ul><ul><ul><li>/dev/zero all zeros </li></ul></ul><ul><ul><li>/dev/urandom (pseudo) random numbers </li></ul></ul><ul><li>For output: </li></ul><ul><ul><li>/dev/null 'bit-heaven' </li></ul></ul><ul><li>Example: You are not interested in the errors from the command cmd : $ cmd 2> /dev/null </li></ul>
  48. 48. Playing with out- and input: pipes <ul><li>The output of one program can be fed as input to another program </li></ul><ul><li>Example: </li></ul><ul><li>$ ls -lR ~ > /tmp/ls-lR </li></ul><ul><li>$ less /tmp/ls-lR ('Q' to quit less) </li></ul><ul><li>can be shortened to: $ ls -lR ~ | less </li></ul><ul><li>The stdout channel of ls is connected to the stdin channel of less </li></ul><ul><li>You are not restricted to 2 programs, a pipe can span many programs, each separated by | </li></ul>
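As an illustration of a pipe spanning three programs, the sketch below counts the `.txt` entries in a directory listing. The directory and file names are hypothetical.

```shell
sandbox=$(mktemp -d)
touch "$sandbox/a.txt" "$sandbox/b.txt" "$sandbox/c.log"

# ls writes names to stdout, grep keeps the ones ending in .txt,
# wc -l counts the lines that are left.
txt_count=$(ls "$sandbox" | grep '\.txt$' | wc -l)

echo "$txt_count"
```

No intermediate file is written: each program reads the previous program's stdout directly on its stdin.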
  49. 49. Compression of files <ul><li>Widely used compression tools: </li></ul><ul><ul><li>GNU zip ( gzip ) </li></ul></ul><ul><ul><li>Block Sorting compression ( bzip2 ) </li></ul></ul><ul><li>Typically, compression tools work on one file! </li></ul><ul><li>That's why an archive is first created with tar , and this archive is then compressed. </li></ul>
  50. 50. tar <ul><li>tar ( tape archive ) is a tool you use </li></ul><ul><ul><li>to bundle a set of files into a single archive - ideal for data exchange </li></ul></ul><ul><ul><li>to extract files from a tar ball </li></ul></ul><ul><li>Syntax to create a tar $ tar -cf archive.tar file1 file2 </li></ul><ul><li>Syntax to extract $ tar -xvf /path/to/archive.tar </li></ul><ul><li>Options: </li></ul><ul><ul><li>c : create archive </li></ul></ul><ul><ul><li>x : extract archive </li></ul></ul><ul><ul><li>v : be verbose (show file names) </li></ul></ul><ul><ul><li>f : specify the archive file (- for stdin ) </li></ul></ul>
  51. 51. Compression <ul><li>To compress one or more files: $ gzip [ options ] file $ bzip2 [ options ] file </li></ul><ul><li>Options: </li></ul><ul><ul><li>c: send output to stdout instead of overwriting the specified file(s) </li></ul></ul><ul><ul><li>1 or --fast: fast / minimal compression </li></ul></ul><ul><ul><li>9 or --best: slow / maximal compression </li></ul></ul><ul><li>Standard extensions: </li></ul><ul><ul><li>gzip .gz </li></ul></ul><ul><ul><li>bzip2 .bz2 </li></ul></ul>
  52. 52. Decompression <ul><li>To decompress one or more files: $ gunzip [ options ] file(s) $ bunzip2 [ options ] file(s) </li></ul><ul><li>To decompress a tar.gz or tar.bz2 </li></ul><ul><li>$ tar xvfz file.tar.gz </li></ul><ul><li>$ tar xvfj file.tar.bz2 </li></ul><ul><li>The following tools can read directly from gzip or bzip2 files (*.bz2 or *.gz) $ zcat file(s) $ bzcat file(s) </li></ul>
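A round trip through tar and gzip, as a minimal sketch. The `project` directory and its content are hypothetical; the flags used (`-c`, `-x`, `-t`, `-z`, `-f`) are standard tar options.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
mkdir project
echo "ACGT" > project/seq.txt

tar -cf project.tar project          # bundle the directory into one archive
gzip project.tar                     # replaces project.tar by project.tar.gz
rm -r project                        # remove the original to prove the restore

listing=$(tar -tzf project.tar.gz)   # t: list the contents without extracting
tar -xzf project.tar.gz              # z: gunzip on the fly while extracting
restored=$(cat project/seq.txt)

echo "$listing"
echo "$restored"
```

Note that gzip removes the uncompressed `project.tar`; use `gzip -c project.tar > project.tar.gz` if you want to keep both.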
  53. 53. Text tools <ul><li>UNIX has a liberal use for text files: </li></ul><ul><ul><li>Databases: users, groups, hosts, services </li></ul></ul><ul><ul><li>Configuration files </li></ul></ul><ul><ul><li>Log files </li></ul></ul><ul><ul><li>Many commands are scripts (shell, perl, python) </li></ul></ul><ul><li>UNIX has an extensive toolkit for text extraction, reporting and manipulation: </li></ul><ul><ul><li>Extraction: head , tail , grep , awk , uniq </li></ul></ul><ul><ul><li>Reporting: wc </li></ul></ul><ul><ul><li>Manipulation: dos2unix , sort , tr , sed </li></ul></ul><ul><li>The UNIX text tool: perl </li></ul>
  54. 54. Exchanging text files <ul><li>UNIX and Windows differ in the ways line endings are marked. Sometimes text tools can get confused by Windows line endings. </li></ul><ul><li>To convert between the two formats, use: $ dos2unix file or $ dos2unix -n dosfile unixfile </li></ul><ul><li>In the first case, file is overwritten. </li></ul><ul><li>In the second case a new ( -n ) file unixfile is created, leaving dosfile untouched </li></ul>
  55. 55. Regular expressions <ul><li>A Regular expression , aka regex or re , is a formal way of describing sets of strings </li></ul><ul><li>Many UNIX tools use regular expressions, including perl , grep and sed </li></ul><ul><li>The topic itself is beyond the scope of this introduction to Linux, but the gentle reader is strongly encouraged to read more about this subject. Here is a starter: $ man 7 regex </li></ul><ul><li>To keep things manageable, we will only use literal strings as regexes </li></ul>
  56. 56. grep <ul><li>grep is used to extract lines from an input stream that match (or don't match) a regular expression </li></ul><ul><li>Syntax: $ grep [ options ] regex [ file(s) ] </li></ul><ul><li>The file(s) (or if omitted stdin ) are read line by line. If the line matches the given criteria, the entire line is written to stdout. </li></ul>
  57. 57. grep <ul><li>Interesting options: </li></ul><ul><ul><li>-i : ignore case – match the regex case-insensitively </li></ul></ul><ul><ul><li>-v : inverse – show all lines that do not match the regex </li></ul></ul><ul><ul><li>-l : list – show only the names of the files that contain a match </li></ul></ul><ul><ul><li>-NUM (a number, e.g. -2): show NUM lines that precede and follow the match (same as -C NUM) </li></ul></ul><ul><ul><li>--color : highlight the match </li></ul></ul>
  58. 58. grep <ul><li>Get james' record in the user database: $ grep james /etc/passwd james:x:500:100:James Watson: /home/james:/bin/bash </li></ul><ul><li>Which files in /etc contain the string 'PS1': $ grep -l PS1 /etc/* 2> /dev/null /etc/bashrc /etc/bashrc.rpmnew /etc/rc.sysinit </li></ul>
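The grep options can be compared on a small, made-up user list. The file `users.txt` and its three records are hypothetical; `-c` (count matching lines) is a standard grep option that is used here in addition to the ones listed on the slides.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
printf '%s\n' 'james:x:500' 'Rosalind:x:501' 'JAMES:x:502' > users.txt

exact=$(grep -c 'james' users.txt)      # case-sensitive: only the first line
nocase=$(grep -ci 'james' users.txt)    # -i: also matches JAMES
inverse=$(grep -vi 'james' users.txt)   # -v: the lines that do NOT match
files=$(grep -l 'Rosalind' users.txt)   # -l: file names only

echo "$exact $nocase"
echo "$inverse"
echo "$files"
```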
  59. 59. Exercise <ul><li>Go to . On the main page you find a link 'Download Human genome sequence'. </li></ul><ul><li>Download Homo_sapiens.GRCh37.64.dna.chromosome.21.fa.gz (note the extension!!) </li></ul><ul><li>Save and extract it in your folder bioinfo/data. You might have to set the permissions to rw-r--r-- by typing chmod a+r Homo* . </li></ul><ul><li>Count how many fasta sequences are in that file to check that there is only one, and display its name. </li></ul><ul><li>Use yum to check whether the tool 'screen' is installed </li></ul>
  60. 60. Solution <ul><li>$ grep '>' Homo_sapiens.GRCh37.64.dna.chromosome.21.fa </li></ul><ul><li>$ yum list installed | grep screen </li></ul>
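The solution above prints the header lines; counting them needs one more step. The sketch below shows one way to do it, and reads the compressed file directly so that it does not have to be extracted first. It uses a hypothetical two-record file `demo.fa` instead of the chromosome 21 download.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
printf '%s\n' '>seq1 demo record' 'ACGTACGT' '>seq2 demo record' 'GGCCGGCC' > demo.fa
gzip demo.fa                                   # leaves demo.fa.gz

# gzip -dc decompresses to stdout (like zcat); the file on disk stays compressed.
n_records=$(gzip -dc demo.fa.gz | grep -c '^>')            # -c counts matching lines
first_name=$(gzip -dc demo.fa.gz | grep '^>' | head -n 1)  # show the first header

echo "$n_records"
echo "$first_name"
```

Anchoring the pattern with `^` restricts the match to lines that start with `>`, which is what defines a FASTA header.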
  61. 61. Word Count <ul><li>A general tool for counting lines, words and characters: wc [ options ] file(s) </li></ul><ul><li>Interesting options: </li></ul><ul><ul><li>c : number of characters </li></ul></ul><ul><ul><li>w : number of words </li></ul></ul><ul><ul><li>l : number of lines </li></ul></ul><ul><li>Example: How many packages are installed? $ yum list installed | wc -l </li></ul>
  62. 62. Transform <ul><li>To manipulate individual characters in an input stream: $ tr 's1' 's2' </li></ul><ul><li>! tr always reads from stdin – you cannot specify any files as command line arguments </li></ul><ul><li>Characters in s1 are replaced by characters in s2 </li></ul><ul><li>The result is written to stdout </li></ul><ul><li>Example: $ echo 'James Watson' | tr '[a-z]' '[A-Z]' JAMES WATSON </li></ul>
  63. 63. Transform <ul><li>To remove a particular set of characters: $ tr -d 's1' </li></ul><ul><li>Deletes all characters in s1 </li></ul><ul><li>Reads from stdin , writes to stdout </li></ul><ul><li>Example: remove the carriage returns from a DOS text file: $ tr -d '\r' < DOStext > UNIXtext </li></ul>
  64. 64. awk <ul><li>awk is an extraction and reporting tool that works very well with tabular data </li></ul><ul><li>It reads in your file, splits each line on white space into fields, and lets you specify which fields to output </li></ul><ul><li>Excellent documentation: </li></ul>
  65. 65. awk (1) <ul><li>Extraction of one or more fields in a tabular data stream: awk -F delim '{ print $ x }' </li></ul><ul><li>Here is </li></ul><ul><ul><li>-F delim the field separator (default is white space) </li></ul></ul><ul><ul><li>$ x the field number: </li></ul></ul><ul><ul><ul><li>$0: the complete line </li></ul></ul></ul><ul><ul><ul><li>$1: first field </li></ul></ul></ul><ul><ul><ul><li>$2: second field </li></ul></ul></ul><ul><ul><ul><li>… </li></ul></ul></ul><ul><ul><li>NF is the number of fields (so $NF is the last field). </li></ul></ul><ul><ul><li>Note: calculations can be done between { } with $x </li></ul></ul>
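A short sketch of tr in its three typical roles: deleting characters, changing case and translating one alphabet into another. The file names and the DNA string are hypothetical examples.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"

# A two-line file with DOS line endings (carriage return + line feed).
printf 'line1\r\nline2\r\n' > dos.txt
tr -d '\r' < dos.txt > unix.txt          # delete every carriage return
dos_bytes=$(wc -c < dos.txt)
unix_bytes=$(wc -c < unix.txt)           # two bytes smaller: one \r per line

upper=$(echo 'James Watson' | tr 'a-z' 'A-Z')
complement=$(echo 'AACGT' | tr 'ACGT' 'TGCA')   # base-by-base complement

echo "$dos_bytes -> $unix_bytes"
echo "$upper"
echo "$complement"
```

The backslash in `'\r'` matters: `tr -d 'r'` would delete every letter r instead of the carriage returns.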
  66. 66. awk (1) <ul><li>Examples: Extract owner and name of a file: $ ls -l | awk '{ print $3, $9 }' </li></ul><ul><li>Show all users and their UID $ awk -F: '{ print $3, $1 }' /etc/passwd </li></ul><ul><li>Show all Arabidopsis mRNA with more than 50 exons </li></ul><ul><li>$ awk '{ if ($10>50) print $4 }' TAIR9_mRNA.bed </li></ul>
  67. 67. awk (2) <ul><li>Extraction of one or more fields from a tabular data stream of lines that match a given regex : awk -F delim '/ regex / { print $ x }' </li></ul><ul><li>Here is: </li></ul><ul><ul><li>regex : a regular expression </li></ul></ul><ul><ul><li>the awk script is executed only if the line matches regex </li></ul></ul><ul><ul><li>lines that do not match regex are removed from the stream </li></ul></ul>
  68. 68. awk (2) <ul><li>Example: print the chromosome and the number of exons of the mRNAs on the first chromosome: $ awk '/chr1/ {print $1,$10}' TAIR9_mRNA.bed </li></ul>
  69. 69. cut <ul><li>A similar tool is cut , it extracts fields from fixed text file formats only: </li></ul><ul><ul><li>fixed width $ cut -c LIST [ file ] </li></ul></ul><ul><ul><li>fixed delimiter $ cut [-d delim ] -f LIST [ file ] </li></ul></ul><ul><li>For LIST: </li></ul><ul><ul><li>N : the Nth element </li></ul></ul><ul><ul><li>N-M : from the Nth through the Mth element </li></ul></ul><ul><ul><li>N- : from the Nth element on </li></ul></ul><ul><ul><li>-M : till the Mth element </li></ul></ul><ul><li>The first element is 1 </li></ul>
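A regex pattern matches anywhere in the line, so a pattern such as /chr1/ would also select a chromosome named chr10. The sketch below contrasts a regex pattern with an exact comparison on the first field, and shows a calculation inside the braces. The four-column file `demo.bed` is a hypothetical, simplified table and not the TAIR9 file.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
# Columns: chromosome, start, end, name (tab separated).
printf 'chr1\t100\t500\tgeneA\nchr10\t50\t90\tgeneB\nchr1\t200\t1200\tgeneC\n' > demo.bed

regex_hits=$(awk '/chr1/' demo.bed | wc -l)      # the regex also matches chr10

# Exact test on field 1, plus a calculation: feature length = end - start.
lengths=$(awk -F'\t' '$1 == "chr1" { print $4, $3 - $2 }' demo.bed)

last_field=$(awk -F'\t' '{ print $NF }' demo.bed | head -n 1)   # $NF: last field

echo "$regex_hits"
echo "$lengths"
echo "$last_field"
```

Whether the regex or the exact comparison is appropriate depends on how the chromosomes are named in your own file.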
  70. 70. cut <ul><li>Fixed width example: </li></ul><ul><ul><li>Suppose there is a file fixed.txt with content 12345ABCDE67890FGHIJ </li></ul></ul><ul><ul><li>To extract a range of characters: $ cut -c 6-10 fixed.txt ABCDE </li></ul></ul>
  71. 71. cut <ul><li>Fixed delimiter example: </li></ul><ul><ul><li>Default delimiter is TAB </li></ul></ul><ul><ul><li>To extract the UID, account and GECOS fields from /etc/passwd : $ cut -d: -f 3,1,5 /etc/passwd root:0:root ... ! Note the output order: fields are always printed in the order of the file, not in the order of LIST. </li></ul></ul><ul><ul><li>The command below gives exactly the same result: </li></ul></ul><ul><ul><li>$ cat /etc/passwd | tr ':' '\t' | cut -f 3,1,5 --output-delimiter ':' root:0:root ... </li></ul></ul>
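The output-order remark can be checked on a single record. In this sketch the passwd-style line is the one shown earlier for user james; awk is used as one possible way to get the fields in a different order, since cut cannot reorder them.

```shell
line='james:x:500:100:James Watson:/home/james:/bin/bash'

# cut prints the selected fields in file order, whatever the order in the list.
fields=$(echo "$line" | cut -d: -f 3,1,5)

# awk prints fields in the order you name them.
reordered=$(echo "$line" | awk -F: '{ print $3 ":" $1 ":" $5 }')

# Fixed-width extraction: characters 6 through 10.
chars=$(echo '12345ABCDE67890FGHIJ' | cut -c 6-10)

echo "$fields"
echo "$reordered"
echo "$chars"
```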
  72. 72. sort <ul><li>To sort alphabetically or numerically lines of text: $ sort [ options ] file(s) </li></ul><ul><li>When one or more file(s) are specified, they are read one by one, but all lines are sorted. </li></ul><ul><li>The output is written to stdout </li></ul><ul><li>When no file(s) arguments are given, sort reads input from stdin </li></ul>
  73. 73. sort <ul><li>Interesting options: </li></ul><ul><ul><li>n : sort numerically </li></ul></ul><ul><ul><li>f : fold – case-insensitive </li></ul></ul><ul><ul><li>r : reverse sort order </li></ul></ul><ul><ul><li>t s : use s as field separator (instead of space) </li></ul></ul><ul><ul><li>k n : sort on the n -th field (1 being the first field) </li></ul></ul>
  74. 74. sort <ul><li>Example: sort the mRNAs by chromosome and then by number of exons, and save the result: $ sort -k1,1 -k10,10n TAIR9_mRNA.bed > out.bed </li></ul>
  75. 75. uniq <ul><li>This tool allows you to: </li></ul><ul><ul><li>eliminate duplicate lines in a set of files </li></ul></ul><ul><ul><li>display unique lines </li></ul></ul><ul><ul><li>display and count duplicate lines </li></ul></ul><ul><li>! uniq only compares adjacent lines, so always start from sorted input </li></ul>
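When sorting on several fields, each key is best limited to a single field: `-k1,1` means "field 1 only", whereas `-k1` means "from field 1 to the end of the line". A minimal sketch with a hypothetical two-column table (`exons.tsv`: chromosome and exon count):

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
printf 'chr2\t7\nchr1\t12\nchr1\t3\nchr2\t10\n' > exons.tsv

tab=$(printf '\t')
# First key: field 1, compared as text. Second key: field 2, compared as a number.
sorted=$(sort -t "$tab" -k1,1 -k2,2n exons.tsv)

echo "$sorted"
```

Putting the `n` on the second key only keeps the chromosome names sorted as text while the counts are sorted numerically, so 3 comes before 12.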
  76. 76. Eliminate duplicates <ul><li>To eliminate duplicate lines: $ uniq file(s) </li></ul><ul><li>Example: $ who root tty1 Oct 16 23:20 james tty2 Oct 16 23:20 james pts/0 Oct 16 23:21 james pts/1 Oct 16 23:22 james pts/2 Oct 16 23:22 $ who | awk '{print $1}' | sort | uniq james root </li></ul>
  77. 77. Display unique or duplicate lines <ul><li>To display lines that occur only once: </li></ul><ul><li>$ uniq -u file(s) </li></ul><ul><li>To display lines that occur more than once: $ uniq -d file(s) </li></ul><ul><li>Example: $ who | awk '{print $1}' | sort | uniq -d james </li></ul><ul><li>To display the counts of the lines: </li></ul><ul><li>$ uniq -c file(s) </li></ul><ul><li>Example: </li></ul><ul><li>$ who | awk '{print $1}' | sort | uniq -c 4 james </li></ul><ul><li>1 root </li></ul>
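Why uniq needs sorted input, and how the sort | uniq -c | sort -rn idiom produces a frequency table, can be seen in a short sketch. The list of user names is hypothetical and stands in for the output of who.

```shell
users=$(printf '%s\n' james root james james)

# Without sorting, only ADJACENT duplicates collapse: james, root, james remain.
unsorted=$(echo "$users" | uniq | wc -l)

# After sorting, all equal lines are adjacent.
sorted=$(echo "$users" | sort | uniq | wc -l)

# Frequency table, most frequent first; awk picks the name from the top row.
top=$(echo "$users" | sort | uniq -c | sort -rn | head -n 1 | awk '{ print $2 }')

echo "$unsorted $sorted $top"
```

The same idiom can summarise any column of a tabular file once that column has been extracted with cut or awk.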
  78. 78. Comparing text files <ul><li>To find differences between two text files: $ diff [ options ] file1 file2 </li></ul><ul><li>Example: difference between two genbank versions of LOCUS CAA98068 # diff 1,2c1,3 < LOCUS CAA98068 445 aa linear INV 27-OCT-2000 < DEFINITION ZK822.4 [Caenorhabditis elegans]. --- > LOCUS CAA98068 453 aa linear INV 09-MAY-2010 > DEFINITION C. elegans protein ZK822.4, confirmed by transcript evidence > [Caenorhabditis elegans]. 4,5c5,6 < VERSION CAA98068.1 GI:3881817 < DBSOURCE embl locus CEZK822, accession Z73898.1 --- > VERSION CAA98068.2 GI:14530708 > DBSOURCE embl accession Z73898.1 </li></ul>
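Reading diff output is easier on a tiny example. In the sketch below, `old.txt` and `new.txt` are hypothetical three-line files that differ on their second line only.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
printf '%s\n' 'LOCUS demo' 'VERSION 1' 'ORIGIN' > old.txt
printf '%s\n' 'LOCUS demo' 'VERSION 2' 'ORIGIN' > new.txt

# diff exits with status 1 when the files differ; '|| true' ignores that.
changes=$(diff old.txt new.txt || true)

echo "$changes"
```

In the output, `2c2` means "line 2 of the first file was changed into line 2 of the second file"; lines starting with `<` come from the first file, lines starting with `>` from the second.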
  79. 79. Comparing text files <ul><li>There exists a text tool, sdiff , that compares two files side by side . </li></ul><ul><li>However, when visualizing data, one is far better off using graphical tools. Visual diff: meld </li></ul>
  80. 80. Comparing files – GUI (meld)
  81. 81. Exercise <ul><li>List the contents of directory ~/bioinfo and all subdirectories </li></ul><ul><li>Repeat this command, but save the content to /tmp/bioinfo.list </li></ul><ul><li>List the contents of /tmp/proc.list. Can you get rid of the errors? </li></ul>
  82. 82. Exercises <ul><li>File sample.sam is a tab delimited file. See on the BITS wiki for a description of .sam format . </li></ul><ul><ul><li>How many lines has this file </li></ul></ul><ul><ul><li>How many start with the comment sign @ </li></ul></ul><ul><ul><li>Provide a summary of the FLAG field (second field): the FLAG and the number of times counted </li></ul></ul><ul><ul><li>Can you give the above sorted on number of times observed </li></ul></ul><ul><li>File TAIR9_mRNA.bed is also a sorted file. </li></ul><ul><ul><li>How many different genes are in the file </li></ul></ul><ul><ul><li>By using head you can see that the fourth column contains 0. Is there another number in that column? </li></ul></ul><ul><ul><li>Give the 10 genes with longest CDS in Arabidopsis </li></ul></ul><ul><ul><li>Tips: </li></ul></ul>
  83. 83. Solution <ul><li>awk '{print $4,",",$11}' TAIR9_mRNA.bed | </li></ul><ul><li>tr -d ' ' | awk -F, '{s=0; for(i=2;i<NF;i++) s+=$i; print $1,s}' | sort -r -n -k2 | head -10 </li></ul><ul><li>The result can be found on the server, lastresult.txt </li></ul>
  84. 84. links <ul><li> </li></ul><ul><li> </li></ul><ul><li>Started a big process and want it to run in the background? Press Ctrl+Z to suspend it and type bg ; bring it back to the foreground with fg ; list your jobs with jobs </li></ul>
  85. 85. Linux <ul><li>Put the fun back into computing </li></ul>