1. 1/13/15
1
Next-Generation Sequencing Analysis Series
January 14, 2015
Andrew Oler, PhD
High-throughput Sequencing Bioinformatics Specialist
BCBB/OCICB/NIAID/NIH
BCBB instructors for this NGS series
Andrew Oler Vijay Nagarajan Mariam Quiñones
Bioinformatics and Computational
Biosciences Branch
NIH/NIAID/OD/OSMO/OCICB
Contact BCBB at
ScienceApps@niaid.nih.gov
Contact HPC Cluster team at:
Cluster_support@niaid.nih.gov
Bioinformatics and Computational
Biosciences Branch
§ Bioinformatics Software
Developers
§ Computational Biologists
§ Project Managers &
Analysts
http://www.niaid.nih.gov/about/organization/odoffices/omo/ocicb/Pages/bcbb.aspx
Objectives
When you leave today, I hope you will be able to
1. Open a terminal and know how to navigate
2. Know how to do basic file manipulation and create files and
directories from the command line
3. Submit a job to the HPC cluster
To accomplish these goals, we will
1. Learn the most useful Unix terminal commands
2. Practice a few of these commands
3. Practice preparing and submitting some scripts to the NIAID
HPC Cluster
Caveat:
1. You may not be a Unix expert when you leave today (and
that’s okay).
Anatomy of the Terminal, “Command Line”,
or “Shell”
Prompt (computer_name:current_directory username)
Cursor
Command Argument
Window
Output
Mac: Applications -> Utilities -> Terminal
Windows: Download open source software
PuTTY http://www.chiark.greenend.org.uk/~sgtatham/putty/
Other SSH Clients (http://en.wikipedia.org/wiki/Comparison_of_SSH_clients)
File Manager/Browser by Operating System
OS:  Windows    Mac OS X   Unix
FM:  Explorer   Finder     Shell
Typical UNIX directory structure
/                 “root”
/bin              essential binaries
/etc              system config
/home             user directories
/home/USER1       USER1 home
/home/USER2       USER2 home
/mnt              network drives
/sbin             system binaries
/usr              shared, read-only
/usr/bin          other binaries
/usr/local        installed packages
/usr/local/bin    installed binaries
/var              variable data
/var/tmp          program caches
pwd “print working directory”; tells where you are
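To make the directory tree concrete, here is a minimal sketch of moving around with absolute and relative paths. The directory names "project" and "data" are made up for illustration, and everything happens in a throwaway directory.

```shell
start=$PWD
sandbox=$(mktemp -d)        # throwaway directory; nothing real is touched
cd "$sandbox"               # absolute path: go straight there
pwd                         # print where you are now
mkdir -p project/data       # create a directory and a subdirectory
cd project/data             # relative path: go down two levels
here=$(basename "$PWD")     # data
cd ..                       # go up one level
parent=$(basename "$PWD")   # project
ls                          # lists: data
cd "$start"                 # return to where you began
rm -rf "$sandbox"           # clean up the sandbox
```

A path that starts with "/" is absolute; anything else is read relative to the directory that pwd reports.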
How to execute a command
command argument
output
output
Some basic Unix commands
§ pwd
§ ls
§ mkdir
§ cd
§ wget
§ curl
§ cp
§ wc
§ head
§ tail
§ less
§ cat
§ **See Pre-lecture worksheet.**
Tips to make life easier!
Tab completion: hit Tab to make computer guess your filename.
type: ls unix[Tab]
result: ls unix_hpc
If nothing happens on the first tab, press tab again…
Up Arrow: recall the previous command(s)
Ctrl+a go to beginning of line
Ctrl+e go to end of line
Ctrl+c kill current running process in terminal
Aliases (put in ~/.bashrc file … see handout)
alias ls='ls -AFG'
alias ll='ls -lrhT'
history show every command issued during the session
!ls repeat the previous “ls” command
!! repeat the previous command
man [command] read the manual for the command
man ls read the manual for the ls command
Accessing the NIAID HPC
§ Login to HPC “submit node,” which is the computer from which you submit jobs.
ssh secure shell, remote login
ssh ngscXXX@hpc.niaid.nih.gov fill in XXX with number
§ Copy files to/from HPC
scp secure copy to remote location
scp -r ~/data/dir username@hpc.niaid.nih.gov:~/data/
§ ssh and scp will prompt you to enter your password
mv (“move file”)
mv file1 temp/ move “file1” to the “temp” directory
mv file1 file2 rename “file1” to “file2”
mv -i file2 temp/file3 move “file2” to the “temp” directory and
rename it “file3”; ask to make sure
*without -i, it will overwrite an existing file!*
mv *.fastq ~ move all “.fastq” files to the home directory
Exercise 1:
mv *.fastq temp/ (move all “.fastq” files to the “temp” directory)
ls temp (check that the files are there)
Note: syntax for mv and cp are similar
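The overwrite warning above can be tried safely in a sandbox. This sketch uses the -n ("no clobber") flag, which is not on the slide but is available in both GNU and BSD mv; the file names and contents are made up.

```shell
start=$PWD
sandbox=$(mktemp -d)
cd "$sandbox"
mkdir temp
echo "new reads" > sample.fastq
echo "old reads" > temp/sample.fastq   # same name already exists in temp/

# -n refuses to overwrite an existing file (-i would ask instead).
mv -n sample.fastq temp/
kept=$(cat temp/sample.fastq)          # still the old contents

# Rename on the way in to keep both copies.
mv sample.fastq temp/sample_new.fastq
count=$(ls temp | wc -l)               # two files now live in temp/
ls temp
cd "$start"
rm -rf "$sandbox"
```

Without -n or -i, the first mv would have silently replaced the old file.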
rm (“remove file”)
rm file1 delete “file1”
rm -i file2 delete “file2”, but ask first
rm *.pdb delete all “.pdb” files
rm -r temp delete the “temp” directory
rm -rf temp delete the “temp” directory, no questions asked!
Be careful!
rm -r *
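One habit that makes rm less dangerous is previewing what a wildcard matches before deleting. A minimal sketch in a throwaway directory, with made-up file names:

```shell
start=$PWD
sandbox=$(mktemp -d)
cd "$sandbox"
touch a.pdb b.pdb notes.txt

ls *.pdb            # preview exactly what the pattern matches
echo rm *.pdb       # or print the command the shell would actually run
rm *.pdb            # then delete using the same pattern
remaining=$(ls)     # only notes.txt is left
echo "$remaining"
cd "$start"
rm -rf "$sandbox"
```

There is no trash can on the command line, so a deleted file is gone unless you have a backup.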
File and system information
wc file1 “word count”; output is “lines”, “words”, “characters”
wc *.fastq “word count” of all fastq files, including summary
du -h temp “disk usage” (size) of the “temp” directory and of each
directory inside it (outputs a list)
top report for local machine on the processes using the
most system resources (memory, CPU, etc.); “q” to
exit
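A small sketch of wc and du on a file whose contents are known in advance, so the counts can be checked by eye. The file is made up for illustration.

```shell
start=$PWD
sandbox=$(mktemp -d)
cd "$sandbox"
printf 'one two\nthree\n' > file1

wc file1                  # reports 2 lines, 3 words, 14 characters
lines=$(wc -l < file1)    # -l prints only the line count
words=$(wc -w < file1)    # -w prints only the word count
du -sh .                  # -s gives one summary line for the directory
cd "$start"
rm -rf "$sandbox"
```

Reading from "< file1" keeps the file name out of the output, which is handy when the number is stored in a variable.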
File compression
gzip temp/* compress every file in “temp”;
adds .gz extension
gunzip temp/*.gz expand every “gzipped” file in
“temp”
tar -zcvf myfiles.tar.gz temp/* create a single archive of
every file in “temp”
tar -xvf test_data.tar.gz copy every file out of the archive
(the “tarball”)
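The compression commands can be tested as a round trip: archive, list, extract elsewhere, and compare. This is a sketch with made-up file contents; the "restore" directory name is an assumption for illustration.

```shell
start=$PWD
sandbox=$(mktemp -d)
cd "$sandbox"
mkdir temp restore
echo "ACGT" > temp/a.txt
echo "TTGA" > temp/b.txt

tar -zcf myfiles.tar.gz temp          # archive and compress the directory
tar -ztf myfiles.tar.gz               # -t lists contents without extracting
tar -zxf myfiles.tar.gz -C restore    # -C extracts into another directory
restored=$(cat restore/temp/a.txt)    # same contents as the original

gzip temp/a.txt                       # replaces a.txt with a.txt.gz
gunzip temp/a.txt.gz                  # and back again
roundtrip=$(cat temp/a.txt)
cd "$start"
rm -rf "$sandbox"
```

Listing an archive with -t before extracting shows where the files will land.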
File manipulation
cat file1 file2 > file3 write “file3”, containing first “file1”,
then “file2”
cat file1 >> file2 append “file1” onto “file2”
sort file1 alphabetize “file1”
sort -n file1 sort “file1” by number
sort -n -r -k 2 file1 sort “file1” by the second word or
column in reverse numerical order
Careful! “>” overwrites an existing file without asking.
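A short sketch of cat and sort on a two-column file. The gene names and counts are invented purely to show the difference between alphabetical and numeric sorting.

```shell
start=$PWD
sandbox=$(mktemp -d)
cd "$sandbox"
printf 'geneB 5\ngeneA 20\n' > file1
printf 'geneC 100\n' > file2

cat file1 file2 > file3            # file3 = file1 followed by file2
sort file3                         # alphabetical by line: geneA, geneB, geneC
top=$(sort -n -r -k 2 file3 | head -n 1)   # largest value in column 2
echo "$top"
cat file1 >> file2                 # append: file2 grows, nothing is lost
appended=$(wc -l < file2)
cd "$start"
rm -rf "$sandbox"
```

Without -n, the second column would be compared as text, so "100" would sort before "20".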
grep (search within files)
grep key file* report the file name and line where “key” appears in file*
grep -v key file* report the lines in file* that do not match “key”
man grep see other functions of grep. (lots! regular expressions!)
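A sketch comparing a few grep modes on two tiny made-up files. The -L and -c flags are not on the slide; both exist in GNU and BSD grep.

```shell
start=$PWD
sandbox=$(mktemp -d)
cd "$sandbox"
printf 'chr1 key\nchr2 other\n' > fileA
printf 'chr3 other\n' > fileB

grep key fileA fileB                        # file name and matching line
nonmatching=$(grep -v key fileA fileB | wc -l)   # lines WITHOUT "key"
missing=$(grep -L key fileA fileB)          # files with no match at all
hits=$(grep -c key fileA)                   # count of matching lines
echo "$missing"
cd "$start"
rm -rf "$sandbox"
```

Here -v inverts the match line by line, while -L answers the different question of which files never match.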
Linking files (making “shortcuts”)
ln -s ~/myapp/binary ~/bin make a shortcut (“symbolic link”)
in “~/bin” that points to “~/myapp/binary”
ln -s /usr/local data make a shortcut in current
directory pointing to /usr/local
cd data takes you to /usr/local
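A sketch showing that a symbolic link is only a pointer: removing the link leaves the original file alone. The layout mirrors the slide's ~/myapp and ~/bin example but lives in a throwaway directory.

```shell
start=$PWD
sandbox=$(mktemp -d)
cd "$sandbox"
mkdir myapp bin
printf '#!/bin/bash\necho hello\n' > myapp/binary

ln -s "$sandbox/myapp/binary" bin/    # creates the link bin/binary
ls -l bin                             # the arrow shows where it points
target=$(readlink bin/binary)         # path stored inside the link
rm bin/binary                         # removes the link only
survived=no
[ -f myapp/binary ] && survived=yes   # the original is untouched
cd "$start"
rm -rf "$sandbox"
```

Using a full path for the target keeps the link valid no matter which directory you create it from.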
Downloading Files
wget download multiple files from ftp or http address
curl download single files from ftp, http, sftp, etc.
http://curl.haxx.se/docs/comparison-table.html
Pipelining
ls | wc -l count the number of files in a directory
grep | sort > file1 pull out searched-for lines, sort them, and
write a new file
Exercise 2:
head -n 2000 lymph1k.fastq | gzip > head2K.txt.gz
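The same pattern as Exercise 2 can be tried without the course data. This sketch builds a tiny stand-in for a FASTQ file (made-up reads, 4 lines per read), keeps the first two reads, and checks the result.

```shell
start=$PWD
sandbox=$(mktemp -d)
cd "$sandbox"
# Three fake reads, four lines each (header, sequence, "+", quality).
for n in 1 2 3; do
    printf '@read%s\nACGT\n+\nIIII\n' "$n"
done > tiny.fastq

head -n 8 tiny.fastq | gzip > first2.fastq.gz        # first 2 reads, compressed
reads=$(gunzip -c first2.fastq.gz | grep -c '^@read')  # count read headers
echo "$reads"
cd "$start"
rm -rf "$sandbox"
```

Each "|" hands the output of one command to the next, so no intermediate file is written until the final ">".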
Loops
for assign a variable for each of a space-separated list of values
; Use to separate commands
do done Marks start and end of loop to repeat
Exercise 3:
for i in 1 13 200; do echo $i; done
1
13
200
ls
for i in file*; do echo $i; mv "$i" "${i}.txt"; done
file1
file2
ls
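Before letting a loop rename real files, it helps to do a dry run that only prints the commands. A sketch in a throwaway directory; the undo step uses the "${i%.txt}" expansion, which is not on the slide and strips a trailing ".txt".

```shell
start=$PWD
sandbox=$(mktemp -d)
cd "$sandbox"
touch file1 file2

for i in file*; do echo mv "$i" "${i}.txt"; done   # dry run: print only
for i in file*; do mv "$i" "${i}.txt"; done        # real run
renamed=$(ls | tr '\n' ' ')

for i in *.txt; do mv "$i" "${i%.txt}"; done       # undo: strip the suffix
restored=$(ls | tr '\n' ' ')
cd "$start"
rm -rf "$sandbox"
```

The file list is expanded once, before the loop starts, so the renamed files are not picked up a second time.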
Recommended Reading
Linux in a Nutshell, Sixth Edition. Ellen Siever, Stephen Figgins, Robert Love, Arnold Robbins
Running Linux, 5th Edition. Matthias Kalle Dalheimer, Matt Welsh
UNIX® Shells by Example, Fourth Edition. Ellie Quigley
Take Away
§ Use mnemonics
§ Read “man” pages
§ Work on copies, make backups, and use “rm”, “mv”, and “>” carefully
§ Pick a text editor and master it (pico/nano, emacs, vi/vim, etc.)
§ Be clever!
Questions?
Using NIAID Grid Engine Cluster
High Performance Computing
§ “A computer cluster consists of a set of loosely connected
computers that work together so that in many respects they can
be viewed as a single system.”
http://en.wikipedia.org/wiki/Cluster_%28computing%29
HPC Glossary
§ node individual workstation within a network or cluster; a collection of processors all
accessing the same memory (RAM).
§ CPU abbreviation for central processing unit. The processor of a node. Also
referred to as a socket.
• try cat /proc/cpuinfo
• Note that “processors” in the output are actually “cores” by the definition below
§ core separate execution core for calculations. e.g., “dual-core” means the
processor has two cores. Sometimes each core is referred to as a separate
processor.
§ slot a single core available for use within a node. e.g., if a node has 16 cores, it
will have 16 slots.
§ hyper-threaded technology (HTT) Where a single execution core is treated as
being two virtual cores (or two logical processors) by the system. Some of the nodes
in the cluster have HTT. E.g., if there are 16 physical cores, there would be 32 logical
processors.
§ thread a single process of a multi-process job. Each thread runs on a separate logical
processor. E.g., if you run tophat with -p 10, 10 threads will be created and run in
parallel.
These definitions are somewhat flexible…
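To see how many logical processors a machine reports, the /proc/cpuinfo tip above can be turned into a count. This is a sketch: the fallback to getconf for systems without /proc/cpuinfo (such as a Mac) is an addition, not part of the slide.

```shell
# Count "processor" entries where /proc/cpuinfo exists (Linux);
# otherwise ask getconf for the number of online processors.
if [ -r /proc/cpuinfo ]; then
    logical=$(grep -c '^processor' /proc/cpuinfo)
else
    logical=$(getconf _NPROCESSORS_ONLN)
fi
echo "logical processors: $logical"
```

On a node with hyper-threading, this number counts logical processors, so it can be twice the number of physical cores.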
Accessing the NIAID HPC
§ Request an HPC account (for NIAID members and collaborators only)
• https://hpcweb.niaid.nih.gov/#home
• “Request Account”
§ Login to HPC “submit node,” which is the computer from which you submit jobs.
ssh secure shell, remote login
ssh username@hpc.niaid.nih.gov
§ Copy files to/from HPC
scp secure copy to remote location
scp -r ~/data/dir username@hpc.niaid.nih.gov:~/data/
§ ssh and scp will prompt you to enter your password
Mounting HPC Drives
Mac Windows
1. Click “Start” > “Computer”.
2. Click on “Map Network Drive”.
3. Choose an available drive letter.
4. Enter \\ai-hpcfileserver.niaid.nih.gov\bcbb in the “Folder” field,
replacing “bcbb” with your group name or your user name.
(For more details, see link to FAQ below.)
https://hpcweb.niaid.nih.gov/#support?type=Links&requestType=HPC%20FAQs&name=41
Cluster Architecture and Access
image modified from http://ainkaboot.co.uk/
regular.q interactive.q memLong.q
qrsh -q interactive.q
qsub -q memLong.q
ssh username@hpc.niaid.nih.gov
Submit Node
Cluster Queue System: Sun Grid Engine
§ Cluster computers run the Red Hat Linux operating system
§ Grid Engine is a batch queuing system
§ Other queuing systems (http://en.wikipedia.org/wiki/Job_scheduler):
• Portable Batch System (PBS) (e.g., Biowulf)
• TORQUE Resource Manager
• Maui
• Moab
• others…
• Each will require a slightly different syntax for scripts
§ Comes with a set of commands to communicate with the cluster
§ Monitors available resources and users’ workloads to start jobs at the appropriate time
Grid Engine jobs
§ Three types of jobs
• Batch/Serial (one node, one processor)
• Parallel (multiple processors or nodes)
• Interactive
[Figure: Input → Process → Output diagrams for serial and parallel jobs]
Grid Engine Jobs: Interactive
§ Log in to a compute node, much as you would with ssh
qrsh -l h_vmem=20G
§ Need to specify parameters
-l requested resources in comma-separated list
• For interactive job:
h_vmem=
§ For Biowulf (PBS) (http://biowulf.nih.gov/user_guide.html#interactive):
qsub -I -V -l nodes=1
Cluster Architecture and Access
image from http://ainkaboot.co.uk/
regular.q interactive.q memLong.q
qrsh -q interactive.q
ssh username@hpc.niaid.nih.gov
Submit Node
Test TopHat Job in Interactive Session
§ TopHat is a short read aligner for RNA-seq data
§ Manual:
§ http://ccb.jhu.edu/software/tophat/manual.shtml
1. Check dependencies (e.g., PATH)
2. Check command syntax and options
3. Run command with test dataset
Grid Engine Jobs: Batch / Serial
§ Single processor, one job
§ Submit a script to the cluster from the submit node,
“submit-1”
Cluster Architecture and Access
image from http://ainkaboot.co.uk/
regular.q interactive.q memLong.q
qsub -q memLong.q script.sh
OR
qsub script.sh
*No queue necessary*
ssh username@hpc.niaid.nih.gov
Submit Node
Text Editors for Composing Scripts (batch jobs)
§ Not the same as a word processor (e.g., Microsoft Word)!
§ Try some, choose a favorite
§ Popular for Windows:
• Notepad++ (nice color-coding)
• EditPad Lite (can open large files > 4 GB)
§ Popular for Mac:
• TextWrangler
§ Popular for Terminal:
• nano
• vi
• emacs
§ http://en.wikipedia.org/wiki/Comparison_of_text_editors
Quick Look at a Shell Script
Exercise 4:
cd ~/unix_hpc/test_data
cat test_serial.sh
§ A few things to notice:
• #!/bin/bash
– “shebang” or “hashbang,” used to specify the program to run for the script
• qsub options (next slide)
• export (used to set environment variables)
• PATH=/path/to/folder:/path/to/another/folder:$PATH
– used to allow you to simply type the name of the executable instead of the
full path to the executable, e.g., type “tophat” instead of
“/usr/local/bio_apps/tophat/bin/tophat”
• Comments about when you ran the job
• Command for job
*PBS Script for Biowulf as well.
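The effect of putting a folder on PATH can be seen with a stand-in program. In this sketch "mytool" is a hypothetical executable created on the spot; it is not part of the course data or the cluster.

```shell
sandbox=$(mktemp -d)
mkdir "$sandbox/bin"
# A tiny script that plays the role of an installed program.
printf '#!/bin/bash\necho "hello from mytool"\n' > "$sandbox/bin/mytool"
chmod +x "$sandbox/bin/mytool"

"$sandbox/bin/mytool"               # works, but needs the full path
export PATH="$sandbox/bin:$PATH"    # put the folder at the front of PATH
found=$(command -v mytool)          # where the shell now finds it
out=$(mytool)                       # the short name is enough
echo "$found"
```

Folders listed earlier in PATH win, so putting a folder first makes its programs take priority over ones with the same name elsewhere.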
SGE qsub options
qsub [options] script.sh command to submit a job to the cluster
-S /bin/bash shell to use (default is csh)
-N job_name name for your job
-q queue.q queue(s) to submit to, e.g.,
memLong.q,memRegular.q
-M user@niaid.nih.gov email address to send alert to
-m abe when to send email (e.g., beginning, end, aborted)
-l resources resources to request, e.g.,
h_vmem=20G,h_cpu=1:00:00,mem_free=10G
-cwd run from current working directory. Output to here.
-j y join stderr and stdout into one
-pe threaded 10 parallel environment and number of processors/threads:
“round” means processors could be on separate machines;
“threaded” means all processors are on the same machine.
§ You can put these options on the command-line or in your shell
script
§ Lines with these options should begin with #$
Submitting jobs with PBS (Biowulf)
§ PBS options and examples for Biowulf:
• http://biowulf.nih.gov/user_guide.html#batchsamp
§ Examples
• qsub -I -V -l nodes=1
• qsub -l nodes=1 myjob.bat
• qsub -l nodes=8:o2800 myparalleljob
• qsub -v np=3 -l nodes=2:g24:c24,mem=0 novompi.sh
§ Option lines start with #PBS instead of #$
§ Application-specific usage for Biowulf as well, e.g.,
Grid Engine Jobs: Batch / Serial
§ Submit a script to the cluster from the
submit node
Exercise 5:
cd ~/unix_hpc/test_data (remember to try tab completion)
qsub test_serial.sh
It should say “Your job XXXXXX ("tophat_test") has been
submitted” where XXXXXX is the job number.
ls -al
Do you see a file called tophat_test.oXXXXXX where
XXXXXX is your job number?
cat tophat_test.oXXXXXX (substitute job number for
XXXXXX)
Grid Engine Jobs: Parallel
§ pe commands (threaded, single, etc.)
§ Basic use in script:
#$ -pe threaded 8
§ Can also use advanced options, e.g.,
• "-pe 12threaded 48" means use 12 cores per node, for a total
of 48 cores needed. This will allocate the job to run on 4 nodes
with 12 cores each. Your program must be able to support this
• "-pe threaded 5-10" means run the job with 10 if available, but
down to 5 cores is fine too.
§ Do the math for memory!
• h_vmem is not total, it’s per thread. E.g., if you have a job that
needs 10G total, running on 5 processors, you’ll assign
h_vmem=2G, not h_vmem=10G.
• Let’s edit our script to make it run parallel…
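The per-thread memory arithmetic can be written out in the shell. This sketch uses the slide's numbers (10G total, 5 threads); rounding up to a whole gigabyte is an assumption made here so the job is never under-allocated.

```shell
total_gb=10     # memory the whole job needs
threads=5       # value that will go after "-pe threaded"

# Integer division, rounded up to the next whole gigabyte.
per_thread_gb=$(( (total_gb + threads - 1) / threads ))

# Print the two option lines that would go into the script.
echo "#\$ -pe threaded $threads"
echo "#\$ -l h_vmem=${per_thread_gb}G"
```

With these values the second line requests h_vmem=2G, matching the example in the bullet above.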
Edit Shell Script in the Terminal with nano
Navigation in nano:
§ use arrow keys for up, down, left, right
§ Ctrl+a for beginning of line; Ctrl+e for end of line
§ Other commands at bottom of screen e.g., Ctrl+o, Ctrl+x
Exercise 6:
cd ~/unix_hpc/test_data
Make new script for parallel, open in nano
cp test_serial.sh test_parallel.sh
nano test_parallel.sh
Add line to script with SGE options
#$ -pe threaded 4
Modify tophat command
tophat -p 4 …
Save and close
Ctrl+o, [ENTER]
Ctrl+x
Now submit the jobs
qsub test_serial.sh
qsub test_parallel.sh
Monitoring Jobs
Exercise 7:
qsub test_tenminutes.sh
qstat check on submitted jobs
echo $LOGNAME check your username
qstat -u $LOGNAME check status of your jobs
qstat -u $LOGNAME -ext check resource usage, including memory
qstat -u $LOGNAME -ext -g t get extended details, including MASTER, SLAVE
nodes for parallel jobs
qstat -j job-ID get detailed information about your job status
qacct -j 999072 see info about a job after it was run
qalter [new qsub options] [job id] In case you want to change parameters while in
“qw” status
qdel -u username delete all of your submitted jobs
qdel jobnumber delete a single job
§ Websites
• Cluster status:
http://hpcweb.niaid.nih.gov/#about?type=About%20Links&requestType=Cluster%20Status
• Current State: http://hpcwiki.niaid.nih.gov/index.php/Current_State
• Ganglia toolkit: http://cluster.niaid.nih.gov/ganglia/
Contact Us
andrew.oler@nih.gov
ScienceApps@niaid.nih.gov
http://bioinformatics.niaid.nih.gov
Example Script For SGE
#!/bin/bash
## SGE options (see man qsub for more options)
#$ -S /bin/bash #type of shell. default is csh
#$ -N tophat_test #name of job
#$ -q regular.q,memRegular.q #which queue to submit job to.
#$ -M andrewsgarbage@gmail.com #email address to send email to
#$ -m abe #when to send email: aborted, beginning, end
#$ -l h_vmem=5G,h_cpu=1:00:00 #resources (virtual memory, cpu time)
#$ -cwd #run the script from current working directory
#$ -j y #join stderr and stdout into one job_id.o file
## Script dependencies
#export the path for bowtie (tophat needs this)
export PATH=$PATH:/usr/local/bio_apps/bowtie
export PATH=$PATH:/usr/local/bio_apps/tophat/bin
export PATH=$PATH:/usr/local/bio_apps/samtools/
## Write comments (to make the future you happy)
# Ran tophat on the test dataset - andrew (111013)
#full path to tophat: /usr/local/bio_apps/tophat/bin/tophat
time tophat -r 20 test_ref reads_1.fq reads_2.fq
Callouts for the script above:
#!/bin/bash line: “hashbang,” to specify program used to run script
#$ lines: qsub options
export lines: export command for setting environment variables
last line: command for job