Description of Major Machine Learning Software Packages
How to use the SNNS for implementing ANN
Sudipto Saha
Introduction
SNNS (Stuttgart Neural Network Simulator) is a software simulator for neural networks
on Unix workstations developed at the Institute for Parallel and Distributed High
Performance Systems (IPVR) at the University of Stuttgart. The goal of the SNNS
project is to create an efficient and flexible simulation environment for research on and
application of neural nets.
The SNNS simulator consists of two main components:
1) a simulator kernel, written in C
2) a graphical user interface under X11R4 or X11R5
The simulator kernel operates on the internal network data structures of the neural nets
and performs all operations of learning and recall. It can also be used without the other
parts as a C program embedded in custom applications. It supports arbitrary network
topologies and, like RCS, supports the concept of sites. SNNS can be extended by the
user with user defined activation functions, output functions, site functions and learning
procedures, which are written as simple C programs and linked to the simulator kernel.
The graphical user interface XGUI (X Graphical User Interface), built on top of the
kernel, gives a 2D and a 3D graphical representation of the neural networks and controls
the kernel during the simulation run. In addition, the 2D user interface has an integrated
network editor which can be used to directly create, manipulate and visualize neural nets
in various ways.
Building Blocks of Neural Nets
The following paragraphs describe a generic model for the neural nets that can be
generated by the SNNS simulator. The basic principles and the terminology used in
dealing with the graphical interface are also briefly introduced. A more general and more
detailed introduction to connectionism can, e.g., be found in [RM86]. For readers fluent
in German, the most comprehensive and up-to-date book on neural network learning
algorithms, simulation systems and neural hardware is probably [Zel94].
A network consists of units and directed, weighted links (connections) between them.
In analogy to activation passing in biological neurons, each unit receives a net input that
is computed from the weighted outputs of prior units with connections leading to this
unit. The figure below shows a small network.
Figure: A small network with three layers of units
The actual information processing within the units is modeled in the SNNS simulator
with the activation function and the output function. The activation function first
computes the net input of the unit from the weighted output values of prior units. It then
computes the new activation from this net input (and possibly its previous activation).
The output function takes this result to generate the output of the unit. These functions
can be arbitrary C functions linked to the simulator kernel and may be different for each
unit.
Our simulator uses a discrete clock. Time is not modeled explicitly (i.e. there is no
propagation delay or explicit modeling of activation functions varying over time).
Rather, the net executes in update steps, where a_j(t+1) is the activation of unit j one
step after a_j(t).
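As a minimal C sketch of one such update step (the function names and signature here
are illustrative, not the actual kernel API), a unit's new output could be computed as:

/* One synchronous update step for unit j: net input -> activation -> output. */
double step_unit(const double *out,   /* outputs of predecessor units at time t */
                 const double *w,     /* weights of the incoming links */
                 int n,               /* number of predecessors */
                 double act_old,      /* activation of unit j at time t */
                 double (*f_act)(double net, double act),
                 double (*f_out)(double act))
{
    double net = 0.0;
    for (int i = 0; i < n; i++)
        net += w[i] * out[i];         /* weighted outputs of prior units */
    return f_out(f_act(net, act_old)); /* output of unit j at time t+1 */
}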
The SNNS simulator, just like the Rochester Connectionist Simulator (RCS, [God87]),
offers the use of sites as an additional network element. Sites are a simple model of the
dendrites of a neuron which allow a grouping and different treatment of the input signals
of a cell. Each site can have a different site function. This selective treatment of
incoming information allows more powerful connectionist models. The figure below
shows one unit with sites and one without.
Figure: One unit with sites and one without
In the following all the various network elements are described in detail.
Units
Depending on their function in the net, one can distinguish three types of units: the
units whose activations are the problem input for the net are called input units; the units
whose outputs represent the output of the net are called output units. The remaining
units are called hidden units, because they are not visible from the outside (see e.g. the
figure above).
In most neural network models the type correlates with the topological position of the
unit in the net: if a unit does not have input connections but only output connections,
then it is an input unit. If it lacks output connections but has input connections, it is an
output unit. If it has both types of connections it is a hidden unit.
It can, however, be the case that the output of a topologically internal unit is regarded as
part of the output of the network. The IO-type of a unit used in the SNNS simulator has
to be understood in this manner. That is, units can receive input or generate output even
if they are not at the fringe of the network.
Below, all attributes of a unit are listed:
• no: For proper identification, every unit has a number attached to it. This
number defines the order in which the units are stored in the simulator kernel.
• name: The name can be selected arbitrarily by the user. It must not, however,
contain blanks or special characters, and has to start with a letter. It is useful to
select a short name that describes the task of the unit, since the name can be
displayed with the network.
• io-type or io: The IO-type defines the function of the unit within the net. The
following alternatives are possible:
o input: input unit
o output: output unit
o dual: both input and output unit
o hidden: internal, i.e. hidden unit
o special: this type can be used in any way, depending upon the application.
In the standard version of the SNNS simulator, the weights to such units
are not adapted by the learning algorithm.
o special input, special hidden, special output: sometimes it is necessary to
know where in the network a special unit is located. These three types
enable the correlation of the units to the various layers of the network.
• activation: The activation value.
• initial activation or i_act: This variable contains the initial activation value,
present after the initial loading of the net. This initial configuration can be
reproduced by resetting (reset) the net, e.g. to get a defined starting state of the
net.
• output: The output value.
• bias: In contrast to other network simulators where the bias (threshold) of a unit
is simulated by a link weight from a special 'on'-unit, SNNS represents it as a unit
parameter. In the standard version of SNNS the bias determines where the
activation function has its steepest ascent (see e.g. the activation function
Act_logistic). Learning procedures like backpropagation change the bias of a
unit like a weight during training.
• activation function or actFunc: A new activation is computed from the output of
preceding units (usually multiplied by the weights connecting these predecessor
units with the current unit), the old activation of the unit and its bias. When sites
are being used, the network input is computed from the site values.
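For illustration, a logistic activation with bias, in the spirit of SNNS's Act_logistic
(a hypothetical sketch, not the kernel's actual code), could look like this; note how the
bias shifts the point of steepest ascent:

#include <math.h>

/* Logistic activation: net is the weighted sum of predecessor outputs;
   the bias shifts the sigmoid so its steepest ascent lies at net = bias. */
double act_logistic(double net, double bias)
{
    return 1.0 / (1.0 + exp(-(net - bias)));
}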
How to obtain SNNS
The SNNS simulator can be obtained from the download area (http://www-
ra.informatik.uni-tuebingen.de/downloads/SNNS/) or via anonymous ftp (deprecated)
from host
ftp.informatik.uni-tuebingen.de
in the subdirectory
/pub/SNNS
as file
SNNSv4.1.tar.Z (2.6 MB)
or in zipped version
SNNSv4.1.tar.gz (1.6 MB)
Be sure to set the ftp mode to binary before transmission of the files. Also watch out for
possible higher version numbers, patches or Readme files in the above directory
/pub/SNNS. After successful transmission of the file, move it to the directory where you
want to install SNNS, then uncompress and extract the file with the Unix commands
uncompress SNNSv4.1.tar.Z
tar xvf SNNSv4.1.tar
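If you fetched the gzipped version SNNSv4.1.tar.gz instead, the following equivalent
pipeline (assuming gunzip is available on your system) unpacks it in one step:
gunzip -c SNNSv4.1.tar.gz | tar xvf -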
The SNNS distribution includes full source code, installation procedures for supported
machine architectures and some simple examples of trained networks. The PostScript
version of the user manual can be obtained as file
SNNSv4.1.Manual.ps.Z (1.6 MB)
or in 15 parts as files
SNNSv4.1.Manual.part01.ps.Z
...
SNNSv4.1.Manual.part15.ps.Z
SNNS 4.2 for MS-Windows (http://www-ra.informatik.uni-tuebingen.de/downloads/SNNS/Windows/)
On-line SNNS User Manual (version 4.1)
The on-line manual is available at
http://www-ra.informatik.uni-tuebingen.de/SNNS/UserManual/UserManual.html
Input file in text format
The total number of sequences in this file is 78, each 10 residues (a 10-mer) in length.
An example of the sequences in FASTA format:
>IgE_epitope1
AEDEDNQQGQ
>IgE_epitope2
AEEVEEERLK
>IgE_epitope3
AKSSPYQKKT
>IgE_epitope4
APRIVLDVAS
>IgE_epitope5
AVADVTPKQL
>IgE_epitope6
AVITWRALNK
>IgE_epitope7
AVPLYNRFSY
>IgE_epitope8
CDRPPKHSQN
>IgE_epitope9
CSGTKKLSEE
>IgE_epitope10
DGKTGSSTPH
Input file in SNNS pattern format
Input file for SNNS; here the amino acid composition of each sequence is calculated.
Note the first seven lines of the input file: a title line and a generation date, followed by
a blank line, then the number of patterns (78 in this case, since there are 78 sequences in
total), the number of input units (20 in this case, one per amino acid composition value)
and the number of outputs (1, a single value).
SNNS pattern definition file V4.2
generated at Sat Aug 27 16:40:25 2005
No. of patterns : 78
No. of input units : 20
No. of output units : 1
# Input pattern 1:
0.1 0 0.2 0.2 0 0.1 0 0 0 0 0 0.1 0 0.3 0 0 0 0 0 0
# Output pattern 1:
1
# Input pattern 2:
0.1 0 0 0.5 0 0 0 0 0.1 0.1 0 0 0 0 0.1 0 0 0.1 0 0
# Output pattern 2:
1
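The composition values above can be reproduced with a short C program. In the sketch
below the variable names are illustrative; the alphabetical residue order, however, is
taken from the data itself, since it reproduces input pattern 1 exactly:

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *alphabet = "ACDEFGHIKLMNPQRSTVWY";  /* alphabetical order */
    const char *seq = "AEDEDNQQGQ";                 /* IgE_epitope1 */
    size_t len = strlen(seq);

    for (int i = 0; i < 20; i++) {
        int count = 0;
        for (size_t j = 0; j < len; j++)
            if (seq[j] == alphabet[i])
                count++;
        printf("%g ", (double)count / (double)len);
    }
    /* prints: 0.1 0 0.2 0.2 0 0.1 0 0 0 0 0 0.1 0 0.3 0 0 0 0 0 0 */
    printf("\n");
    return 0;
}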
Output file of SNNS
The output file of SNNS is shown below. The header summarizes the run; each pattern
block then lists the input values, the teaching (target) output and, on the last line, the
output computed by SNNS.
SNNS result file V1.4-3D
generated at Tue Aug 30 08:58:52 2005
No. of patterns : 26
No. of input units : 20
No. of output units : 1
startpattern :1
endpattern : 26
input patterns included
teaching output included
#1.1
0.1 0 0.1 0 0 0 0 0.1 0.1 0.2
0 0 0 0 0 0.1 0.1 0 0 0.2
1
0.64832
#2.1
0 0 0.1 0.1 0.3 0.1 0 0 0 0
0 0 0.2 0.1 0 0 0.1 0 0 0
1
0.6276
The outputs of SNNS are processed at different thresholds (0.1 to 1), and parameters like
sensitivity, specificity, and accuracy are calculated. The artificial neural network tries to
classify positive from negative examples. Here we take the example of IgE epitopes and
non-epitopes: we need a data set of IgE epitopes (positive set) and a negative set (non-
epitopes). The network is trained on the training set, validated on one held-out set (to
stop overfitting) and then tested on the left-out testing set. Each set contains an equal
number of sequences. In five-fold cross-validation it looks like this:
Training set    Validation set    Testing set
set 1,2,3       set 4             set 5
set 1,2,5       set 3             set 4
set 1,4,5       set 2             set 3
set 3,4,5       set 1             set 2
set 2,3,4       set 5             set 1
Processing of output data
The output data are processed and interpreted as shown below (Thres = threshold;
Sen = sensitivity; Spe = specificity; Acc = accuracy; PPV = positive prediction value):
Thres Sen Spe Acc PPV
1.0000 0.0000 0.0000 0.0000 0.0000
0.9000 0.0214 0.9929 0.5071 0.7500
0.8000 0.1429 0.9857 0.5643 0.9091
0.7000 0.2571 0.9571 0.6071 0.8571
0.6000 0.5143 0.8357 0.6750 0.7579
0.5000 0.7214 0.7214 0.7214 0.7214
0.4500 0.8071 0.6000 0.7036 0.6686
0.4000 0.8571 0.4714 0.6643 0.6186
0.3000 0.9571 0.3286 0.6429 0.5877
0.2000 1.0000 0.1000 0.5500 0.5263
0.1000 1.0000 0.0071 0.5036 0.5018
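A C sketch of how one row of this table can be computed from the true labels and the
SNNS outputs (function and variable names, and the sample values in main, are
illustrative only):

#include <stdio.h>

/* Count TP/FP/TN/FN at one threshold and print Thres, Sen, Spe, Acc, PPV. */
void evaluate(const int *label, const double *pred, int n, double thres)
{
    int tp = 0, tn = 0, fp = 0, fn = 0;
    for (int i = 0; i < n; i++) {
        if (pred[i] >= thres) {                    /* predicted positive */
            if (label[i] == 1) tp++; else fp++;
        } else {                                   /* predicted negative */
            if (label[i] == 1) fn++; else tn++;
        }
    }
    printf("%.4f %.4f %.4f %.4f %.4f\n", thres,
           tp + fn ? (double)tp / (tp + fn) : 0.0,   /* sensitivity */
           tn + fp ? (double)tn / (tn + fp) : 0.0,   /* specificity */
           (double)(tp + tn) / n,                    /* accuracy    */
           tp + fp ? (double)tp / (tp + fp) : 0.0);  /* PPV         */
}

int main(void)
{
    int label[]  = { 1, 1, 0, 0 };
    double pred[] = { 0.65, 0.63, 0.45, 0.20 };      /* e.g. SNNS outputs */
    evaluate(label, pred, 4, 0.5);
    return 0;
}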
How to use SVMlight efficiently for implementing SVM
Sneh Lata Pandey
Description
SVMlight is an implementation of Support Vector Machines (SVMs) in C. It implements
Vapnik's Support Vector Machine for the problems of pattern recognition, regression,
and learning a ranking function. The algorithm has scalable memory requirements and
can handle problems with many thousands of support vectors efficiently. The software
also provides methods for assessing the generalization performance efficiently: it
includes efficient estimation methods for both error rate and precision/recall.
Source Code and Binaries
The program is free for scientific use. The software must not be further distributed
without prior permission of the author. The implementation was developed on Solaris
2.5 with gcc, but compiles also on SunOS 3.1.4, Solaris 2.7, Linux, IRIX, Windows NT,
and Powermac. The source code is available at the following location:
http://download.joachims.org/svm_light/current/svm_light.tar.gz
If you just want the binaries, you can download them for the following systems:
Solaris: http://download.joachims.org/svm_light/current/svm_light_solaris.tar.gz
Windows: http://download.joachims.org/svm_light/current/svm_light_windows.zip
Cygwin: http://download.joachims.org/svm_light/current/svm_light_cygwin.tar.gz
Linux: http://download.joachims.org/svm_light/current/svm_light_linux.tar.gz
Installation
To install SVMlight you need to download svm_light.tar.gz. Create a new directory:
mkdir svm_light
Move svm_light.tar.gz to this directory and unpack it with
gunzip -c svm_light.tar.gz | tar xvf -
Now execute
make or make all
which compiles the system and creates the two executables
svm_learn (learning module)
svm_classify (classification module)
How to use
This section explains how to use the SVMlight software. SVMlight consists of a learning
module (svm_learn) and a classification module (svm_classify). The classification
module can be used to apply the learned model to new examples. See also the examples
below for how to use svm_learn and svm_classify.
svm_learn is called with the following parameters:
svm_learn [options] example_file model_file
Available options are:
General options:
-? - this help
-v [0..3] - verbosity level (default 1)
Learning options:
-z {c,r,p} - select between classification (c), regression (r), and
preference ranking (p) (see [Joachims, 2002c])
(default classification)
-c float - C: trade-off between training error
and margin (default [avg. x*x]^-1)
-w [0..] - epsilon width of tube for regression
(default 0.1)
-j float - Cost: cost-factor, by which training errors on
positive examples outweigh errors on negative
examples (default 1) (see [Morik et al., 1999])
-b [0,1] - use biased hyperplane (i.e. x*w+b with free threshold b)
instead of unbiased hyperplane (i.e. x*w with b=0) (default 1)
-i [0,1] - remove inconsistent training examples
and retrain (default 0)
Performance estimation options:
-x [0,1] - compute leave-one-out estimates (default 0)
-o ]0..2] - value of rho for XiAlpha-estimator and for pruning
leave-one-out computation (default 1.0)
(see [Joachims, 2002a])
-k [0..100] - search depth for extended XiAlpha-estimator
(default 0)
Kernel options:
-t int - type of kernel function:
0: linear (default)
1: polynomial (s a*b+c)^d
2: radial basis function exp(-gamma ||a-b||^2)
3: sigmoid tanh(s a*b + c)
4: user defined kernel from kernel.h
-d int - parameter d in polynomial kernel
-g float - parameter gamma in rbf kernel
-s float - parameter s in sigmoid/poly kernel
-r float - parameter c in sigmoid/poly kernel
-u string - parameter of user defined kernel
Optimization options (see [Joachims, 1999a], [Joachims, 2002a]):
-q [2..] - maximum size of QP-subproblems (default 10)
-n [2..q] - number of new variables entering the working set
in each iteration (default n = q). Set n<q to prevent
zig-zagging.
-m [5..] - size of cache for kernel evaluations in MB (default 40)
The larger the faster...
-e float - eps: Allow that error for termination criterion
[y [w*x+b] - 1] = eps (default 0.001)
-h [5..] - number of iterations a variable needs to be
optimal before considered for shrinking (default 100)
-f [0,1] - do final optimality check for variables removed by
shrinking. Although this test is usually positive, there
is no guarantee that the optimum was found if the test
is omitted. (default 1)
-y string -> if option is given, reads alphas from the file with the
given name and uses them as starting point. (default 'disabled')
-# int -> terminate optimization, if no progress after this
number of iterations. (default 100000)
Output options:
-l char - file to write predicted labels of unlabeled examples
into after transductive learning
-a char - write all alphas to this file after learning (in the
same order as in the training set)
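For example (the file names and parameter values here are only illustrative), a classifier
with an RBF kernel, gamma 1.0 and trade-off C of 10 would be trained with:
svm_learn -t 2 -g 1.0 -c 10 train.dat model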
A more detailed description of the parameters and how they link to the respective
algorithms is given in the appendix of [Joachims, 2002a].
The input file example_file contains the training examples. The first lines may contain
comments and are ignored if they start with #. Each of the following lines represents one
training example and is of the following format:
<line> .=. <target> <feature>:<value> <feature>:<value> ...
<feature>:<value> # <info>
<target> .=. +1 | -1 | 0 | <float>
<feature> .=. <integer> | "qid"
<value> .=. <float>
<info> .=. <string>
A space character separates the target value and each of the feature/value pairs.
Feature/value pairs MUST be ordered by increasing feature number. Features with value
zero can be skipped. The string <info> can be used to pass additional information to the
kernel (e.g. non feature vector data).
In classification mode, the target value denotes the class of the example. +1 as the target
value marks a positive example, -1 a negative example respectively. So, for example, the
line
-1 1:0.43 3:0.12 9284:0.2 # abcdef
specifies a negative example for which feature number 1 has the value 0.43, feature
number 3 has the value 0.12, feature number 9284 has the value 0.2, and all the other
features have value 0. In addition, the string abcdef is stored with the vector, which can
serve as a way of providing additional information for user defined kernels.
In regression mode, the <target> contains the real-valued target value.
In all modes, the result of svm_learn is the model which is learned from the training
data in example_file. The model is written to model_file. To make predictions on test
examples, svm_classify reads this file. svm_classify is called with the following
parameters:
svm_classify [options] example_file model_file output_file
Available options are:
-h Help.
-v [0..3] Verbosity level (default 2).
7-10
11. -f [0,1] 0: old output format of V1.0
1: output the value of decision function (default)
The test examples in example_file are given in the same format as the training
examples (possibly with 0 as class label). For all test examples in example_file the
predicted values are written to output_file. There is one line per test example in
output_file containing the value of the decision function on that example. For
classification, the sign of this value determines the predicted class. For regression, it is
the predicted value itself, and for ranking the value can be used to order the test
examples. The test example file has the same format as the one for svm_learn. Again,
<class> can have the value zero indicating unknown.
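As a minimal worked example (the file contents are hypothetical, following the format
described above), a training file train.dat for classification could contain:
+1 1:0.43 3:0.12
-1 2:0.88 7:0.30
+1 1:0.10 2:0.05 9:0.65
-1 3:0.57 8:0.21
Training and prediction are then:
svm_learn train.dat model
svm_classify test.dat model predictions
where test.dat holds examples in the same format (the target may be 0 for unknown)
and predictions receives one decision value per line.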
How to use HMMER
Manish Kumar
Installation
Quick installation instructions for configuring, compiling, and installing a source code
distribution.
Download the source tarball (hmmer.tar.gz) from
ftp://ftp.genetics.wustl.edu/pub/eddy/hmmer/ or http://hmmer.wustl.edu/.
Unpack the software:
> tar xvf hmmer.tar.gz
Go into the newly created top-level directory (named hmmer-xx, where xx is a release
number):
> cd hmmer-2.3.2
Configure for your system, and build the programs:
> ./configure
> make
Run the automated testsuite. This is optional. All these tests should pass:
> make check
The programs are now in the src/ subdirectory. The man pages are in the
documentation/man subdirectory. You can manually move or copy all of these to
appropriate locations if you want. You will want the programs to be in your $PATH.
Optionally, you can install the man pages and programs in system-wide directories. If
you are happy with the default (programs in /usr/local/bin/ and man pages in
/usr/local/man/man1), do:
> make install
(You might need to be root when you install, depending on the permissions on your
/usr/local directories.) That's all. Each of these steps is documented in more detail
below, including how to change the default installation directories for make install.
Configuring and Installing a precompiled binary distribution
Alternatively, you can obtain a precompiled binary distribution of HMMER from
http://hmmer.wustl.edu/. Thanks to generous hardware support from many
manufacturers, binary distributions are available for most common UNIX and UNIX-like
OS’s. For example, the distribution for Intel x86/GNU Linux machines is
hmmer-2.3.2.bin.intel-linux.tar.gz.
After you download a binary distribution, unpack it:
> tar xvf hmmer-2.3.2.bin.intel-linux.tar.gz
HMMER is now in the newly created top-level directory (named hmmer-xx, where xx is
a release number). Go into it:
> cd hmmer-2.3.2
You don’t really need to do anything else. The programs are in the binaries/
subdirectory. The man pages are in the documentation/man subdirectory. The PDF
copy of the Userguide is in the top level HMMER directory (Userguide.pdf). You can
manually move or copy all of these to appropriate locations if you want. You will want
the programs to be in your $PATH. However, you’ll often want to install in a more
permanent place. To configure with the default locations (programs in /usr/local/bin/ and
man pages in /usr/local/man/man1) and install everything, do:
> ./configure
> make install
If you want to install in different places than the defaults, keep reading; see the
beginning of the section on running the configure script.
System requirements and portability
HMMER is designed to run on UNIX platforms. The code is POSIX-compliant
ANSI C. You need a UNIX operating system to run it. You need an ANSI C
compiler if you want to build it from source.
Linux and Apple Macintosh OS/X both count as UNIX. Microsoft operating
systems do not. However, HMMER is known to be easy to port to Microsoft
Windows and other non-UNIX operating systems, provided that the platform
supports ANSI C and some reasonable level of POSIX compliance. Running the
testsuite (make check) requires that you have Perl (specifically, /usr/bin/perl).
However, Perl isn't necessary to make HMMER work. HMMER has support for two
kinds of parallelization: POSIX multithreading and PVM (Parallel Virtual Machine)
clustering. Both are optional and not compiled by default; they are enabled by passing
the --enable-threads or --enable-pvm options to the ./configure script before
compilation. The precompiled binary distributions generally support multithreading
but not PVM.
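For example, to build from source with multithreading enabled:
> ./configure --enable-threads
> make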
Tutorial
Here’s a tutorial walk-through of some small projects with HMMER. This section should be
sufficient to get you started on work of your own, and you can (at least temporarily) skip the rest of
the Guide.
The programs in HMMER
There are currently nine programs supported in the HMMER 2 package:
hmmalign      Align sequences to an existing model.
hmmbuild      Build a model from a multiple sequence alignment.
hmmcalibrate  Takes an HMM and empirically determines parameters that are used to
              make searches more sensitive, by calculating more accurate expectation
              value scores (E-values).
hmmconvert    Convert a model file into different formats, including a compact
              HMMER 2 binary format, and "best effort" emulation of GCG profiles.
hmmemit       Emit sequences probabilistically from a profile HMM.
hmmfetch      Get a single model from an HMM database.
hmmindex      Index an HMM database.
hmmpfam       Search an HMM database for matches to a query sequence.
hmmsearch     Search a sequence database for matches to an HMM.
Files used in the tutorial
The subdirectory tutorial/ in the HMMER distribution contains the files used in the
tutorial, as well as a number of examples of various file formats that HMMER reads.
The important files for the tutorial are:
globins50.msf   An alignment file of 50 aligned globin sequences, in GCG MSF format.
globins630.fa   A FASTA format file of 630 unaligned globin sequences.
fn3.sto         An alignment file of fibronectin type III domains, in Stockholm format.
                (From Pfam 8.0.)
rrm.sto         An alignment file of RNA recognition motif domains, in Stockholm
                format. (From Pfam 8.0.)
rrm.hmm         An example HMM, built from rrm.sto.
pkinase.sto     An alignment file of protein kinase catalytic domains, in Stockholm
                format. (From Pfam 8.0.)
Artemia.fa      A FASTA file of brine shrimp globin, which contains nine tandemly
                repeated globin domains.
7LESDROME       A SWISSPROT file of the Drosophila Sevenless sequence, a receptor
                tyrosine kinase with multiple domains.
RU1AHUMAN       A SWISSPROT file of the human U1A protein sequence, which
                contains two RRM domains.
Create a new directory that you can work in, and copy all the files in tutorial there. I'll
assume for the following examples that you've installed the HMMER programs in your
path; if not, you'll need to give a complete path name to the HMMER programs (e.g.
something like /usr/people/eddy/hmmer-2.2/binaries/hmmbuild instead of just
hmmbuild).
Format of input alignment files
HMMER starts with a multiple sequence alignment file that you provide. HMMER can
read alignments in several common formats, including the output of the CLUSTAL
family of programs, Wisconsin/GCG MSF format, the input format for the PHYLIP
phylogenetic analysis programs, and "aligned FASTA" format (where the sequences in a
FASTA file contain gap symbols, so that they are all the same length). HMMER’s native
alignment format is called Stockholm format, the format of the Pfam protein database
that allows extensive markup and annotation. All these formats are documented in a later
section. The software autodetects the alignment file format, so you don’t have to worry
about it. Most of the example alignments in the tutorial are Stockholm files. rrm.sto is a
simple example (generated by stripping all the extra annotation off of a Pfam RNA
recognition motif seed alignment). pkinase.sto and fn3.sto are original Pfam seed
alignments, with all their annotation.
Searching a sequence database with a single profile HMM
One common use of HMMER is to search a sequence database for homologues
of a protein family of interest. You need a multiple sequence alignment of the
sequence family you’re interested in.
Can I build a model from unaligned sequences? In principle, profile HMMs can be
trained from unaligned sequences; however, this functionality is temporarily withdrawn
from HMMER. I recommend CLUSTALW as an excellent, freely available multiple
sequence alignment program. The original hmmt profile HMM training program from
HMMER 1 is also still available, from
ftp://ftp.genetics.wustl.edu/pub/eddy/hmmer/hmmer1.8.4.tar.Z.
build a profile HMM with hmmbuild
Let’s assume you have a multiple sequence alignment of a protein domain or
protein sequence family. To use HMMER to search for additional remote
homologues of the family, you want to first build a profile HMM from the
alignment. The following command builds a profile HMM from the alignment
of 50 globin sequences in globins50.msf:
> hmmbuild globin.hmm globins50.msf
This gives the following output:
hmmbuild - build a hidden Markov model from an alignment
HMMER 2.3 (April 2003)
Copyright (C) 1992-2003 HHMI/Washington University School of Medicine
Freely distributed under the GNU General Public License (GPL)

Alignment file:                  globins50.msf
File format:                     MSF
Search algorithm configuration:  Multiple domain (hmmls)
Model construction strategy:     MAP (gapmax hint: 0.50)
Null model used:                 (default)
Prior used:                      (default)
Sequence weighting method:       G/S/C tree weights
New HMM file:                    globin.hmm

Alignment:           #1
Number of sequences: 50
Number of columns:   308

Determining effective sequence number ... done. [2]
Weighting sequences heuristically ... done.
Constructing model architecture ... done.
Converting counts to probabilities ... done.
Setting model name, etc. ... done. [globins50]

Constructed a profile HMM (length 143)
Average score:  189.04 bits
Minimum score:  -17.62 bits
Maximum score:  234.09 bits
Std. deviation:  53.18 bits

Finalizing model configuration ... done.
Saving model to file ... done.
//
The process takes a second or two. hmmbuild creates a new HMM file called
globin.hmm. This is a human and computer readable ASCII text file, but for now you
don't care. You also don't care for now what all the stuff in the output means; I'll
describe it in detail later. The profile HMM can be treated as a compiled model of your
alignment.
calibrate the profile HMM with hmmcalibrate
This step is optional, but doing it will increase the sensitivity of your database search.
When you search a sequence database, it is useful to get “E-values” (expectation values) in
addition to raw scores. When you see a database hit that scores x, an E-value tells you the number of
hits you would've expected to score x or more just by chance in a sequence database of this size.
HMMER will always estimate an E-value for your hits. However, unless you “calibrate” your
model before a database search, HMMER uses an analytic upper bound calculation that is extremely
conservative. An empirical HMM calibration costs time (about 10% the time of a SWISSPROT
search) but it only has to be done once per model, and can greatly increase the sensitivity of a
database search. To empirically calibrate the E-value calculations for the globin model, type:
> hmmcalibrate globin.hmm
which results in:
hmmcalibrate -- calibrate HMM search statistics
HMMER 2.3 (April 2003)
Copyright (C) 1992-2003 HHMI/Washington University School of Medicine
Freely distributed under the GNU General Public License (GPL)

HMM file:                  globin.hmm
Length distribution mean:  325
Length distribution s.d.:  200
Number of samples:         5000
random seed:               1051632537
histogram(s) saved to:     [not saved]
POSIX threads:             4

HMM    : globins50
mu     : -39.897396
lambda :   0.226086
max    :  -9.567000
//
This might take several minutes, depending on your machine. Go have a cup of coffee. When it is
complete, the relevant parameters are added to the HMM file. (Note from the “POSIX threads: 4”
line that I’m running on 4 CPUs on a quad-processor box. I’m impatient.)
Calibrated HMMER E-values tend to be relatively accurate. E-values of 0.1 or less are, in
general, significant hits. Uncalibrated HMMER E-values are also reliable, erring on the cautious side;
uncalibrated models may miss remote homologues.
Why doesn't hmmcalibrate always give the same output, if I run it on the same HMM?
It's fitting a distribution to the scores obtained from a random (Monte Carlo) simulation
of a small sequence database, and this random sequence database is different each time.
You can make hmmcalibrate give reproducible results by making it initialize its random
number generator with the same seed, using the --seed <x> option, where x is any
positive integer. By default, it chooses a "random" seed, which it reports in the output
header. You can reproduce an hmmcalibrate run by passing this number as the seed.
(Trivia: the default seed is the number of seconds that have passed since the UNIX
"epoch", usually January 1, 1970. hmmcalibrate runs started in the same second will
give identical results. Beware, if you're trying to measure the variance of HMMER's
estimated λ̂ and µ̂ parameters...)
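For example, to reproduce the calibration run shown above (using the seed it reported):
> hmmcalibrate --seed 1051632537 globin.hmm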
search the sequence database with hmmsearch
As an example of searching for new homologues using a profile HMM, we'll use the
globin model to search for globin domains in the example Artemia globin sequence in
Artemia.fa:
> hmmsearch globin.hmm Artemia.fa
The output comes in several sections, and unlike building and calibrating the HMM,
where we treated the HMM as a black box, now you do care about what it's saying. The
first section is the header that tells you what program you ran, on what, and with what
options:
hmmsearch - search a sequence database with a profile HMM
HMMER 2.3 (April 2003)
Copyright (C) 1992-2003 HHMI/Washington University School of Medicine
Freely distributed under the GNU General Public License (GPL)

HMM file:                   globin.hmm [globins50]
Sequence database:          Artemia.fa
per-sequence score cutoff:  [none]
per-domain score cutoff:    [none]
per-sequence Eval cutoff:   <= 10
per-domain Eval cutoff:     [none]

Query HMM:    globins50
Accession:    [none]
Description:  [none]

[HMM has been calibrated; E-values are empirical estimates]
The second section is the sequence top hits list. It is a list of ranked top hits (sorted by E-value,
most significant hit first), formatted in a BLAST-like style:
Scores for complete sequences (score includes all domains):
Sequence Description                   Score    E-value  N
-------- -----------                   -----    -------  ---
S13421   S13421 GLOBIN - BRINE SHRIMP  474.3   1.7e-143    9