CHAP7.DOC.doc

Description of Major Machine Learning Software Packages

How to use SNNS for implementing ANNs
Sudipto Saha

Introduction

SNNS (Stuttgart Neural Network Simulator) is a software simulator for neural networks on Unix workstations, developed at the Institute for Parallel and Distributed High Performance Systems (IPVR) at the University of Stuttgart. The goal of the SNNS project is to create an efficient and flexible simulation environment for research on, and application of, neural nets.

The SNNS simulator consists of two main components:
1) a simulator kernel written in C
2) a graphical user interface under X11R4 or X11R5

The simulator kernel operates on the internal network data structures of the neural nets and performs all operations of learning and recall. It can also be used without the other parts, as a C program embedded in custom applications. It supports arbitrary network topologies and, like RCS, supports the concept of sites. SNNS can be extended by the user with user-defined activation functions, output functions, site functions and learning procedures, which are written as simple C programs and linked to the simulator kernel.

The graphical user interface XGUI (X Graphical User Interface), built on top of the kernel, gives a 2D and a 3D graphical representation of the neural networks and controls the kernel during the simulation run. In addition, the 2D user interface has an integrated network editor which can be used to directly create, manipulate and visualize neural nets in various ways.

Building Blocks of Neural Nets

The following paragraphs describe a generic model for those neural nets that can be generated by the SNNS simulator. The basic principles and the terminology used in dealing with the graphical interface are also briefly introduced. A more general and more detailed introduction to connectionism can be found, e.g., in [RM86].
For readers fluent in German, the most comprehensive and up-to-date book on neural network learning algorithms, simulation systems and neural hardware is probably [Zel94].
A network consists of units and directed, weighted links (connections) between them. In analogy to activation passing in biological neurons, each unit receives a net input that is computed from the weighted outputs of prior units with connections leading to this unit.

Figure: A small network with three layers of units

The actual information processing within the units is modeled in the SNNS simulator with the activation function and the output function. The activation function first computes the net input of the unit from the weighted output values of prior units. It then computes the new activation from this net input (and possibly its previous activation). The output function takes this result to generate the output of the unit. These functions can be arbitrary C functions linked to the simulator kernel and may be different for each unit.

Our simulator uses a discrete clock. Time is not modeled explicitly (i.e. there is no propagation delay or explicit modeling of activation functions varying over time). Rather, the net executes in discrete update steps, where a(t+1) is the activation of a unit one step after a(t).

The SNNS simulator, just like the Rochester Connectionist Simulator (RCS, [God87]), offers the use of sites as an additional network element. Sites are a simple model of the dendrites of a neuron which allow a grouping and different treatment of the input signals of a cell. Each site can have a different site function. This selective treatment of incoming information allows more powerful connectionist models.

Figure: One unit with sites and one without

In the following, all the various network elements are described in detail.
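As an illustrative sketch (not SNNS code), one update step for a single unit with a logistic activation function and an identity output function might look like this. All names here are hypothetical; the bias shift in the logistic is an assumption based on the unit-attribute description of Act_logistic later in this chapter.

```python
import math

# Hypothetical sketch of one update step for a single SNNS-style unit.
# Net input = weighted sum of predecessor outputs; the logistic is
# assumed to be shifted by the unit's bias (steepest ascent at net == bias).
def act_logistic(net, bias):
    return 1.0 / (1.0 + math.exp(-(net - bias)))

def update_unit(weights, outputs, bias):
    net = sum(w * o for w, o in zip(weights, outputs))
    act = act_logistic(net, bias)
    return act  # identity output function: output equals activation

# Example: a unit with two predecessor units
out = update_unit([0.5, -0.3], [1.0, 0.2], bias=0.0)
```

In a real SNNS net these functions are C routines linked into the kernel and can differ per unit; this is only the shape of the computation.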
Units

Depending on their function in the net, one can distinguish three types of units: the units whose activations are the problem input for the net are called input units; the units whose output represents the output of the net, output units. The remaining units are called hidden units, because they are not visible from the outside (see e.g. figure ).

In most neural network models the type correlates with the topological position of the unit in the net: if a unit has no input connections, only output connections, then it is an input unit. If it lacks output connections but has input connections, it is an output unit. If it has both types of connections, it is a hidden unit. It can, however, be the case that the output of a topologically internal unit is regarded as part of the output of the network. The IO-type of a unit used in the SNNS simulator has to be understood in this manner. That is, units can receive input or generate output even if they are not at the fringe of the network.

Below, all attributes of a unit are listed:

• no: For proper identification, every unit has a number attached to it. This number defines the order in which the units are stored in the simulator kernel.
• name: The name can be selected arbitrarily by the user. It must not, however, contain blanks or special characters, and has to start with a letter. It is useful to select a short name that describes the task of the unit, since the name can be displayed with the network.
• io-type or io: The IO-type defines the function of the unit within the net. The following alternatives are possible:
  o input: input unit
  o output: output unit
  o dual: both input and output unit
  o hidden: internal, i.e. hidden unit
  o special: this type can be used in any way, depending upon the application. In the standard version of the SNNS simulator, the weights to such units are not adapted in the learning algorithm (see paragraph ).
  o special input, special hidden, special output: sometimes it is necessary to know where in the network a special unit is located. These three types enable the correlation of the units to the various layers of the network.
• activation: The activation value.
• initial activation or i_act: This variable contains the initial activation
value, present after the initial loading of the net. This initial configuration can be reproduced by resetting (reset) the net, e.g. to get a defined starting state of the net.
• output: The output value.
• bias: In contrast to other network simulators, where the bias (threshold) of a unit is simulated by a link weight from a special 'on'-unit, SNNS represents it as a unit parameter. In the standard version of SNNS the bias determines where the activation function has its steepest ascent (see e.g. the activation function Act_logistic). Learning procedures like backpropagation change the bias of a unit, like a weight, during training.
• activation function or actFunc: A new activation is computed from the output of preceding units, usually multiplied by the weights connecting these predecessor units with the current unit, the old activation of the unit and its bias. When sites are being used, the network input is computed from the site values.

How to obtain SNNS

The SNNS simulator can be obtained from the download area (http://www-ra.informatik.uni-tuebingen.de/downloads/SNNS/) or via anonymous ftp (deprecated) from host ftp.informatik.uni-tuebingen.de in the subdirectory /pub/SNNS as file SNNSv4.1.tar.Z (2.6 MB) or in gzipped version SNNSv4.1.tar.gz (1.6 MB). Be sure to set the ftp mode to binary before transmission of the files. Also watch out for possible higher version numbers, patches or Readme files in the above directory /pub/SNNS.

After successful transmission of the file, move it to the directory where you want to install SNNS, then uncompress and extract the file with the Unix commands

uncompress SNNSv4.1.tar.Z
tar xvf SNNSv4.1.tar

The SNNS distribution includes full source code, installation procedures for supported machine architectures and some simple examples of trained networks. The PostScript version of the user manual can be obtained as file SNNSv4.1.Manual.ps.Z (1.6 MB) or in 15 parts as files SNNSv4.1.Manual.part01.ps.Z ... SNNSv4.1.Manual.part15.ps.Z
SNNS 4.2 for MS-Windows (http://www-ra.informatik.uni-tuebingen.de/downloads/SNNS/Windows/)

On-line SNNS User Manual (version 4.1)

The on-line manual is available at http://www-ra.informatik.uni-tuebingen.de/SNNS/UserManual/UserManual.html

Input file in txt

The input file contains 78 sequences in total, each 10 residues (a 10-mer) in length. An example of sequences in FASTA format:

>IgE_epitope1
AEDEDNQQGQ
>IgE_epitope2
AEEVEEERLK
>IgE_epitope3
AKSSPYQKKT
>IgE_epitope4
APRIVLDVAS
>IgE_epitope5
AVADVTPKQL
>IgE_epitope6
AVITWRALNK
>IgE_epitope7
AVPLYNRFSY
>IgE_epitope8
CDRPPKHSQN
>IgE_epitope9
CSGTKKLSEE
>IgE_epitope10
DGKTGSSTPH

Input file in SNNS pattern format

For the SNNS input file, the amino acid composition of each sequence is calculated. Note the first lines of the input file: two header lines, followed by two blank lines, then the number of patterns (78 in this case, since there are 78 sequences in total), the number of input units (20 in this case, one per amino acid in the composition) and the number of output units (1, a single value).
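The composition-based encoding can be sketched in a few lines of Python (illustrative only, not part of SNNS; the alphabetical residue ordering is an assumption):

```python
# Illustrative helper (not part of SNNS): encode a peptide as its
# 20-value amino acid composition. Alphabetical one-letter ordering
# of the 20 residues is assumed here.
AA = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    n = len(seq)
    return [round(seq.count(a) / n, 4) for a in AA]

# First epitope from the FASTA example above:
comp = composition("AEDEDNQQGQ")
# -> [0.1, 0.0, 0.2, 0.2, 0.0, 0.1, 0.0, 0.0, 0.0, 0.0,
#     0.0, 0.1, 0.0, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

Under this ordering the vector for IgE_epitope1 matches input pattern 1 in the pattern file shown next (one A, two D, two E, one G, one N, three Q out of ten residues).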
SNNS pattern definition file V4.2
generated at Sat Aug 27 16:40:25 2005

No. of patterns : 78
No. of input units : 20
No. of output units : 1

# Input pattern 1:
0.1 0 0.2 0.2 0 0.1 0 0 0 0 0 0.1 0 0.3 0 0 0 0 0 0
# Output pattern 1:
1
# Input pattern 2:
0.1 0 0 0.5 0 0 0 0 0.1 0.1 0 0 0 0 0.1 0 0 0.1 0 0
# Output pattern 2:
1

Output file of SNNS

The output file of SNNS is shown below; it starts with a summary of information. For each pattern, the input values, the teaching output (1) and the output of SNNS (e.g. 0.64832) are listed.

SNNS result file V1.4-3D
generated at Tue Aug 30 08:58:52 2005

No. of patterns : 26
No. of input units : 20
No. of output units : 1
startpattern : 1
endpattern : 26
input patterns included
teaching output included

#1.1
0.1 0 0.1 0 0 0 0 0.1 0.1 0.2 0 0 0 0 0 0.1 0.1 0 0 0.2
1
0.64832
#2.1
0 0 0.1 0.1 0.3 0.1 0 0 0 0 0 0 0.2 0.1 0 0 0.1 0 0 0
1
0.6276

The outputs of SNNS are processed at different thresholds (0.1 to 1), and parameters like sensitivity, specificity, and accuracy are calculated. The artificial neural network tries to separate positive from negative examples. Here we take the example of IgE epitopes and non-epitopes: we need a data set of IgE epitopes (positive set) and a negative
set (non-epitopes). The network will classify this training set; it is validated on one set (to stop overfitting) and then tested on the left-out testing set. Each set contains an equal number of sequences. In five-fold cross-validation it looks like this:

Training set    Validation set    Testing set
set 1,2,3       set 4             set 5
set 1,2,5       set 3             set 4
set 1,4,5       set 2             set 3
set 3,4,5       set 1             set 2
set 2,3,4       set 5             set 1

Processing of output data

The output data are processed and interpreted as shown (Thres=Threshold; Sen=Sensitivity; Spe=Specificity; Acc=Accuracy; PPV=positive prediction value):

Thres    Sen      Spe      Acc      PPV
1.0000   0.0000   0.0000   0.0000   0.0000
0.9000   0.0214   0.9929   0.5071   0.7500
0.8000   0.1429   0.9857   0.5643   0.9091
0.7000   0.2571   0.9571   0.6071   0.8571
0.6000   0.5143   0.8357   0.6750   0.7579
0.5000   0.7214   0.7214   0.7214   0.7214
0.4500   0.8071   0.6000   0.7036   0.6686
0.4000   0.8571   0.4714   0.6643   0.6186
0.3000   0.9571   0.3286   0.6429   0.5877
0.2000   1.0000   0.1000   0.5500   0.5263
0.1000   1.0000   0.0071   0.5036   0.5018
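The per-threshold processing can be sketched as follows (illustrative Python, not SNNS output; all names here are hypothetical):

```python
# Illustrative sketch: derive Sen/Spe/Acc/PPV from network outputs at
# one threshold. Labels are 1 (epitope) / 0 (non-epitope).
def metrics(scores, labels, thres):
    tp = sum(1 for s, y in zip(scores, labels) if s >= thres and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < thres and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < thres and y == 0)
    fp = sum(1 for s, y in zip(scores, labels) if s >= thres and y == 0)
    sen = tp / (tp + fn) if tp + fn else 0.0   # sensitivity
    spe = tn / (tn + fp) if tn + fp else 0.0   # specificity
    acc = (tp + tn) / len(labels)              # accuracy
    ppv = tp / (tp + fp) if tp + fp else 0.0   # positive prediction value
    return sen, spe, acc, ppv

# SNNS outputs from the result file above, plus a made-up negative score:
sen, spe, acc, ppv = metrics([0.64832, 0.6276, 0.2], [1, 1, 0], thres=0.5)
```

Sweeping thres from 1.0 down to 0.1 over the full test set would reproduce a table of the shape shown above.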
How to use SVMlight efficiently for implementing SVM
Sneh Lata Pandey

Description

SVMlight is an implementation in C of Vapnik's Support Vector Machine for the problems of pattern recognition, regression, and learning a ranking function. The algorithm has scalable memory requirements and can handle problems with many thousands of support vectors efficiently. The software also provides methods for assessing the generalization performance efficiently; it includes two efficient estimation methods for both error rate and precision/recall.

Source Code and Binaries

The program is free for scientific use. The software must not be further distributed without prior permission of the author. The implementation was developed on Solaris 2.5 with gcc, but compiles also on SunOS 3.1.4, Solaris 2.7, Linux, IRIX, Windows NT, and Powermac.

The source code is available at the following location:
http://download.joachims.org/svm_light/current/svm_light.tar.gz

If you just want the binaries, you can download them for the following systems:
Solaris: http://download.joachims.org/svm_light/current/svm_light_solaris.tar.gz
Windows: http://download.joachims.org/svm_light/current/svm_light_windows.zip
Cygwin: http://download.joachims.org/svm_light/current/svm_light_cygwin.tar.gz
Linux: http://download.joachims.org/svm_light/current/svm_light_linux.tar.gz

Installation

To install SVMlight you need to download svm_light.tar.gz. Create a new directory:
mkdir svm_light
Move svm_light.tar.gz to this directory and unpack it with
gunzip -c svm_light.tar.gz | tar xvf -
Now execute
make
or
make all
which compiles the system and creates the two executables
svm_learn (learning module)
svm_classify (classification module)

How to use

This section explains how to use the SVMlight software. SVMlight consists of a learning module (svm_learn) and a classification module (svm_classify).
The classification module can be used to apply the learned model to new examples. See also the examples below for how to use svm_learn and svm_classify.

svm_learn is called with the following parameters:

svm_learn [options] example_file model_file
Available options are:

General options:
-? - this help
-v [0..3] - verbosity level (default 1)

Learning options:
-z {c,r,p} - select between classification (c), regression (r), and preference ranking (p) (see [Joachims, 2002c]) (default classification)
-c float - C: trade-off between training error and margin (default [avg. x*x]^-1)
-w [0..] - epsilon width of tube for regression (default 0.1)
-j float - Cost: cost-factor by which training errors on positive examples outweigh errors on negative examples (default 1) (see [Morik et al., 1999])
-b [0,1] - use biased hyperplane (i.e. x*w+b) instead of unbiased hyperplane (i.e. x*w) (default 1)
-i [0,1] - remove inconsistent training examples and retrain (default 0)

Performance estimation options:
-x [0,1] - compute leave-one-out estimates (default 0)
-o ]0..2] - value of rho for XiAlpha-estimator and for pruning leave-one-out computation (default 1.0) (see [Joachims, 2002a])
-k [0..100] - search depth for extended XiAlpha-estimator (default 0)

Kernel options:
-t int - type of kernel function:
  0: linear (default)
  1: polynomial (s a*b+c)^d
  2: radial basis function exp(-gamma ||a-b||^2)
  3: sigmoid tanh(s a*b + c)
  4: user defined kernel from kernel.h
-d int - parameter d in polynomial kernel
-g float - parameter gamma in rbf kernel
-s float - parameter s in sigmoid/poly kernel
-r float - parameter c in sigmoid/poly kernel
-u string - parameter of user defined kernel

Optimization options (see [Joachims, 1999a], [Joachims, 2002a]):
-q [2..] - maximum size of QP-subproblems (default 10)
-n [2..q] - number of new variables entering the working set in each iteration (default n = q). Set n < q to prevent zig-zagging.
-m [5..] - size of cache for kernel evaluations in MB (default 40). The larger the faster...
-e float - eps: allow that error for termination criterion [y [w*x+b] - 1] = eps (default 0.001)
-h [5..] - number of iterations a variable needs to be optimal before considered for shrinking (default 100)
-f [0,1] - do final optimality check for variables removed by shrinking. Although this test is usually positive, there is no guarantee that the optimum was found if the test is omitted. (default 1)
-y string - if this option is given, reads alphas from the given file and uses them as starting point (default 'disabled')
-# int - terminate optimization if no progress after this number of iterations (default 100000)

Output options:
-l char - file to write predicted labels of unlabeled examples into after transductive learning
-a char - write all alphas to this file after learning (in the same order as in the training set)

A more detailed description of the parameters and how they link to the respective algorithms is given in the appendix of [Joachims, 2002a].

The input file example_file contains the training examples. The first lines may contain comments; they are ignored if they start with #. Each of the following lines represents one training example and is of the following format:

<line> .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>
<target> .=. +1 | -1 | 0 | <float>
<feature> .=. <integer> | "qid"
<value> .=. <float>
<info> .=. <string>

A space character separates the target value and each of the feature/value pairs. Feature/value pairs MUST be ordered by increasing feature number. Features with value zero can be skipped. The string <info> can be used to pass additional information to the kernel (e.g. non-feature-vector data). In classification mode, the target value denotes the class of the example: +1 marks a positive example, -1 a negative example.
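As an illustration of this grammar, here is a small hypothetical helper (not part of SVMlight) that formats one training example, sorting the features into increasing order and skipping zero values as the format requires:

```python
# Hypothetical helper (not part of SVMlight): format one training
# example as "<target> <feature>:<value> ..." with features in
# increasing order and zero-valued features skipped.
def svmlight_line(target, features, info=None):
    pairs = " ".join(f"{f}:{v}" for f, v in sorted(features.items()) if v != 0)
    line = f"{target:+d} {pairs}"
    if info is not None:
        line += f" # {info}"   # optional <info> string for the kernel
    return line

line = svmlight_line(-1, {3: 0.12, 1: 0.43, 9284: 0.2, 7: 0}, info="abcdef")
# -> "-1 1:0.43 3:0.12 9284:0.2 # abcdef"
```

One such line per example, written to example_file, is all svm_learn needs for classification mode.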
So, for example, the line

-1 1:0.43 3:0.12 9284:0.2 # abcdef

specifies a negative example for which feature number 1 has the value 0.43, feature number 3 has the value 0.12, feature number 9284 has the value 0.2, and all the other features have value 0. In addition, the string abcdef is stored with the vector, which can serve as a way of providing additional information for user-defined kernels. The order of the predictions is the same as in the training data.

In regression mode, the <target> contains the real-valued target value.

In all modes, the result of svm_learn is the model which is learned from the training data in example_file. The model is written to model_file. To make predictions on test examples, svm_classify reads this file. svm_classify is called with the following parameters:

svm_classify [options] example_file model_file output_file

Available options are:
-h - Help.
-v [0..3] - Verbosity level (default 2).
-f [0,1] - 0: old output format of V1.0; 1: output the value of decision function (default)

The test examples in example_file are given in the same format as the training examples (possibly with 0 as class label). For all test examples in example_file the predicted values are written to output_file. There is one line per test example in output_file, containing the value of the decision function on that example. For classification, the sign of this value determines the predicted class. For regression, it is the predicted value itself, and for ranking the value can be used to order the test examples. The test example file has the same format as the one for svm_learn. Again, <class> can have the value zero, indicating unknown.
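A minimal sketch of applying the sign rule to decision values read from output_file (illustrative only; the one-value-per-line layout is as described above):

```python
# Illustrative: map svm_classify decision values to predicted classes
# by their sign (classification mode).
def predict(decision_values):
    return [1 if v > 0 else -1 for v in decision_values]

preds = predict([0.73, -0.12, 1.5])
# -> [1, -1, 1]
```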
How to use HMMER
Manish Kumar

Installation

Quick installation instructions: configuring, compiling, and installing a source code distribution.

Download the source tarball (hmmer.tar.gz) from ftp://ftp.genetics.wustl.edu/pub/eddy/hmmer/ or http://hmmer.wustl.edu/.

Unpack the software:
> tar xvf hmmer.tar.gz

Go into the newly created top-level directory (named hmmer-xx, where xx is a release number):
> cd hmmer-2.3.2

Configure for your system, and build the programs:
> ./configure
> make

Run the automated testsuite. This is optional. All these tests should pass:
> make check

The programs are now in the src/ subdirectory. The man pages are in the documentation/man subdirectory. You can manually move or copy all of these to appropriate locations if you want. You will want the programs to be in your $PATH.

Optionally, you can install the man pages and programs in system-wide directories. If you are happy with the default (programs in /usr/local/bin/ and man pages in /usr/local/man/man1), do:
> make install

(You might need to be root when you install, depending on the permissions on your /usr/local directories.)

That’s all. Each of these steps is documented in more detail below, including how to change the default installation directories for make install.

Configuring and installing a precompiled binary distribution

Alternatively, you can obtain a precompiled binary distribution of HMMER from http://hmmer.wustl.edu/. Thanks to generous hardware support from many manufacturers, binary distributions are available for most common UNIX and UNIX-like OSs. For example, the distribution for Intel x86/GNU Linux machines is hmmer-2.3.2.bin.intel-linux.tar.gz. After you download a binary distribution, unpack it:
> tar xvf hmmer.bin.intel-linux.tar.gz

HMMER is now in the newly created top-level directory (named hmmer-xx, where xx is a release number). Go into it:
> cd hmmer-2.3.2

You don’t really need to do anything else. The programs are in the binaries/ subdirectory.
The man pages are in the documentation/man subdirectory. The PDF
copy of the Userguide is in the top-level HMMER directory (Userguide.pdf). You can manually move or copy all of these to appropriate locations if you want. You will want the programs to be in your $PATH.

However, you’ll often want to install in a more permanent place. To configure with the default locations (programs in /usr/local/bin/ and man pages in /usr/local/man/man1) and install everything, do:
> ./configure
> make install

If you want to install in different places than the defaults, keep reading; see the beginning of the section on running the configure script.

System requirements and portability

HMMER is designed to run on UNIX platforms. The code is POSIX-compliant ANSI C. You need a UNIX operating system to run it. You need an ANSI C compiler if you want to build it from source. Linux and Apple Macintosh OS/X both count as UNIX. Microsoft operating systems do not. However, HMMER is known to be easy to port to Microsoft Windows and other non-UNIX operating systems, provided that the platform supports ANSI C and some reasonable level of POSIX compliance.

Running the testsuite (make check) requires that you have Perl (specifically, /usr/bin/perl). However, Perl isn’t necessary to make HMMER work.

HMMER has support for two kinds of parallelization: POSIX multithreading and PVM (Parallel Virtual Machine) clustering. Both are optional, not compiled by default; they are enabled by passing the --enable-threads or --enable-pvm options to the ./configure script before compilation. The precompiled binary distributions generally support multithreading but not PVM.

Tutorial

Here’s a tutorial walk-through of some small projects with HMMER. This section should be sufficient to get you started on work of your own, and you can (at least temporarily) skip the rest of the Guide.

The programs in HMMER

There are currently nine programs supported in the HMMER 2 package:

hmmalign      Align sequences to an existing model.
hmmbuild      Build a model from a multiple sequence alignment.
hmmcalibrate  Takes an HMM and empirically determines parameters that are used to make searches more sensitive, by calculating more accurate expectation value scores (E-values).
hmmconvert    Convert a model file into different formats, including a compact HMMER 2 binary format, and “best effort” emulation of GCG profiles.
hmmemit       Emit sequences probabilistically from a profile HMM.
hmmfetch      Get a single model from an HMM database.
hmmindex      Index an HMM database.
hmmpfam       Search an HMM database for matches to a query sequence.
hmmsearch     Search a sequence database for matches to an HMM.

Files used in the tutorial

The subdirectory /tutorial in the HMMER distribution contains the files used in the tutorial, as well as a number of examples of various file formats that HMMER reads. The important files for the tutorial are:

globins50.msf  An alignment file of 50 aligned globin sequences, in GCG MSF format.
globins630.fa  A FASTA format file of 630 unaligned globin sequences.
fn3.sto        An alignment file of fibronectin type III domains, in Stockholm format. (From Pfam 8.0.)
rrm.sto        An alignment file of RNA recognition motif domains, in Stockholm format. (From Pfam 8.0.)
rrm.hmm        An example HMM, built from rrm.sto.
pkinase.sto    An alignment file of protein kinase catalytic domains, in Stockholm format. (From Pfam 8.0.)
Artemia.fa     A FASTA file of brine shrimp globin, which contains nine tandemly repeated globin domains.
7LESDROME      A SWISSPROT file of the Drosophila Sevenless sequence, a receptor tyrosine kinase with multiple domains.
RU1AHUMAN      A SWISSPROT file of the human U1A protein sequence, which contains two RRM domains.

Create a new directory that you can work in, and copy all the files in tutorial there. I’ll assume for the following examples that you’ve installed the HMMER programs in your path; if not,
you’ll need to give a complete path name to the HMMER programs (e.g. something like /usr/people/eddy/hmmer-2.2/binaries/hmmbuild instead of just hmmbuild).

Format of input alignment files

HMMER starts with a multiple sequence alignment file that you provide. HMMER can read alignments in several common formats, including the output of the CLUSTAL family of programs, Wisconsin/GCG MSF format, the input format for the PHYLIP phylogenetic analysis programs, and “aligned FASTA” format (where the sequences in a FASTA file contain gap symbols, so that they are all the same length). HMMER’s native alignment format is called Stockholm format, the format of the Pfam protein database that allows extensive markup and annotation. All these formats are documented in a later section. The software autodetects the alignment file format, so you don’t have to worry about it.

Most of the example alignments in the tutorial are Stockholm files. rrm.sto is a simple example (generated by stripping all the extra annotation off of a Pfam RNA recognition motif seed alignment). pkinase.sto and fn3.sto are original Pfam seed alignments, with all their annotation.

Searching a sequence database with a single profile HMM

One common use of HMMER is to search a sequence database for homologues of a protein family of interest. You need a multiple sequence alignment of the sequence family you’re interested in.

Can I build a model from unaligned sequences? In principle, profile HMMs can be trained from unaligned sequences; however, this functionality is temporarily withdrawn from HMMER. I recommend CLUSTALW as an excellent, freely available multiple sequence alignment program. The original hmmt profile HMM training program from HMMER 1 is also still available, from ftp://ftp.genetics.wustl.edu/pub/eddy/hmmer/hmmer-1.8.4.tar.Z.

Build a profile HMM with hmmbuild

Let’s assume you have a multiple sequence alignment of a protein domain or protein sequence family.
To use HMMER to search for additional remote homologues of the family, you want to first build a profile HMM from the alignment. The following command builds a profile HMM from the alignment of 50 globin sequences in globins50.msf:

> hmmbuild globin.hmm globins50.msf

This gives the following output:

hmmbuild - build a hidden Markov model from an alignment
HMMER 2.3 (April 2003)
Copyright (C) 1992-2003 HHMI/Washington University School of Medicine
Freely distributed under the GNU General Public License (GPL)

Alignment file:                  globins50.msf
File format:                     MSF
Search algorithm configuration:  Multiple domain (hmmls)
Model construction strategy:     MAP (gapmax hint: 0.50)
Null model used:                 (default)
Prior used:                      (default)
Sequence weighting method:       G/S/C tree weights
New HMM file:                    globin.hmm

Alignment:           #1
Number of sequences: 50
Number of columns:   308

Determining effective sequence number ... done. [2]
Weighting sequences heuristically ... done.
Constructing model architecture ... done.
Converting counts to probabilities ... done.
Setting model name, etc. ... done. [globins50]

Constructed a profile HMM (length 143)
Average score:  189.04 bits
Minimum score:  -17.62 bits
Maximum score:  234.09 bits
Std. deviation:  53.18 bits

Finalizing model configuration ... done.
Saving model to file ... done.
//

The process takes a second or two. hmmbuild creates a new HMM file called globin.hmm. This is a human- and computer-readable ASCII text file, but for now you don’t care. You also don’t care for now what all the stuff in the output means; I’ll describe it in detail later. The profile HMM can be treated as a compiled model of your alignment.

Calibrate the profile HMM with hmmcalibrate

This step is optional, but doing it will increase the sensitivity of your database search. When you search a sequence database, it is useful to get “E-values” (expectation values) in addition to raw scores. When you see a database hit that scores x, an E-value tells you the number of hits you would’ve expected to score x or more just by chance in a sequence database of this size. HMMER will always estimate an E-value for your hits. However, unless you “calibrate” your model before a database search, HMMER uses an analytic upper bound calculation that is extremely conservative. An empirical HMM calibration costs time (about 10% the time of a SWISSPROT search) but it only has to be done once per model, and can greatly increase the sensitivity of a database search.
To empirically calibrate the E-value calculations for the globin model, type:

> hmmcalibrate globin.hmm

which results in:

hmmcalibrate -- calibrate HMM search statistics
HMMER 2.3 (April 2003)
Copyright (C) 1992-2003 HHMI/Washington University School of Medicine
Freely distributed under the GNU General Public License (GPL)

HMM file:                 globin.hmm
Length distribution mean: 325
Length distribution s.d.: 200
Number of samples:        5000
random seed:              1051632537
histogram(s) saved to:    [not saved]
POSIX threads:            4

HMM    : globins50
mu     : -39.897396
lambda :   0.226086
max    :  -9.567000
//

This might take several minutes, depending on your machine. Go have a cup of coffee. When it is complete, the relevant parameters are added to the HMM file. (Note from the “POSIX threads: 4” line that I’m running on 4 CPUs on a quad-processor box. I’m impatient.)

Calibrated HMMER E-values tend to be relatively accurate. E-values of 0.1 or less are, in general, significant hits. Uncalibrated HMMER E-values are also reliable, erring on the cautious side; uncalibrated models may miss remote homologues.

Why doesn’t hmmcalibrate always give the same output, if I run it on the same HMM? It’s fitting a distribution to the scores obtained from a random (Monte Carlo) simulation of a small sequence database, and this random sequence database is different each time. You can make hmmcalibrate give reproducible results by making it initialize its random number
generator with the same seed, using the --seed <x> option, where x is any positive integer. By default, it chooses a “random” seed, which it reports in the output header. You can reproduce an hmmcalibrate run by passing this number as the seed. (Trivia: the default seed is the number of seconds that have passed since the UNIX “epoch”, usually January 1, 1970. hmmcalibrate runs started in the same second will give identical results. Beware, if you’re trying to measure the variance of HMMER’s estimated λ̂ and µ̂ parameters...)

Search the sequence database with hmmsearch

As an example of searching for new homologues using a profile HMM, we’ll use the globin model to search for globin domains in the example Artemia globin sequence in Artemia.fa:

> hmmsearch globin.hmm Artemia.fa

The output comes in several sections, and unlike building and calibrating the HMM, where we treated the HMM as a black box, now you do care about what it’s saying. The first section is the header that tells you what program you ran, on what, and with what options:

hmmsearch - search a sequence database with a profile HMM
HMMER 2.3 (April 2003)
Copyright (C) 1992-2003 HHMI/Washington University School of Medicine
Freely distributed under the GNU General Public License (GPL)

HMM file:                  globin.hmm [globins50]
Sequence database:         Artemia.fa
per-sequence score cutoff: [none]
per-domain score cutoff:   [none]
per-sequence Eval cutoff:  <= 10
per-domain Eval cutoff:    [none]

Query HMM:   globins50
Accession:   [none]
Description: [none]
  [HMM has been calibrated; E-values are empirical estimates]

The second section is the sequence top hits list. It is a list of ranked top hits (sorted by E-value, most significant hit first), formatted in a BLAST-like style:

Scores for complete sequences (score includes all domains):
Sequence  Description                    Score     E-value  N
S13421    S13421 GLOBIN - BRINE SHRIMP   474.3    1.7e-143
