BioInfoMark: A Bioinformatic Benchmark Suite for Computer Architecture Research
Yue Li and Tao Li
Intelligent Design of Efficient Architecture Lab (IDEAL)
Department of Electrical and Computer Engineering
University of Florida
yli@ecel.ufl.edu, taoli@ece.ufl.edu
ABSTRACT

The exponential growth in the amount of genomic data has spurred growing interest in large-scale analysis of genetic information. Bioinformatics applications, which explore computational methods to allow researchers to sift through massive biological data and extract useful information, are becoming increasingly important computer workloads. This paper presents BioInfoMark, a benchmark suite of representative bioinformatics applications to facilitate the design and evaluation of computer architectures for these emerging workloads. Currently, the BioInfoMark suite contains 14 highly popular bioinformatics tools and covers the major fields of study in computational biology such as sequence comparison, phylogenetic analysis, protein structure analysis, and molecular dynamics simulation.

The BioInfoMark package includes benchmark source code, input datasets and information for compiling and using the benchmarks. To allow computer architecture researchers to run the BioInfoMark suite on several popular execution-driven simulators, we provide pre-compiled little-endian Alpha ISA binaries and generated simulation points. The BioInfoMark package is freely available and can be downloaded from: http://www.ideal.ece.ufl.edu/BioInfoMark.

1. INTRODUCTION

Major breakthroughs in the field of molecular biology, coupled with advances in genomic technologies, have led to an explosive growth in the area of informatics. For example, the National Center for Biotechnology Information (NCBI) GenBank, an annotated collection of all publicly available DNA sequences, has been growing at an exponential rate.

Figure 1. Growth of GenBank: There are 49,398,852,122 bases from 45,236,251 sequences in GenBank as of Jun. 15 2005 (source: [1]).

As genomics moves forward, having accessible computational methods with which to extract, view, and analyze genomic information becomes essential. Bioinformatics allows researchers to sift through the massive bioinformatics data (e.g., nucleic acid and protein sequences, structures, functions, pathways and interactions) and identify information of interest.

Today, bioinformatics has become an industry and has gained popularity among numerous markets including pharmaceutical, (industrial, agricultural and environmental) biotechnology and homeland security. A number of recent market research reports estimate the size of the bioinformatics market as $176 billion and the market is projected to grow to $243 billion within the next 5 years [2]. In August 2000, IBM announced an initial $100 million investment to spur business development in the life sciences, assuring its prominence as one of the emerging computing markets. There are over 1,300 biotech companies in the US alone and some 1,600 companies in Europe.

Clearly, computer systems that can cost-effectively deliver high performance on computational biology applications play a vital role in the future growth of the bioinformatics market. In order to apply a quantitative approach in computer architecture design, optimization and performance evaluation, researchers first need to identify representative workloads from this emerging application domain.

This paper presents BioInfoMark, a benchmark suite of representative bioinformatics applications to facilitate the design and evaluation of computer architectures for these emerging workloads. Currently, the BioInfoMark suite contains 14 highly popular bioinformatics tools and covers the major fields of study in computational biology such as sequence comparison, phylogenetic analysis, protein structure analysis, and molecular dynamics simulation. Sequence comparison finds similarities between two or more DNA or protein sequences. Phylogeny explores the ancestral relationship among a set of genes or organisms. Protein structure analysis (a) finds the similarities between three-dimensional protein structures and (b) predicts the shape of a protein (e.g., primary, secondary, and tertiary structure) given its amino acid sequence. Molecular dynamics explores the interactions among biomolecules.

Compared with a recent study reported in [3], our independent work covers many more bioinformatics tools in terms of quantity and diversity. To allow computer architecture researchers to run the BioInfoMark suite on several popular execution-driven simulators (e.g. SimpleScalar, SimAlpha, and M5), we put additional effort into providing pre-compiled little-endian Alpha ISA binaries and generating simulation points. A detailed, quantitative workload characterization of the BioInfoMark benchmarks on the Pentium 4 microarchitecture can be found in [4].
The rest of this paper is organized as follows. Section 2 provides an introductory background on biology and a brief review of bioinformatics study areas. Section 3 describes the BioInfoMark suite, including benchmark functionality, input datasets, benchmark compilation and execution, the pre-compiled Alpha binaries, and the generated simulation points [5]. Section 4 concludes the paper and outlines our future work.

2. BACKGROUND

To help readers better understand the BioInfoMark benchmarks, we first provide an introductory background on biology and illustrate the major areas of bioinformatics.

2.1 Introduction: DNA, Gene and Proteins

One of the fundamental principles of biology is that within each cell, DNA that comprises the genes encodes RNA which in turn produces the proteins that regulate all of the biological processes within an organism.

DNA is a double chain of simpler molecules called nucleotides, tied together in a double helical structure (Figure 2). The nucleotides are distinguished by a nitrogen base that can be of four kinds: adenine (A), cytosine (C), guanine (G) and thymine (T). Adenine (A) always bonds to thymine (T) whereas cytosine (C) always bonds to guanine (G), forming base pairs. A DNA molecule can be specified uniquely by listing its sequence of nucleotides, or base pairs. Proteins are molecules that accomplish most of the functions of a living cell, determining its shape and structure. A protein is a linear sequence of molecules called amino acids. Twenty different amino acids are commonly found in proteins. Similar to DNA, proteins are conveniently represented as a string of letters expressing their sequence of amino acids.

Figure 2. DNA molecule.

Once a protein is produced, it folds into a three-dimensional shape. The positions of the central atoms, called carbon-alpha (Cα), of the amino acids of a protein define its primary structure. If a contiguous subsequence of Cα atoms follows some predefined pattern, they are classified as a secondary structure, such as alpha-helix or beta-sheet. The relative positioning of the secondary structures defines the tertiary structure. The overall shape of all chains of a protein then defines the quaternary structure.

Figure 3. Three dimensional structure of human foetal deoxyhaemoglobin (PDB id = 1FDH).

2.2 Bioinformatics Problems

In this section, we illustrate the major problems in bioinformatics, including sequence analysis, phylogeny, protein structure analysis/prediction and molecular dynamics.

2.2.1 Sequence Analysis

Sequence analysis is perhaps the most commonly performed task in bioinformatics. Sequence analysis can be defined as the problem of finding which parts of the sequences (nucleotide or amino acid sequences) are similar and which parts are different. By comparing sequences, researchers can gain crucial understanding of their significance and functionality: high sequence similarity usually implies significant functional or structural similarity while sequence differences hold the key information regarding diversity and evolution.

The most commonly used sequence analysis technique is pairwise sequence comparison. A sequence can be transformed to another sequence with the help of three edit operations. Each edit operation can insert a new letter, delete an existing letter, or replace an existing letter with a new one. The alignment of two sequences is defined by the edit operations that transform one into the other. This is usually represented by writing one on top of the other. Insertions and deletions (i.e., gaps) are represented by the dash symbol ("-"). The following example illustrates an alignment between the sequences A = "GAATTCAGTA" and B = "GGATCGTTA". The objective is to match identical subsequences as far as possible (or equivalently use as few edit operations as possible). In the example, the aligned sequences match in seven positions.

    Sequence A  GAATTCAGT-A
                 R D  D  I
    Sequence B  GGA-TC-GTTA

Figure 4. Alignment of two sequences (The aligned sequences match in seven positions. One replace, two delete, and one insert operations, shown by letters R, D, and I, are used.)

Alignment of sequences is considered in two different but related classes: if the entire sequences are aligned, then it is called a global alignment. If subsequences of two sequences are aligned, then it is called a local alignment.
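The edit-operation view of pairwise alignment can be illustrated with a short Python sketch. It is not taken from any BioInfoMark benchmark: the function names are hypothetical, every insert, delete, and replace is assumed to cost one unit, and real tools such as Blast and Fasta use heuristics and scoring matrices instead. The first function counts the operations in an alignment that is already written as two gapped rows; the second computes the minimum number of edit operations with the classic dynamic-programming recurrence.

```python
def summarize_alignment(row_a: str, row_b: str) -> dict:
    """Count matches and edit operations in two equal-length gapped rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    counts = {"match": 0, "replace": 0, "delete": 0, "insert": 0}
    for x, y in zip(row_a, row_b):
        if x == "-":
            counts["insert"] += 1    # letter present only in the second row
        elif y == "-":
            counts["delete"] += 1    # letter present only in the first row
        elif x == y:
            counts["match"] += 1
        else:
            counts["replace"] += 1
    return counts


def edit_distance(a: str, b: str) -> int:
    """Minimum number of unit-cost insert/delete/replace operations turning a into b."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match (cost 0) or replace (cost 1)
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Rows and sequences from Figure 4.
    print(summarize_alignment("GAATTCAGT-A", "GGA-TC-GTTA"))
    print(edit_distance("GAATTCAGTA", "GGATCGTTA"))
```

For the rows of Figure 4 the first function reports seven matches, one replace, two deletes, and one insert, and the second returns 4, so under unit costs the alignment shown in the figure uses the minimum number of edit operations.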
Multiple sequence alignment compares more than two sequences: all sequences are aligned on top of each other. Each column is the alignment of one letter from each sequence. The following example illustrates a multiple alignment among the sequences A = "AGGTCAGTCTAGGAC", B = "GGACTGAGGTC", and C = "GAGGACTGGCTACGGAC".

    Sequence A  -AGGTCAGTCTA-GGAC
    Sequence B  --GGACTGA----GGTC
    Sequence C  GAGGACTGGCTACGGAC

Figure 5. Multiple alignment of three DNA sequences A, B, and C.
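The column structure of a multiple alignment can be checked with a few lines of Python. This is only an illustrative sketch with hypothetical helper names, not code from Clustal W or any other benchmark: it verifies that removing the gaps from each row of Figure 5 recovers the original sequence, and it lists the columns in which all three sequences carry the same letter.

```python
# Gapped rows from Figure 5 and the original, ungapped sequences.
ROWS = {
    "A": "-AGGTCAGTCTA-GGAC",
    "B": "--GGACTGA----GGTC",
    "C": "GAGGACTGGCTACGGAC",
}
ORIGINALS = {
    "A": "AGGTCAGTCTAGGAC",
    "B": "GGACTGAGGTC",
    "C": "GAGGACTGGCTACGGAC",
}


def degap(row: str) -> str:
    """Remove gap symbols to recover the underlying sequence."""
    return row.replace("-", "")


def conserved_columns(rows: list) -> list:
    """Return 0-based indices of gap-free columns where every row has the same letter."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    return [
        i for i, column in enumerate(zip(*rows))
        if "-" not in column and len(set(column)) == 1
    ]


if __name__ == "__main__":
    for name, row in ROWS.items():
        print(name, degap(row) == ORIGINALS[name])
    print(conserved_columns(list(ROWS.values())))
```

Each row is 17 columns wide, and the fully conserved columns are the ones a progressive aligner would treat as the strongest evidence of shared ancestry.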
2.2.2 Molecular Phylogeny Analysis

Molecular phylogeny infers lines of ancestry of genes or organisms. Phylogeny analysis provides crucial understanding about the origins of life and the homology of various species on earth. Phylogenetic trees are composed of nodes and branches. Each leaf node corresponds to a gene or an organism. Internal nodes represent inferred ancestors. The evolutionary distance between two genes or organisms is computed as a function of the length of the branches between their nodes and their common ancestors.

Figure 6. Evolutionary relationships between fish models.

Figure 6 shows evolutionary relationships between fish models. This evolutionary tree (based on data from Fishes of the World by J. S. Nelson) [6] illustrates that the last common ancestor of medaka and zebrafish lived more than 110 million years (Myr) ago.

2.2.3 Protein Structure Analysis

Two protein substructures are called similar if their Cα atoms can be mapped to nearby points after translation and rotation of one of the proteins. This can also be considered as a one-to-one mapping of amino acids. Usually, structural similarity requires that the amino acid pairs that are considered similar have the same secondary structure type. Structural similarities among proteins provide insight regarding their functional relationship. Figure 7 presents the structural similarity of two proteins.

Figure 7. The structural similarity between two proteins. (source http://cl.sdsc.edu/)

Three-dimensional structures of only a small subset of proteins are known, as determining them requires expensive wet-lab experimentation. Computationally determining the structure of proteins is an important problem as it accelerates the experimentation step and reduces expert analysis. Usually, the relationship among chemical components of proteins (i.e. their amino acid sequences) is used in determining their unique three-dimensional native structures.

2.2.4 Molecular Dynamics

In the broadest sense, molecular dynamics is concerned with molecular motion. Motion is inherent to all chemical processes. Simple vibrations, like bond stretching and angle bending, give rise to IR spectra. Chemical reactions, hormone-receptor binding, and other complex processes are associated with many kinds of intra- and intermolecular motions.

Molecular dynamics allows the study of the dynamics of large macromolecules, including biological systems such as proteins, nucleic acids (DNA, RNA), and membranes. Dynamic events may play a key role in controlling processes which affect functional properties of biomolecules. Drug design is commonly used in the pharmaceutical industry to test properties of a molecule at the computer without the need to synthesize it (which is far more expensive).

2.3 Bioinformatics Databases

A bioinformatics database is an organized body of persistent data (e.g. nucleotide and amino acid sequences, three-dimensional structure). Thanks to the human genome project, there has been a growing interest both in the public and private sectors towards creating bioinformatics databases. At the end of 2002, there were more than 300 molecular biology databases available worldwide. This section provides a brief overview of several popular and publicly available bioinformatics databases.

An important class of bioinformatics databases is the sequence database. The largest sequence database is NCBI/GenBank [1], which collects all known nucleotide and protein sequences. Other major data sources are EMBL (European Molecular Biology Lab) [7] and DDBJ (DNA Data Bank of Japan) [8]. Two major sources of protein sequences and structures are PDB (Protein Data Bank) [9] and SWISS-PROT [10]. PDB contains the protein structures determined by NMR and X-ray crystallography techniques. SWISS-PROT is a curated protein sequence database which provides a high level of annotation such as description of protein function, its domain structure, post-translational modification and other useful information.

3. THE BIOINFORMATICS BENCHMARK SUITE: BIOINFOMARK

To allow computer architecture researchers to explore and evaluate their designs on these emerging applications, we
developed a suite of representative bioinformatics workloads, BioInfoMark. Currently, the BioInfoMark package contains 14 applications, which cover a variety of important bioinformatics tools ranging from sequence comparison to molecular dynamics. This section describes the selected programs, which can be classified using the categories we introduced in Section 2.2.

3.1 Sequence Analysis Benchmarks

Blast: The Blast (Basic Local Alignment Search Tool) programs [11] are a set of heuristic methods that are used to search sequence databases for local alignments to a query sequence. The Blast programs are written in C. BlastP and BlastN are the versions of Blast for protein and nucleotide sequences respectively. All Blast programs can be executed using the following command line: ./blastall -p <option> -i <query file> -d <database file> -o <output file>. The option is "blastp" for searching protein sequences or "blastn" for searching nucleotide sequences. The query file is the file which includes the nucleotide or protein sequence for search. The database file is the database which will be searched. With the provided dataset, the benchmark can be invoked as follows: ./blastall -p blastp -i target.txt -d nr -o output, where target.txt is the homo sapiens hereditary haemochromatosis protein sequence and nr is the non-redundant protein sequence database from NCBI.

Fasta: Similar to Blast, Fasta [12] is a collection of local similarity search programs for sequence databases. While Fasta and Blast both do pairwise local alignment, their underlying algorithms are different. Fasta is programmed in C and can be invoked using the following command line: ./fasta34 <query file> <database file>. The query file and database file have the same meaning as those of Blast. With the provided dataset, the Fasta benchmark can be invoked as: ./fasta34 ./qrhuld.aa ../database/nr > ./output.txt, where qrhuld.aa is a query file that contains the human LDL receptor precursor protein. The nr is the same database mentioned above.

Clustal W: Clustal W [13] is a multiple sequence alignment program for nucleotides or amino acids. It first finds a phylogenetic tree for the underlying sequences. It then progressively aligns them one by one based on their ancestral relationship. Clustal W is programmed in C and can be executed as: ./clustalw -batch -infile=<input file> -outfile=<output file>, where the input file includes multiple DNA or protein sequences and the output file records the results after alignment. The command line used to invoke Clustal W with the provided dataset is: ./clustalw -batch -infile=./input.ext -outfile=./output.ext, where input.ext is a query file that includes 317 Ureaplasma's gene sequences from the NCBI Bacteria genomes database. The output.ext stores the alignment results among those 317 sequences.

Hmmer: Hmmer [14] employs hidden Markov models (profile HMMs) for aligning multiple sequences. Profile HMMs are statistical models of multiple sequence alignments. They capture position-specific information about how conserved each column of the alignment is, and which residues are likely. Hmmer is programmed in the C language. It includes several applications such as hmmbuild, hmmcalibrate and hmmsearch. Among these applications, hmmsearch is widely used to search a sequence database for matches to an HMM. The syntax for invoking the hmmsearch benchmark is: ./hmmsearch <input file> <database file>. With the provided dataset, this benchmark can be executed as: ./hmmsearch ./globin.hmm ./Artemia.fa, where globin.hmm is the example HMM built from the alignment file of 50 aligned globin sequences and Artemia.fa is a FASTA file of brine shrimp globin, which contains nine tandemly repeated globin domains.

Glimmer: Glimmer (Gene Locator and Interpolated Markov Modeler) [15] finds genes in microbial DNA. It uses interpolated Markov models (IMMs) to identify coding and noncoding regions in the DNA. Glimmer is written in the C++ language and it can be executed as ./glimmer2 <input sequence> <model file>. The command to invoke this benchmark on the given dataset is ./glimmer2 ./NC_000907.fna ./glimmer.icm, where NC_000907.fna contains the genome of the bacterium Haemophilus influenzae and glimmer.icm is the collection file of Markov models.

Emboss: Emboss (European Molecular Biology Open Software Suite) [16] is a software package programmed in C, which contains a wide variety of programs ranging from sequence alignment and protein motif identification to domain analysis and codon usage analysis. Diffseq, megamerger and shuffleseq are three representatives of Emboss. Diffseq takes two overlapping, nearly identical sequences and reports the differences between them, together with any features that overlap with these regions. The syntax of this benchmark execution is ./diffseq <seq1> <seq2> -wordsize <size> <output>, where seq1 and seq2 are the two sequences for comparison. The wordsize refers to the word length the program uses when matching sequence words. The output records the differences found between the two sequences. With the provided dataset, diffseq can be invoked as: ./diffseq tembl:ap000504 tembl:af129756 -wordsize 6 report, where ap000504 and af129756 are two homo sapiens genes in the nucleic acid database tembl. Megamerger takes two overlapping nucleic acid sequences and merges them into one sequence. It has the same syntax and input parameters as those of diffseq. Shuffleseq takes a sequence as input and outputs one or more sequences whose order has been randomly shuffled. It can be invoked with the following command line: ./shuffleseq -shuffle 1000 tembl:af129756 af129756.fasta. This means that the program will shuffle the example nucleic acid sequence 1000 times and produce the output file af129756.fasta.

3.2 Molecular Phylogeny Analysis Benchmarks

Phylip: Phylip (PHYLogeny Inference Package) [17] is a package of programs for inferring phylogenies (evolutionary trees). Methods that are available in the package include parsimony, distance matrix, maximum likelihood, bootstrapping, and consensus trees. Data types that can be handled include molecular sequences, gene frequencies, restriction sites and fragments, distance matrices, and discrete
characters. The phylip package is programmed in C. Dnapenny and promlk are the typical applications in phylip. Dnapenny is a program that finds all of the most parsimonious trees of the input data. Promlk implements the maximum likelihood method for protein amino acid sequences. Both can run either from the command line or interactively. To provide deterministic execution, we provide an execution script to invoke the two benchmarks.

3.3 Protein Structure Analysis Benchmarks

DALI: Dali [18] performs pairwise structure comparison and also finds the structural neighbors of a protein by comparing it against the proteins in the PDB. By default, Dali is accessible only through the network, and is too complex and large to install. The DaliLite distribution, programmed in Perl and Fortran 77, was therefore developed for local and efficient use. It has the core algorithmic functionality of the Dali server. The input is two sets of atomic coordinates of proteins in PDB format. With the provided dataset, Dali can be invoked as: ./DaliLite -pairwise ./pdb/1DPS.pdb ./pdb/2AV8.pdb, where 1DPS.pdb is a type of DNA-binding protein and 2AV8.pdb is a type of oxidoreductase, an enzyme that catalyzes an oxidation-reduction reaction.

CE: CE (Combinatorial Extension) [19] finds structural similarities between the primary structures of pairs of proteins. CE first aligns small fragments from two proteins. Later, these fragments are combined and extended to find larger similar substructures. CE is written in C. It can be invoked as ./CE - ./1hba.pdb - ./4hhb.pdb - ./scratch, where 1hba.pdb and 4hhb.pdb are different types of hemoglobin, which is used to transport oxygen. Scratch is a directory to store temporary files generated during execution.

Predator: Predator [20] predicts the secondary structure of a protein sequence or a set of sequences based on their amino acid sequences. Predator is also programmed in C. It can be launched using the following command: ./predator -a -l <seq> -f<output>. With the provided dataset, Predator can be executed as: ./predator -a -l eukaryota_100.seq -feukaryota_100.out, where eukaryota_100.seq includes 100 Eukaryote protein sequences from the NCBI genomes database and eukaryota_100.out is the result of the secondary structure prediction.

3.4 Molecular Dynamics Simulation Benchmarks

Gamess: Gamess (General Atomic and Molecular Electronic Structure System) [21] is a general ab initio quantum chemistry package. Gamess can compute SCF wave functions and a variety of molecular properties, ranging from simple dipole moments to frequency dependent hyperpolarizabilities. Gamess is written in Fortran 77. Gamess can be invoked as ./runall >& ./runall.log. It will use 37 short but diverse examples named EXAM*.INP as the input dataset.

3.5 BioInfoMark Alpha Binaries for Simulation based Studies

To allow computer architecture researchers to simulate the BioInfoMark benchmarks using execution-driven and cycle-accurate architecture research frameworks (such as SimpleScalar, SimAlpha, and M5), we made an extra effort to produce the Alpha binaries of the majority of BioInfoMark benchmarks. We have tested all pre-compiled Alpha binaries (with static link option) using the SimpleScalar sim-outorder simulator. The pre-compiled binaries are available in the BioInfoMark package.

Table 1. Benchmarks with pre-compiled Alpha binaries (all binaries have been successfully tested on the SimpleScalar sim-outorder simulator)

    Benchmark    Input Dataset
    fasta34      human LDL receptor precursor protein, NCBI nr database
    clustalw     317 Ureaplasma's gene sequences from the NCBI Bacteria genomes database
    hmmsearch    a profile HMM built from the alignment of 50 globin sequences, uniprot_sprot.dat from SWISS-PROT
    glimmer2     18 bacteria complete genomes from the NCBI genomes database
    diffseq      nucleic acid database EMBL
    megamerger   nucleic acid database EMBL
    shuffleseq   nucleic acid database EMBL
    dnapenny     ribosomal RNAs from bacteria and mitochondria
    promlk       protein amino acid sequences of 17 species ranging from a deep branching bacterium to humans
    predator     100 Eukaryote protein sequences from NCBI genomes database

3.6 Simulation Points of BioInfoMark Workloads

Our earlier study [4] shows that bioinformatics applications can execute billions of instructions before completion. Therefore, it is infeasible to simulate entire benchmark execution using detailed cycle-accurate simulators. Recently, the computer architecture research community has widely adopted the SimPoint [5] methodology as an efficient way to simulate the representative workload execution phases. We used the SimPoint framework developed by Calder et al. to generate the simulation points of the BioInfoMark benchmarks listed in Table 1. Since the total number of instructions of different benchmarks varies significantly, we used the criteria suggested in [22] to determine the size of the interval for each individual benchmark. Table 2 lists the interval size as well as the simulation points for each benchmark.

4. CONCLUSIONS

Bioinformatics applications represent increasingly important computer workloads. In order to apply a quantitative approach in computer architecture design and performance evaluation, there is a clear need to develop a benchmark suite of representative bioinformatics applications. This paper presents a group of programs representative of bioinformatics software. These programs include popular tools used for sequence alignments, molecular phylogeny analysis, protein structure prediction and molecular dynamics. The benchmark suite
BioInfoMark is freely available and can be downloaded from www.ideal.ece.ufl.edu/BioInfoMark. In the future, we will explore integrated software/hardware techniques to optimize the performance of bioinformatics applications.

Table 2. The simulation points of BioInfoMark Alpha binaries

    Benchmark    Interval (M)   Simulation Points
    fasta34      400            412,698,242,326,810,961,503,487,354,1051,932,8,832,792,459,988,54,107,482,808,136,554,996,588
    clustalw     850            24,918,906,701,702,674,793,395,252,845,719,883,857,585,858,651,381,817
    hmmsearch    680            340,674,330,695,711,75,619,599,54,677,551,672,794,40,404,682,618,370,457,951
    glimmer2     20             581,1,45,70,32,13,2,17,1567,8,404,26,84,1572,6,21,494,1109,40,1240
    diffseq      35             705,680,444,597,442,255,662,10,3,977,288,443,1006,827,990,1004,343,927,707,256,98,1054,689,964,780,958,824,942
    megamerger   35             703,559,254,596,679,443,377,825,259,1067,3,442,901,255,310,593,94,773,976,758,889,781,1058,818,5,560
    shuffleseq   300            318,1,718,981,280,115,18,355,261,365,1019,457,194,1018,196,43,254,406,226,775,842,454,894,986,776
    dnapenny     140            95,1048,1043,225,672,1,61,413,185,627,51,1014,1056,2,110,911,200,1003,777,970,639,1053,40,438,576,597
    promlk       320            21,911,255,1,320,548,876,713,176,319,813,332,818,932,445,807,909,773,580,854,224,719,700,969,472,715,1008,661,375,210
    predator     700            272,247,758,195,1143,536,535,166,585,81,233,296,482,640,429,406,88,343,203,403,479,955,37,971

5. REFERENCES

[1] http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
[2] Bioinformation Market Study for Washington Technology Center, Alta Biomedical Group LLC, www.altabiomedical.com, June 2003.
[3] K. Albayraktaroglu et al., BioBench: A Benchmark Suite of Bioinformatics Applications, International Symposium on Performance Analysis of Software and Systems, 2005.
[4] Y. Li, T. Li, T. Kahveci and J. Fortes, Workload Characterization of Bioinformatics Applications on Pentium 4 Architecture, In Proceedings of the International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, 2005.
[5] E. Perelman, G. Hamerly and B. Calder, Picking Statistically Valid and Early Simulation Points, In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2003.
[6] J. S. Nelson, Fishes of the World, John Wiley & Sons, Inc., New York, 1994.
[7] European Molecular Biology Laboratory, http://www.embl-heidelberg.de
[8] DNA Data Bank of Japan, http://www.ddbj.nig.ac.jp/
[9] The RCSB Protein Data Bank, http://www.rcsb.org/pdb/
[10] The UniProt/Swiss-Prot Database, http://www.ebi.ac.uk/swissprot/
[11] S. Altschul, W. Gish, W. Miller, E. W. Meyers and D. J. Lipman, Basic Local Alignment Search Tool, Journal of Molecular Biology, vol. 215, no. 3, pages 403-410, 1990.
[12] W. R. Pearson and D. J. Lipman, Improved tools for biological sequence comparison, Proc. Natl. Acad. Sci., 85 (1988), 3244-3248.
[13] J. D. Thompson, D. G. Higgins, and T. J. Gibson, Clustal W: Improving the Sensitivity of Progressive Multiple Sequence Alignment through Sequence Weighting, Positions-specific Gap Penalties and Weight Matrix Choice, Nucleic Acids Research, vol. 22, no. 22, pages 4673-4680, 1994.
[14] S. R. Eddy, Profile Hidden Markov Models, Bioinformatics Review, vol. 14, no. 9, pages 755-763, 1998.
[15] S. Salzberg, A. Delcher, S. Kasif, and O. White, Microbial Gene Identification using Interpolated Markov Models, Nucleic Acids Research, vol. 26, no. 2, pages 544-548, 1998.
[16] P. Rice, I. Longden, and A. Bleasby, EMBOSS: The European Molecular Biology Open Software Suite, Trends in Genetics, vol. 16, no. 6, pages 276-277, 2000.
[17] J. Felsenstein, PHYLIP - Phylogeny Inference Package (version 3.2), Cladistics, 5: 164-166, 1989.
[18] L. Holm and J. Park, DaliLite Workbench for Protein Structure, Bioinformatics Applications Note, vol. 16, no. 6, pages 566-567, 2000.
[19] I. N. Shindyalov and P. E. Bourne, Protein Structure Alignment by Incremental Combinatorial Extension (CE) of the Optimal Path, Protein Engineering, vol. 11, no. 99, pages 739-747, 1998.
[20] D. Frishman and P. Argos, 75% Accuracy in Protein Secondary Structure Prediction, Proteins, vol. 27, pages 329-335, 1997.
[21] M. W. Schmidt et al., General Atomic and Molecular Electronic Structure System, Journal of Comput. Chem., vol. 14, pages 1347-1363, 1993.
[22] G. Hamerly, E. Perelman and B. Calder, How to Use SimPoint to Pick Simulation Points, ACM SIGMETRICS Performance Evaluation Review, 2004.