BioInfoMark: A Bioinformatic Benchmark Suite for Computer Architecture Research

Yue Li and Tao Li
Intelligent Design of Efficient Architecture Lab (IDEAL)
Department of Electrical and Computer Engineering
University of Florida
yli@ecel.ufl.edu, taoli@ece.ufl.edu

ABSTRACT

The exponential growth in the amount of genomic data has spurred growing interest in large scale analysis of genetic information. Bioinformatics applications, which explore computational methods to allow researchers to sift through the massive biological data and extract useful information, are becoming increasingly important computer workloads. This paper presents BioInfoMark, a benchmark suite of representative bioinformatics applications to facilitate the design and evaluation of computer architectures for these emerging workloads. Currently, the BioInfoMark suite contains 14 highly popular bioinformatics tools and covers the major fields of study in computational biology such as sequence comparison, phylogenetic analysis, protein structure analysis, and molecular dynamics simulation.

The BioInfoMark package includes benchmark source code, input datasets and information for compiling and using the benchmarks. To allow computer architecture researchers to run the BioInfoMark suite on several popular execution driven simulators, we provide pre-compiled little-endian Alpha ISA binaries and generated simulation points. The BioInfoMark package is freely available and can be downloaded from: http://www.ideal.ece.ufl.edu/BioInfoMark.

1. INTRODUCTION

The major breakthroughs in the field of molecular biology, coupled with advances in genomic technologies, have led to explosive growth in the area of informatics. For example, the National Center for Biotechnology Information (NCBI) GenBank, an annotated collection of all publicly available DNA sequences, has been growing at an exponential rate.

Figure 1. Growth of GenBank: There are 49,398,852,122 bases from 45,236,251 sequences in GenBank as of Jun. 15 2005 (source: [1]).

As genomics moves forward, having accessible computational methods with which to extract, view, and analyze genomic information becomes essential. Bioinformatics allows researchers to sift through the massive bioinformatics data (e.g., nucleic acid and protein sequences, structures, functions, pathways and interactions) and identify information of interest.

Today, bioinformatics has become an industry and has gained popularity among numerous markets including pharmaceutical, (industrial, agricultural and environmental) biotechnology and homeland security. A number of recent market research reports estimate the size of the bioinformatics market as $176 billion and the market is projected to grow to $243 billion within the next 5 years [2]. In August 2000, IBM announced an initial $100 million investment to spur business development in the life sciences, assuring its prominence as one of the emerging computing markets. There are over 1,300 biotech companies in the US alone and some 1,600 companies in Europe.

Clearly, computer systems that can cost-effectively deliver high performance on computational biology applications play a vital role in the future growth of the bioinformatics market. In order to apply a quantitative approach in computer architecture design, optimization and performance evaluation, researchers first need to identify representative workloads from this emerging application domain.

This paper presents BioInfoMark, a benchmark suite of representative bioinformatics applications to facilitate the design and evaluation of computer architectures for these emerging workloads. Currently, the BioInfoMark suite contains 14 highly popular bioinformatics tools and covers the major fields of study in computational biology such as sequence comparison, phylogenetic analysis, protein structure analysis, and molecular dynamics simulation. Sequence comparison finds similarities between two or more DNA or protein sequences. Phylogeny explores the ancestral relationship among a set of genes or organisms. Protein structure analysis (a) finds the similarities between three-dimensional protein structures and (b) predicts the shape of a protein (e.g., primary, secondary, and tertiary structure) given its amino acid sequence. Molecular dynamics explores the interactions among biomolecules.

Compared with a recent study reported in [3], our independent work covers many more bioinformatics tools in terms of quantity and diversity. To allow computer architecture researchers to run the BioInfoMark suite on several popular execution driven simulators (e.g. SimpleScalar, SimAlpha, and M5), we put additional effort into providing pre-compiled little-endian Alpha ISA binaries and generating simulation points. A detailed, quantitative workload characterization of the BioInfoMark benchmarks on the Pentium 4 microarchitecture can be found in [4].
The rest of this paper is organized as follows. Section 2 provides an introductory background on biology and a brief review of bioinformatics study areas. Section 3 describes the BioInfoMark suite, including benchmark functionality, input datasets, benchmark compilation and execution, the pre-compiled Alpha binaries, and the generated simulation points [5]. Section 4 concludes the paper and outlines our future work.

2. BACKGROUND

To help readers understand the BioInfoMark benchmarks better, we first provide an introductory background on biology and illustrate the major areas of bioinformatics.

2.1 Introduction: DNA, Gene and Proteins

One of the fundamental principles of biology is that within each cell, DNA that comprises the genes encodes RNA which in turn produces the proteins that regulate all of the biological processes within an organism.

DNA is a double chain of simpler molecules called nucleotides, tied together in a double helix structure (Figure 2). The nucleotides are distinguished by a nitrogen base that can be of four kinds: adenine (A), cytosine (C), guanine (G) and thymine (T). Adenine (A) always bonds to thymine (T) whereas cytosine (C) always bonds to guanine (G), forming base pairs. A DNA can be specified uniquely by listing its sequence of nucleotides, or base pairs. Proteins are molecules that accomplish most of the functions of a living cell, determining its shape and structure. A protein is a linear sequence of molecules called amino acids. Twenty different amino acids are commonly found in proteins. Similar to DNA, proteins are conveniently represented as a string of letters expressing their sequence of amino acids.

Figure 2. DNA molecule.

Once a protein is produced, it folds into a three-dimensional shape. The positions of the central atoms, called carbon-alpha (Cα), of the amino acids of a protein define its primary structure. If a contiguous subsequence of Cα atoms follows some predefined pattern, they are classified as a secondary structure, such as alpha-helix or beta-sheet. The relative positioning of the secondary structures defines the tertiary structure. The overall shape of all chains of a protein then defines the quaternary structure.

Figure 3. Three dimensional structure of human foetal deoxyhaemoglobin (PDB id = 1FDH).

2.2 Bioinformatics Problems

In this section, we illustrate the major problems in bioinformatics, including sequence analysis, phylogeny, protein structure analysis/prediction and molecular dynamics.

2.2.1 Sequence Analysis

Sequence analysis is perhaps the most commonly performed task in bioinformatics. Sequence analysis can be defined as the problem of finding which parts of the sequences (nucleotide or amino acid sequences) are similar and which parts are different. By comparing sequences, researchers can gain crucial understanding of their significance and functionality: high sequence similarity usually implies significant functional or structural similarity while sequence differences hold the key information regarding diversity and evolution.

The most commonly used sequence analysis technique is pairwise sequence comparison. A sequence can be transformed to another sequence with the help of three edit operations. Each edit operation can insert a new letter, delete an existing letter, or replace an existing letter with a new one. The alignment of two sequences is defined by the edit operations that transform one into the other. This is usually represented by writing one on top of the other. Insertions and deletions (i.e., gaps) are represented by the dash symbol (“-”). The following example illustrates an alignment between the sequences A= “GAATTCAGTA” and B= “GGATCGTTA”. The objective is to match identical subsequences as far as possible (or equivalently use as few edit operations as possible). In the example, the aligned sequences match in seven positions.

Sequence A   GAATTCAGT-A
              R D  D  I
Sequence B   GGA-TC-GTTA

Figure 4. Alignment of two sequences (The aligned sequences match in seven positions. One replace, two delete, and one insert operations, shown by letters R, D, and I, are used.)

Alignment of sequences is considered in two different but related classes: If the entire sequences are aligned, then it is called a global alignment. If subsequences of two sequences are aligned, then it is called a local alignment.

Multiple sequence alignment compares more than two sequences: all sequences are aligned on top of each other. Each column is the alignment of one letter from each sequence. The
following example illustrates a multiple alignment among the sequences A= “AGGTCAGTCTAGGAC”, B= “GGACTGAGGTC”, and C= “GAGGACTGGCTACGGAC”.

Sequence A   -AGGTCAGTCTA-GGAC
Sequence B   --GGACTGA----GGTC
Sequence C   GAGGACTGGCTACGGAC

Figure 5. Multiple alignment of three DNA sequences A, B, and C.

2.2.2 Molecular Phylogeny Analysis

Molecular phylogeny infers lines of ancestry of genes or organisms. Phylogeny analysis provides crucial understanding about the origins of life and the homology of various species on earth. Phylogenetic trees are composed of nodes and branches. Each leaf node corresponds to a gene or an organism. Internal nodes represent inferred ancestors. The evolutionary distance between two genes or organisms is computed as a function of the length of the branches between their nodes and their common ancestors.

Figure 6. Evolutionary relationships between fish models.

Figure 6 shows evolutionary relationships between fish models. This evolutionary tree (based on data from Fishes of the World by J. S. Nelson) [6] illustrates that the last common ancestor of medaka and zebrafish lived more than 110 million years (Myr) ago.

2.2.3 Protein Structure Analysis

Two protein substructures are called similar if their Cα atoms can be mapped to nearby points after translation and rotation of one of the proteins. This can also be considered as a one-to-one mapping of amino acids. Usually, structural similarity requires that the amino acid pairs that are considered similar have the same secondary structure type. Structural similarities among proteins provide insight regarding their functional relationship. Figure 7 presents the structural similarity of two proteins.

Figure 7. The structural similarity between two proteins. (source http://cl.sdsc.edu/)

Three-dimensional structures of only a small subset of proteins are known, as determining them requires expensive wet-lab experimentation. Computationally determining the structure of proteins is an important problem as it accelerates the experimentation step and reduces expert analysis. Usually, the relationship among chemical components of proteins (i.e. their amino acid sequences) is used in determining their unique three-dimensional native structures.

2.2.4 Molecular Dynamics

In the broadest sense, molecular dynamics is concerned with molecular motion. Motion is inherent to all chemical processes. Simple vibrations, like bond stretching and angle bending, give rise to IR spectra. Chemical reactions, hormone-receptor binding, and other complex processes are associated with many kinds of intra- and intermolecular motions. Molecular dynamics allows the study of the dynamics of large macromolecules, including biological systems such as proteins, nucleic acids (DNA, RNA), and membranes. Dynamic events may play a key role in controlling processes which affect functional properties of biomolecules. Drug design is commonly used in the pharmaceutical industry to test properties of a molecule on the computer without the need to synthesize it (which is far more expensive).

2.3 Bioinformatics Databases

A bioinformatics database is an organized body of persistent data (e.g. nucleotide and amino acid sequences, three-dimensional structure). Thanks to the human genome project, there has been a growing interest both in the public and private sectors towards creating bioinformatics databases. At the end of 2002, there were more than 300 molecular biology databases available worldwide. This section provides a brief overview of several popular and publicly available bioinformatics databases.

An important class of bioinformatics databases is the sequence database. The largest sequence database is the NCBI/GenBank [1] which collects all known nucleotide and protein sequences. Other major data sources are EMBL (European Molecular Biology Lab) [7] and DDBJ (DNA Data Bank of Japan) [8]. Two major sources of protein sequences and structures are PDB (Protein Data Bank) [9] and SWISS-PROT [10]. PDB contains the protein structures determined by NMR and X-ray crystallography techniques. SWISS-PROT is a curated protein sequence database which provides a high level of annotation such as description of protein function, its domain structure, post-translational modification and other useful information.

3. THE BIOINFORMATICS BENCHMARK SUITE: BIOINFOMARK

To allow computer architecture researchers to explore and evaluate their designs on these emerging applications, we
developed a suite of representative bioinformatics workloads, BioInfoMark. Currently, the BioInfoMark package contains 14 applications, which cover a variety of major bioinformatics tools ranging from sequence comparison to molecular dynamics. This section describes the selected programs, which can be classified using the categories we introduced in Section 2.2.

3.1 Sequence Analysis Benchmarks

Blast: The Blast (Basic Local Alignment Search Tool) programs [11] are a set of heuristic methods that are used to search sequence databases for local alignments to a query sequence. The Blast programs are written in C. BlastP and BlastN are the versions of Blast for protein and nucleotide sequences respectively. All Blast programs can be executed using the following command line: ./blastall -p <option> -i <query file> -d <database file> -o <output file>. The option is “blastp” for searching protein sequences or “blastn” for searching nucleotide sequences. The query file is the file which includes the nucleotide or protein sequence for search. The database file is the database which will be searched. With the provided dataset, the benchmark can be invoked as follows: ./blastall -p blastp -i target.txt -d nr -o output, where target.txt is the homo sapiens hereditary haemochromatosis protein sequence and nr is the NCBI non-redundant protein sequence database.

Fasta: Similar to Blast, Fasta [12] is a collection of local similarity search programs for sequence databases. While Fasta and Blast both do pairwise local alignment, their underlying algorithms are different. Fasta is programmed in C and can be invoked using the following command line: ./fasta34 <query file> <database file>. The query file and database file have the same meaning as those of Blast. With the provided dataset, the Fasta benchmark can be invoked as: ./fasta34 ./qrhuld.aa ../database/nr > ./output.txt, where qrhuld.aa is a query file that contains the human LDL receptor precursor protein. The nr is the same database mentioned above.

Clustal W: Clustal W [13] is a multiple sequence alignment program for nucleotides or amino acids. It first finds a phylogenetic tree for the underlying sequences. It then progressively aligns them one by one based on their ancestral relationship. Clustal W is programmed in C and can be executed as: ./clustalw -batch -infile=<input file> -outfile=<output file>, where the input file includes multiple DNA or protein sequences and the output file records the results after alignment. The command line used to invoke Clustal W with the provided dataset is: ./clustalw -batch -infile=./input.ext -outfile=./output.ext, where input.ext is a query file that includes 317 Ureaplasma’s gene sequences from the NCBI Bacteria genomes database. The output.ext stores the alignment results among those 317 protein sequences.

Hmmer: Hmmer [14] employs hidden Markov models (profile HMMs) for aligning multiple sequences. Profile HMMs are statistical models of multiple sequence alignments. They capture position-specific information about how conserved each column of the alignment is, and which residues are likely. Hmmer is programmed in the C language. It includes several applications such as hmmbuild, hmmcalibrate and hmmsearch. Among these applications, hmmsearch is widely used to search a sequence database for matches to an HMM. The syntax of invoking benchmark hmmsearch is: ./hmmsearch <input file> <database file>. With the provided dataset, this benchmark can be executed as: ./hmmsearch ./globin.hmm ./Artemia.fa, where globin.hmm is the example HMM built from the alignment file of 50 aligned globin sequences and Artemia.fa is a FASTA file of brine shrimp globin, which contains nine tandemly repeated globin domains.

Glimmer: Glimmer (Gene Locator and Interpolated Markov Modeler) [15] finds genes in microbial DNA. It uses interpolated Markov models (IMMs) to identify coding and noncoding regions in the DNA. Glimmer is written in the C++ language and it can be executed as ./glimmer2 <input sequence> <model file>. The command to invoke this benchmark on the given dataset is ./glimmer2 ./NC_000907.fna ./glimmer.icm, where NC_000907.fna is the sequence of the bacterium Haemophilus influenzae and glimmer.icm is the collection file of Markov models.

Emboss: Emboss (European Molecular Biology Open Software Suite) [16] is a software package programmed in C, which contains a wide variety of programs covering sequence alignment, protein motif identification, domain analysis, and codon usage analysis. Diffseq, megamerger and shuffleseq are three representatives in Emboss. Diffseq takes two overlapping, nearly identical sequences and reports the differences between them, together with any features that overlap with these regions. The syntax of this benchmark execution is ./diffseq <seq1> <seq2> -wordsize <wordsize> <output>, where seq1 and seq2 are two sequences for comparison. The wordsize is the word size the program uses when matching sequence words. The output records the differences between the two sequences. With the provided dataset, diffseq can be invoked as: ./diffseq tembl:ap000504 tembl:af129756 -wordsize 6 report, where ap000504 and af129756 are two homo sapiens genes in the nucleic acid database tembl. Megamerger takes two overlapping nucleic acid sequences and merges them into one sequence. It has the same syntax and input parameters as those of diffseq. Shuffleseq takes a sequence as input and outputs one or more sequences whose order has been randomly shuffled. It can be invoked with the following command line: ./shuffleseq -shuffle 1000 tembl:af129756 af129756.fasta. It means that the program will shuffle the example nucleic acid 1000 times and produce the output file af129756.fasta.

3.2 Molecular Phylogeny Analysis Benchmarks

Phylip: Phylip (PHYLogeny Inference Package) [17] is a package of programs for inferring phylogenies (evolutionary trees). Methods that are available in the package include parsimony, distance matrix, maximum likelihood, bootstrapping, and consensus trees. Data types that can be handled include molecular sequences, gene frequencies, restriction sites and fragments, distance matrices, and discrete
characters. The phylip package is programmed in C. Dnapenny and promlk are the typical applications in phylip. Dnapenny is a program that finds all of the most parsimonious trees of the input data. Promlk implements the maximum likelihood method for protein amino acid sequences. Both can run in either command-line or interactive mode. To provide deterministic execution, we provide execution scripts to invoke the two benchmarks.

3.3 Protein Structure Analysis Benchmarks

DALI: Dali [18] performs pairwise structure comparison and finds the structural neighbors of a protein by comparing it against the proteins in the PDB. By default, Dali is accessible only through the network, and is too complex and large to install. The DaliLite distribution, programmed in Perl and Fortran 77, was therefore developed for local and efficient use. It has the core algorithmic functionality of the Dali server. The input is two sets of atomic coordinates of proteins in PDB format. With the provided dataset, Dali can be invoked as: ./DaliLite -pairwise ./pdb/1DPS.pdb ./pdb/2AV8.pdb, where 1DPS.pdb is a type of DNA-binding protein and 2AV8.pdb is a type of oxidoreductase, an enzyme that catalyzes an oxidation-reduction reaction.

CE: CE (Combinatorial Extension) [19] finds structural similarities between the primary structures of pairs of proteins. CE first aligns small fragments from two proteins. Later, these fragments are combined and extended to find larger similar substructures. CE is written in C. It can be invoked as ./CE - ./1hba.pdb - ./4hhb.pdb - ./scratch, where 1hba.pdb and 4hhb.pdb are different types of hemoglobin, which is used to transport oxygen. Scratch is a directory to store temporary files generated during execution.

Predator: Predator [20] predicts the secondary structure of a protein sequence or a set of sequences based on their amino acid sequences. Predator is also programmed in C. It can be launched using the following command: ./predator -a -l <seq> -f<output>. With the provided dataset, Predator can be executed as: ./predator -a -l eukaryota_100.seq -feukaryota_100.out, where eukaryota_100.seq includes 100 Eukaryote protein sequences from the NCBI genomes database and eukaryota_100.out is the result of the secondary structure prediction.

3.4 Molecular Dynamics Simulation Benchmarks

Gamess: Gamess (General Atomic and Molecular Electronic Structure System) [21] is a general ab initio quantum chemistry package. Gamess can compute SCF wavefunctions and a variety of molecular properties, ranging from simple dipole moments to frequency dependent hyperpolarizabilities. Gamess is written in Fortran 77. Gamess can be invoked as ./runall >& ./runall.log. It will use 37 short but diverse examples named EXAM*.INP as the input dataset.

3.5 BioInfoMark Alpha Binaries for Simulation based Studies

To allow computer architecture researchers to simulate the BioInfoMark benchmarks using execution-driven and cycle-accurate architecture research frameworks (such as SimpleScalar, SimAlpha, and M5), we made an extra effort to produce the Alpha binaries of the majority of BioInfoMark benchmarks. We have tested all pre-compiled Alpha binaries (with static link option) using the SimpleScalar sim-outorder simulator. The pre-compiled binaries are available in the BioInfoMark package.

Table 1. Benchmarks with pre-compiled Alpha binaries (all binaries have been successfully tested on the SimpleScalar sim-outorder simulator)

Benchmark    Input Dataset
fasta34      human LDL receptor precursor protein, NCBI nr database
clustalw     317 Ureaplasma’s gene sequences from the NCBI Bacteria genomes database
hmmsearch    a profile HMM built from the alignment of 50 globin sequences, uniprot_sprot.dat from SWISS-PROT
glimmer2     18 bacteria complete genomes from the NCBI genomes database
diffseq      nucleic acid database EMBL
megamerger   nucleic acid database EMBL
shuffleseq   nucleic acid database EMBL
dnapenny     ribosomal RNAs from bacteria and mitochondria
promlk       protein amino acid sequences of 17 species ranging from a deep branching bacterium to humans
predator     100 Eukaryote protein sequences from NCBI genomes database

3.6 Simulation Points of BioInfoMark Workloads

Our earlier study [4] shows that bioinformatics applications can execute billions of instructions before completion. Therefore, it is infeasible to simulate entire benchmark execution using detailed cycle-accurate simulators. Recently, the computer architecture research community has widely adopted the SimPoint [5] methodology as an efficient way to simulate the representative workload execution phases. We used the SimPoint framework developed by Calder et al. to generate the simulation points of the BioInfoMark benchmarks listed in Table 1. Since the total number of instructions of different benchmarks varies significantly, we used the criteria suggested in [22] to determine the size of interval for each individual benchmark. Table 2 lists the interval size as well as the simulation points for each benchmark.

4. CONCLUSIONS

Bioinformatics applications represent increasingly important computer workloads. In order to apply a quantitative approach in computer architecture design and performance evaluation, there is a clear need to develop a benchmark suite of representative bioinformatics applications. This paper presents a group of programs representative of bioinformatics software. These programs include popular tools used for sequence alignments, molecular phylogeny analysis, protein structure prediction and molecular dynamics. The benchmark suite
BioInfoMark is freely available and can be downloaded from www.ideal.ece.ufl.edu/BioInfoMark. In the future, we will explore integrated software/hardware techniques to optimize the performance of bioinformatics applications.

Table 2. The simulation points of BioInfoMark Alpha binaries

Benchmark    Interval (M)   Simulation Points
fasta34      400            412,698,242,326,810,961,503,487,354,1051,932,8,832,792,459,988,54,107,482,808,136,554,996,588
clustalw     850            24,918,906,701,702,674,793,395,252,845,719,883,857,585,858,651,381,817
hmmsearch    680            340,674,330,695,711,75,619,599,54,677,551,672,794,40,404,682,618,370,457,951
glimmer2     20             581,1,45,70,32,13,2,17,1567,8,404,26,84,1572,6,21,494,1109,40,1240
diffseq      35             705,680,444,597,442,255,662,10,3,977,288,443,1006,827,990,1004,343,927,707,256,98,1054,689,964,780,958,824,942
megamerger   35             703,559,254,596,679,443,377,825,259,1067,3,442,901,255,310,593,94,773,976,758,889,781,1058,818,5,560
shuffleseq   300            318,1,718,981,280,115,18,355,261,365,1019,457,194,1018,196,43,254,406,226,775,842,454,894,986,776
dnapenny     140            95,1048,1043,225,672,1,61,413,185,627,51,1014,1056,2,110,911,200,1003,777,970,639,1053,40,438,576,597
promlk       320            21,911,255,1,320,548,876,713,176,319,813,332,818,932,445,807,909,773,580,854,224,719,700,969,472,715,1008,661,375,210
predator     700            272,247,758,195,1143,536,535,166,585,81,233,296,482,640,429,406,88,343,203,403,479,955,37,971

5. REFERENCES

[1] http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
[2] Bioinformation Market Study for Washington Technology Center, Alta Biomedical Group LLC, www.altabiomedical.com, June 2003.
[3] K. Albayraktaroglu et al., BioBench: A Benchmark Suite of Bioinformatics Applications, International Symposium on Performance Analysis of Software and Systems, 2005.
[4] Y. Li, T. Li, T. Kahveci and J. Fortes, Workload Characterization of Bioinformatics Applications on Pentium 4 Architecture, In Proceedings of the International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, 2005.
[5] E. Perelman, G. Hamerly and B. Calder, Picking Statistically Valid and Early Simulation Points, In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2003.
[6] J. S. Nelson, Fishes of the World, John Wiley & Sons, Inc., New York, 1994.
[7] European Molecular Biology Laboratory, http://www.embl-heidelberg.de
[8] DNA Data Bank of Japan, http://www.ddbj.nig.ac.jp/
[9] The RCSB Protein Data Bank, http://www.rcsb.org/pdb/
[10] The UniProt/Swiss-Prot Database, http://www.ebi.ac.uk/swissprot/
[11] S. Altschul, W. Gish, W. Miller, E. W. Meyers and D. J. Lipman, Basic Local Alignment Search Tool, Journal of Molecular Biology, vol. 215, no. 3, pages 403-410, 1990.
[12] W. R. Pearson and D. J. Lipman, Improved tools for biological sequence comparison, Proc. Natl. Acad. Sci., 85 (1988), 3244-3248.
[13] J. D. Thompson, D. G. Higgins, and T. J. Gibson, Clustal W: Improving the Sensitivity of Progressive Multiple Sequence Alignment through Sequence Weighting, Positions-specific Gap Penalties and Weight Matrix Choice, Nucleic Acids Research, vol. 22, no. 22, pages 4673-4680, 1994.
[14] S. R. Eddy, Profile Hidden Markov Models, Bioinformatics Review, vol. 14, no. 9, pages 755-763, 1998.
[15] S. Salzberg, A. Delcher, S. Kasif, and O. White, Microbial Gene Identification using Interpolated Markov Models, Nucleic Acids Research, vol. 26, no. 2, pages 544-548, 1998.
[16] P. Rice, I. Longden, and A. Bleasby, EMBOSS: The European Molecular Biology Open Software Suite, Trends in Genetics, vol. 16, no. 6, pages 276-277, 2000.
[17] J. Felsenstein, PHYLIP - Phylogeny Inference Package (version 3.2), Cladistics, 5: 164-166, 1989.
[18] L. Holm and J. Park, DaliLite Workbench for Protein Structure, Bioinformatics Applications Note, vol. 16, no. 6, pages 566-567, 2000.
[19] I. N. Shindyalov and P. E. Bourne, Protein Structure Alignment by Incremental Combinatorial Extension (CE) of the Optimal Path, Protein Engineering, vol. 11, no. 99, pages 739-747, 1998.
[20] D. Frishman and P. Argos, 75% Accuracy in Protein Secondary Structure Prediction, Proteins, vol. 27, pages 329-335, 1997.
[21] M. W. Schmidt et al., General Atomic and Molecular Electronic Structure System, Journal of Comput. Chem., vol. 14, pages 1347-1363, 1993.
[22] G. Hamerly, E. Perelman and B. Calder, How to Use SimPoint to Pick Simulation Points, ACM SIGMETRICS Performance Evaluation Review, 2004.
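
As a small illustration of the pairwise alignment problem described in Section 2.2.1, the following Python sketch computes the minimum number of insert, delete, and replace operations needed to transform one sequence into another, using the classic dynamic-programming recurrence for edit distance. It is not code from the BioInfoMark suite: the function name edit_distance is hypothetical, and charging a unit cost for every operation is an assumption made for simplicity (the benchmark tools apply their own scoring schemes and heuristics).

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insert/delete/replace operations turning a into b."""
    # prev[j] is the distance between the current prefix of a (one letter
    # shorter than the row being built) and the first j letters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning the first i letters of a into "" takes i deletes
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or replace
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Sequences A and B from the pairwise alignment example (Figure 4).
    print(edit_distance("GAATTCAGTA", "GGATCGTTA"))
```

For the two sequences of Figure 4 the function returns 4, which agrees with the one replace, two delete, and one insert operations shown in that figure.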
