BioInfoMark: A Bioinformatic Benchmark Suite for Computer Architecture Research
Yue Li and Tao Li
Intelligent Design of Efficient Architecture Lab (IDEAL)
Department of Electrical and Computer Engineering
University of Florida
yli@ecel.ufl.edu, taoli@ece.ufl.edu
ABSTRACT

The exponential growth in the amount of genomic data has spurred growing interest in large-scale analysis of genetic information. Bioinformatics applications, which explore computational methods that allow researchers to sift through massive biological data and extract useful information, are becoming increasingly important computer workloads. This paper presents BioInfoMark, a benchmark suite of representative bioinformatics applications to facilitate the design and evaluation of computer architectures for these emerging workloads. Currently, the BioInfoMark suite contains 14 highly popular bioinformatics tools and covers the major fields of study in computational biology such as sequence comparison, phylogenetic analysis, protein structure analysis, and molecular dynamics simulation.

The BioInfoMark package includes benchmark source code, input datasets and information for compiling and using the benchmarks. To allow computer architecture researchers to run the BioInfoMark suite on several popular execution-driven simulators, we provide pre-compiled little-endian Alpha ISA binaries and generated simulation points. The BioInfoMark package is freely available and can be downloaded from: http://www.ideal.ece.ufl.edu/BioInfoMark.

1. INTRODUCTION

The major breakthrough in the field of molecular biology, coupled with advances in genomic technologies, has led to explosive growth in the area of informatics. For example, the National Center for Biotechnology Information (NCBI) GenBank, an annotated collection of all publicly available DNA sequences, has been growing at an exponential rate.

Figure 1. Growth of GenBank: There are 49,398,852,122 bases from 45,236,251 sequences in GenBank as of Jun. 15 2005 (source: [1]).

As genomics moves forward, having accessible computational methods with which to extract, view, and analyze genomic information becomes essential. Bioinformatics allows researchers to sift through the massive bioinformatics data (e.g., nucleic acid and protein sequences, structures, functions, pathways and interactions) and identify information of interest.

Today, bioinformatics has become an industry and has gained popularity among numerous markets including pharmaceutical, (industrial, agricultural and environmental) biotechnology and homeland security. A number of recent market research reports estimate the size of the bioinformatics market as $176 billion and the market is projected to grow to $243 billion within the next 5 years [2]. In August 2000, IBM announced an initial $100 million investment to spur business development in the life sciences, assuring its prominence as one of the emerging computing markets. There are over 1,300 biotech companies in the US alone and some 1,600 companies in Europe.

Clearly, computer systems that can cost-effectively deliver high performance on computational biology applications play a vital role in the future growth of the bioinformatics market. In order to apply a quantitative approach in computer architecture design, optimization and performance evaluation, researchers first need to identify representative workloads from this emerging application domain.

This paper presents BioInfoMark, a benchmark suite of representative bioinformatics applications to facilitate the design and evaluation of computer architectures for these emerging workloads. Currently, the BioInfoMark suite contains 14 highly popular bioinformatics tools and covers the major fields of study in computational biology such as sequence comparison, phylogenetic analysis, protein structure analysis, and molecular dynamics simulation. Sequence comparison finds similarities between two or more DNA or protein sequences. Phylogeny explores the ancestral relationship among a set of genes or organisms. Protein structure analysis (a) finds the similarities between three-dimensional protein structures and (b) predicts the shape of a protein (e.g., primary, secondary, and tertiary structure) given its amino acid sequence. Molecular dynamics explores the interactions among biomolecules.

Compared with a recent study reported in [3], our independent work covers many more bioinformatics tools in terms of quantity and diversity. To allow computer architecture researchers to run the BioInfoMark suite on several popular execution-driven simulators (e.g. SimpleScalar, SimAlpha, and M5), we put additional effort into providing pre-compiled little-endian Alpha ISA binaries and generating simulation points. A detailed, quantitative workload characterization of the BioInfoMark benchmarks on the Pentium 4 microarchitecture can be found in [4].


The rest of this paper is organized as follows. Section 2
provides an introductory background on biology and a brief
review of bioinformatics study areas. Section 3 describes the
BioInfoMark suite, including benchmark functionality, input
datasets, benchmark compilation and execution, the pre-
compiled Alpha binaries, and the generated simulation
points [5]. Section 4 concludes the paper and outlines our
future work.

2. BACKGROUND

To help readers better understand the BioInfoMark benchmarks, we first provide an introductory background on biology and illustrate the major areas of bioinformatics.

2.1 Introduction: DNA, Gene and Proteins

One of the fundamental principles of biology is that within each cell, DNA that comprises the genes encodes RNA which in turn produces the proteins that regulate all of the biological processes within an organism.

DNA is a double chain of simpler molecules called nucleotides, tied together in a double-helix structure (Figure 2). The nucleotides are distinguished by a nitrogen base that can be of four kinds: adenine (A), cytosine (C), guanine (G) and thymine (T). Adenine (A) always bonds to thymine (T) whereas cytosine (C) always bonds to guanine (G), forming base pairs. A DNA molecule can be specified uniquely by listing its sequence of nucleotides, or base pairs. Proteins are molecules that accomplish most of the functions of a living cell, determining its shape and structure. A protein is a linear sequence of molecules called amino acids. Twenty different amino acids are commonly found in proteins. Similar to DNA, proteins are conveniently represented as a string of letters expressing their sequence of amino acids.

Figure 2. DNA molecule.

Once a protein is produced, it folds into a three-dimensional shape. The positions of the central atoms, called carbon-alpha (Cα), of the amino acids of a protein define its primary structure. If a contiguous subsequence of Cα atoms follows some predefined pattern, these atoms are classified as a secondary structure, such as an alpha-helix or a beta-sheet. The relative positioning of the secondary structures defines the tertiary structure. The overall shape of all chains of a protein then defines the quaternary structure.

Figure 3. Three dimensional structure of human foetal deoxyhaemoglobin (PDB id = 1FDH).

2.2 Bioinformatics Problems

In this section, we illustrate the major problems in bioinformatics, including sequence analysis, phylogeny, protein structure analysis/prediction and molecular dynamics.

2.2.1 Sequence Analysis

Sequence analysis is perhaps the most commonly performed task in bioinformatics. Sequence analysis can be defined as the problem of finding which parts of the sequences (nucleotide or amino acid sequences) are similar and which parts are different. By comparing sequences, researchers can gain crucial understanding of their significance and functionality: high sequence similarity usually implies significant functional or structural similarity while sequence differences hold the key information regarding diversity and evolution.

The most commonly used sequence analysis technique is pairwise sequence comparison. A sequence can be transformed to another sequence with the help of three edit operations. Each edit operation can insert a new letter, delete an existing letter, or replace an existing letter with a new one. The alignment of two sequences is defined by the edit operations that transform one into the other. This is usually represented by writing one on top of the other. Insertions and deletions (i.e., gaps) are represented by the dash symbol (“-”). The following example illustrates an alignment between the sequences A= “GAATTCAGTA” and B= “GGATCGTTA”. The objective is to match identical subsequences as far as possible (or equivalently use as few edit operations as possible). In the example, the aligned sequences match in seven positions.

     Sequence A  GAATTCAGT-A
                  R D  D  I
     Sequence B  GGA-TC-GTTA

Figure 4. Alignment of two sequences (The aligned sequences match in seven positions. One replace, two delete, and one insert operations, shown by letters R, D, and I, are used.)

Alignment of sequences is considered in two different but related classes: If the entire sequences are aligned, then it is called a global alignment. If subsequences of two sequences are aligned, then it is called a local alignment.
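The edit-operation view of pairwise alignment described above can be made concrete with a short dynamic-programming sketch. It assumes a unit cost for every insert, delete, and replace operation and reports only the minimum number of operations needed for a global alignment. The function name edit_distance is illustrative and is not taken from any BioInfoMark tool; practical aligners additionally use substitution matrices, gap penalties, and heuristics.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insert, delete, and replace operations turning a into b.

    Assumption for illustration: every operation costs 1.
    """
    # prev[j] is the distance between the already processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # match (cost 0) or replace (cost 1)
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # The two sequences used in the pairwise alignment example.
    print(edit_distance("GAATTCAGTA", "GGATCGTTA"))
```

For the two example sequences the sketch reports four operations, which agrees with the one replace, two delete, and one insert operations of the alignment shown in Figure 4.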

Multiple sequence alignment compares more than two sequences: all sequences are aligned on top of each other. Each column is the alignment of one letter from each sequence. The following example illustrates a multiple alignment among the sequences A= “AGGTCAGTCTAGGAC”, B= “GGACTGAGGTC”, and C= “GAGGACTGGCTACGGAC”.

     Sequence A             -AGGTCAGTCTA-GGAC
     Sequence B             --GGACTGA----GGTC
     Sequence C             GAGGACTGGCTACGGAC

Figure 5. Multiple alignment of three DNA sequences A, B, and C.
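
The column-wise view of a multiple alignment lends itself to a small sketch. The code below takes the three aligned rows of Figure 5, recovers the original sequences by removing the gap symbols, and lists the columns in which every sequence carries the same letter. The helper names ungap and conserved_columns are illustrative assumptions and do not belong to any BioInfoMark benchmark.

```python
from typing import List

GAP = "-"


def ungap(row: str) -> str:
    """Remove gap symbols to recover the original sequence."""
    return row.replace(GAP, "")


def conserved_columns(rows: List[str]) -> List[int]:
    """Return 0-based indices of gap-free columns where all rows agree."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("aligned rows must have equal length")
    return [
        i
        for i, column in enumerate(zip(*rows))
        if GAP not in column and len(set(column)) == 1
    ]


# Aligned rows copied from the multiple alignment example (Figure 5).
rows = [
    "-AGGTCAGTCTA-GGAC",  # Sequence A
    "--GGACTGA----GGTC",  # Sequence B
    "GAGGACTGGCTACGGAC",  # Sequence C
]

if __name__ == "__main__":
    print([ungap(r) for r in rows])
    print(conserved_columns(rows))
```

Counting fully conserved columns is only one simple way to summarize an alignment; it is used here because it follows directly from the definition of a column as one letter from each sequence.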

2.2.2 Molecular Phylogeny Analysis

Molecular phylogeny infers lines of ancestry of genes or organisms. Phylogeny analysis provides crucial understanding about the origins of life and the homology of various species on earth. Phylogenetic trees are composed of nodes and branches. Each leaf node corresponds to a gene or an organism. Internal nodes represent inferred ancestors. The evolutionary distance between two genes or organisms is computed as a function of the length of the branches between their nodes and their common ancestors.

Figure 6. Evolutionary relationships between fish models.

Figure 6 shows evolutionary relationships between fish models. This evolutionary tree (based on data from Fishes of the World by J. S. Nelson) [6] illustrates that the last common ancestor of medaka and zebrafish lived more than 110 million years (Myr) ago.

2.2.3 Protein Structure Analysis

Two protein substructures are called similar if their Cα atoms can be mapped to nearby points after translation and rotation of one of the proteins. This can also be considered as a one-to-one mapping of amino acids. Usually, structural similarity requires that the amino acid pairs that are considered similar have the same secondary structure type. Structural similarities among proteins provide insight regarding their functional relationship. Figure 7 presents the structural similarity of two proteins.

Three-dimensional structures of only a small subset of proteins are known, as determining them requires expensive wet-lab experimentation. Computationally determining the structure of proteins is an important problem as it accelerates the experimentation step and reduces expert analysis. Usually, the relationship among chemical components of proteins (i.e. their amino acid sequences) is used in determining their unique three-dimensional native structures.

Figure 7. The structural similarity between two proteins. (source http://cl.sdsc.edu/)

2.2.4 Molecular Dynamics

In the broadest sense, molecular dynamics is concerned with molecular motion. Motion is inherent to all chemical processes. Simple vibrations, like bond stretching and angle bending, give rise to IR spectra. Chemical reactions, hormone-receptor binding, and other complex processes are associated with many kinds of intra- and intermolecular motions.

Molecular dynamics allows the study of the dynamics of large macromolecules, including biological systems such as proteins, nucleic acids (DNA, RNA), and membranes. Dynamic events may play a key role in controlling processes which affect functional properties of biomolecules. Drug design is commonly used in the pharmaceutical industry to test properties of a molecule on the computer without the need to synthesize it (which is far more expensive).

2.3 Bioinformatics Databases

A bioinformatics database is an organized body of persistent data (e.g. nucleotide and amino acid sequences, three-dimensional structure). Thanks to the human genome project, there has been a growing interest in both the public and private sectors in creating bioinformatics databases. At the end of 2002, there were more than 300 molecular biology databases available worldwide. This section provides a brief overview of several popular and publicly available bioinformatics databases.

An important class of bioinformatics databases is the sequence database. The largest sequence database is the NCBI/GenBank [1] which collects all known nucleotide and protein sequences. Other major data sources are EMBL (European Molecular Biology Lab) [7] and DDBJ (DNA Data Bank of Japan) [8]. Two major sources of protein sequences and structures are PDB (Protein Data Bank) [9], and SWISS-PROT [10]. PDB contains the protein structures determined by NMR and X-ray crystallography techniques. SWISS-PROT is a curated protein sequence database which provides a high level of annotation such as description of protein function, its domain structure, post-translational modification and other useful information.

3. THE BIOINFORMATICS BENCHMARK SUITE: BIOINFOMARK
To allow computer architecture researchers to explore and evaluate their designs on these emerging applications, we developed a suite of representative bioinformatics workloads, BioInfoMark. Currently, the BioInfoMark package contains 14 applications, which cover a variety of important bioinformatics tools ranging from sequence comparison to molecular dynamics. This section describes the selected programs, which can be classified using the categories we introduced in Section 2.2.

3.1 Sequence Analysis Benchmarks

Blast: The Blast (Basic Local Alignment Search Tool) programs [11] are a set of heuristic methods that are used to search sequence databases for local alignments to a query sequence. The Blast programs are written in C. BlastP and BlastN are the versions of Blast for protein and nucleotide sequences respectively. All Blast programs can be executed using the following command line: ./blastall -p <option> -i <query file> -d <database file> -o <output file>. The option is “blastp” for searching protein sequences or “blastn” for searching nucleotide sequences. The query file is the file which includes the nucleotide or protein sequence for search. The database file is the database which will be searched. With the provided dataset, the benchmark can be invoked as follows: ./blastall -p blastp -i target.txt -d nr -o output, where target.txt is the Homo sapiens hereditary haemochromatosis protein sequence and nr is the non-redundant protein sequence database from NCBI.

Fasta: Similar to Blast, Fasta [12] is a collection of local similarity search programs for sequence databases. While Fasta and Blast both do pairwise local alignment, their underlying algorithms are different. Fasta is programmed in C and can be invoked using the following command line: ./fasta34 <query file> <database file>. The query file and database file have the same meaning as those of Blast. With the provided dataset, the Fasta benchmark can be invoked as: ./fasta34 ./qrhuld.aa ../database/nr > ./output.txt, where qrhuld.aa is a query file that contains the human LDL receptor precursor protein. The nr is the same database mentioned above.

Clustal W: Clustal W [13] is a multiple sequence alignment program for nucleotides or amino acids. It first finds a phylogenetic tree for the underlying sequences. It then progressively aligns them one by one based on their ancestral relationship. Clustal W is programmed in C and can be executed as: ./clustalw -batch -infile= <input file> -outfile= <output file>, where the input file includes multiple DNA or protein sequences and the output file records the results after alignment. The command line used to invoke Clustal W with the provided dataset is: ./clustalw -batch -infile=./input.ext -outfile=./output.ext, where input.ext is a query file that includes 317 Ureaplasma’s gene sequences from the NCBI Bacteria genomes database. The output.ext stores the alignment results among those 317 sequences.

Hmmer: Hmmer [14] employs hidden Markov models (profile HMMs) for aligning multiple sequences. Profile HMMs are statistical models of multiple sequence alignments. They capture position-specific information about how conserved each column of the alignment is, and which residues are likely. Hmmer is programmed in the C language. It includes several applications such as hmmbuild, hmmcalibrate and hmmsearch. Among these applications, hmmsearch is widely used to search a sequence database for matches to an HMM. The syntax of invoking benchmark hmmsearch is: ./hmmsearch <input file> <database file>. With the provided dataset, this benchmark can be executed as: ./hmmsearch ./globin.hmm ./Artemia.fa, where globin.hmm is the example HMM built from the alignment file of 50 aligned globin sequences and Artemia.fa is a FASTA file of brine shrimp globin, which contains nine tandemly repeated globin domains.

Glimmer: Glimmer (Gene Locator and Interpolated Markov Modeler) [15] finds genes in microbial DNA. It uses interpolated Markov models (IMMs) to identify coding and noncoding regions in the DNA. Glimmer is written in the C++ language and it can be executed as ./glimmer2 <input sequence> <model file>. The command to invoke this benchmark on the given dataset is ./glimmer2 ./NC_000907.fna ./glimmer.icm, where NC_000907.fna contains the sequence of the bacterium Haemophilus influenzae and glimmer.icm is the collection file of Markov models.

Emboss: Emboss (European Molecular Biology Open Software Suite) [16] is a software package programmed in C, which contains a wide variety of programs ranging from sequence alignment and protein motif identification to domain analysis and codon usage analysis. Diffseq, megamerger and shuffleseq are three representatives of Emboss. Diffseq takes two overlapping, nearly identical sequences and reports the differences between them, together with any features that overlap with these regions. The syntax of this benchmark execution is ./diffseq <seq1> <seq2> -wordsize <wordsize> <output>, where seq1 and seq2 are two sequences for comparison. The wordsize is the word length the program uses when matching words between the two sequences. The output records the differences found between the two sequences. With the provided dataset, diffseq can be invoked as: ./diffseq tembl:ap000504 tembl:af129756 -wordsize 6 report, where ap000504 and af129756 are two Homo sapiens genes in the nucleic acid database tembl. Megamerger takes two overlapping nucleic acid sequences and merges them into one sequence. It has the same syntax and input parameters as those of diffseq. Shuffleseq takes a sequence as input and outputs one or more sequences whose order has been randomly shuffled. It can be invoked with the following command line: ./shuffleseq -shuffle 1000 tembl:af129756 af129756.fasta. This means that the program will shuffle the example nucleic acid 1000 times and produce the output file af129756.fasta.

3.2 Molecular Phylogeny Analysis Benchmarks

Phylip: Phylip (PHYLogeny Inference Package) [17] is a package of programs for inferring phylogenies (evolutionary trees). Methods that are available in the package include parsimony, distance matrix, maximum likelihood, bootstrapping, and consensus trees. Data types that can be handled include molecular sequences, gene frequencies, restriction sites and fragments, distance matrices, and discrete



characters. The Phylip package is programmed in C. Dnapenny and promlk are typical applications in Phylip. Dnapenny is a program that finds all of the most parsimonious trees of the input data. Promlk implements the maximum likelihood method for protein amino acid sequences. Both can run either from the command line or interactively. To provide deterministic execution, we provide an execution script to invoke the two benchmarks.

3.3 Protein Structure Analysis Benchmarks

DALI: Dali [18] performs pairwise structure comparison and also finds the structural neighbors of a protein by comparing it against the proteins in the PDB. By default, Dali is accessible only through the network, and is too complex and large to install. The DaliLite distribution, programmed in Perl and Fortran 77, was therefore developed for local and efficient use. It has the core algorithmic functionality of the Dali server. The input is two sets of atomic coordinates of proteins in PDB format. With the provided dataset, Dali can be invoked as: ./DaliLite -pairwise ./pdb/1DPS.pdb ./pdb/2AV8.pdb, where 1DPS.pdb is a DNA-binding protein and 2AV8.pdb is an oxidoreductase, an enzyme that catalyzes an oxidation-reduction reaction.

CE: CE (Combinatorial Extension) [19] finds structural similarities between the primary structures of pairs of proteins. CE first aligns small fragments from two proteins. Later, these fragments are combined and extended to find larger similar substructures. CE is written in C. It can be invoked as ./CE - ./1hba.pdb - ./4hhb.pdb - ./scratch, where 1hba.pdb and 4hhb.pdb are different types of hemoglobin, which is used to transport oxygen. Scratch is a directory to store temporary files generated during execution.

Predator: Predator [20] predicts the secondary structure of a protein sequence or a set of sequences based on their amino acid sequences. Predator is also programmed in C. It can be launched using the following command: ./predator -a -l <seq> -f<output>. With the provided dataset, Predator can be executed as: ./predator -a -l eukaryota_100.seq -feukaryota_100.out, where eukaryota_100.seq includes 100 Eukaryote protein sequences from the NCBI genomes database and eukaryota_100.out is the result of the secondary structure prediction.

3.4 Molecular Dynamics Simulation Benchmarks

Gamess: Gamess (General Atomic and Molecular Electronic Structure System) [21] is a general ab initio quantum chemistry package. Gamess can compute SCF wave functions and a variety of molecular properties, ranging from simple dipole moments to frequency dependent hyperpolarizabilities. Gamess is written in Fortran 77. Gamess can be invoked as ./runall >& ./runall.log. It will use 37 short but diverse examples named EXAM*.INP as the input dataset.

3.5 BioInfoMark Alpha Binaries for Simulation based Studies

To allow computer architecture researchers to simulate the BioInfoMark benchmarks using execution-driven and cycle-accurate architecture research frameworks (such as SimpleScalar, SimAlpha, and M5), we made an extra effort to produce the Alpha binaries of the majority of BioInfoMark benchmarks. We have tested all pre-compiled Alpha binaries (with static link option) using the SimpleScalar sim-outorder simulator. The pre-compiled binaries are available in the BioInfoMark package.

Table 1. Benchmarks with pre-compiled Alpha binaries (all binaries have been successfully tested on the SimpleScalar sim-outorder simulator)

Benchmark     Input Dataset
fasta34       human LDL receptor precursor protein, NCBI nr database
clustalw      317 Ureaplasma’s gene sequences from the NCBI Bacteria genomes database
hmmsearch     a profile HMM built from the alignment of 50 globin sequences, uniprot_sprot.dat from SWISS-PROT
glimmer2      18 bacteria complete genomes from the NCBI genomes database
diffseq       nucleic acid database EMBL
megamerger    nucleic acid database EMBL
shuffleseq    nucleic acid database EMBL
dnapenny      ribosomal RNAs from bacteria and mitochondria
promlk        protein amino acid sequences of 17 species ranging from a deep branching bacterium to humans
predator      100 Eukaryote protein sequences from NCBI genomes database

3.6 Simulation Points of BioInfoMark Workloads

Our earlier study [4] shows that bioinformatics applications can execute billions of instructions before completion. Therefore, it is infeasible to simulate entire benchmark execution using detailed cycle-accurate simulators. Recently, the computer architecture research community has widely adopted the SimPoint [5] methodology as an efficient way to simulate the representative workload execution phases. We used the SimPoint framework developed by Calder et al. to generate the simulation points of the BioInfoMark benchmarks listed in Table 1. Since the total number of instructions of different benchmarks varies significantly, we used the criteria suggested in [22] to determine the interval size for each individual benchmark. Table 2 lists the interval size as well as the simulation points for each benchmark.

4. CONCLUSIONS

Bioinformatics applications represent increasingly important computer workloads. In order to apply a quantitative approach in computer architecture design and performance evaluation, there is a clear need to develop a benchmark suite of representative bioinformatics applications. This paper presents a group of programs representative of bioinformatics software. These programs include popular tools used for sequence alignments, molecular phylogeny analysis, protein structure prediction and molecular dynamics. The benchmark suite


BioInfoMark is freely available and can be downloaded from
www.ideal.ece.ufl.edu/BioInfoMark. In the future, we will explore
integrated software/hardware techniques to optimize the performance of
bioinformatics applications.

Table 2. The simulation points of BioInfoMark Alpha binaries

Benchmark    Interval (M)   Simulation Points
fasta34      400            412,698,242,326,810,961,503,487,354,1051,932,8,
                            832,792,459,988,54,107,482,808,136,554,996,588
clustalw     850            24,918,906,701,702,674,793,395,252,845,719,883,
                            857,585,858,651,381,817
hmmsearch    680            340,674,330,695,711,75,619,599,54,677,551,672,
                            794,40,404,682,618,370,457,951
glimmer2     20             581,1,45,70,32,13,2,17,1567,8,404,26,84,1572,6,
                            21,494,1109,40,1240
diffseq      35             705,680,444,597,442,255,662,10,3,977,288,443,
                            1006,827,990,1004,343,927,707,256,98,1054,689,
                            964,780,958,824,942
megamerger   35             703,559,254,596,679,443,377,825,259,1067,3,442,
                            901,255,310,593,94,773,976,758,889,781,1058,
                            818,5,560
shuffleseq   300            318,1,718,981,280,115,18,355,261,365,1019,457,
                            194,1018,196,43,254,406,226,775,842,454,894,
                            986,776
dnapenny     140            95,1048,1043,225,672,1,61,413,185,627,51,1014,
                            1056,2,110,911,200,1003,777,970,639,1053,40,
                            438,576,597
promlk       320            21,911,255,1,320,548,876,713,176,319,813,332,
                            818,932,445,807,909,773,580,854,224,719,700,
                            969,472,715,1008,661,375,210
predator     700            272,247,758,195,1143,536,535,166,585,81,233,
                            296,482,640,429,406,88,343,203,403,479,955,37,
                            971

5. REFERENCES

[1] http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
[2] Bioinformation Market Study for Washington Technology Center, Alta
    Biomedical Group LLC, www.altabiomedical.com, June 2003.
[3] K. Albayraktaroglu et al., BioBench: A Benchmark Suite of
    Bioinformatics Applications, International Symposium on Performance
    Analysis of Software and Systems, 2005.
[4] Y. Li, T. Li, T. Kahveci and J. Fortes, Workload Characterization of
    Bioinformatics Applications on Pentium 4 Architecture, In
    Proceedings of the International Symposium on Modeling, Analysis,
    and Simulation of Computer and Telecommunication Systems, 2005.
[5] E. Perelman, G. Hamerly and B. Calder, Picking Statistically Valid
    and Early Simulation Points, In Proceedings of the International
    Conference on Parallel Architectures and Compilation Techniques,
    2003.
[6] J. S. Nelson, Fishes of the World, John Wiley & Sons, Inc., New
    York, 1994.
[7] European Molecular Biology Laboratory,
    http://www.embl-heidelberg.de
[8] DNA Data Bank of Japan, http://www.ddbj.nig.ac.jp/
[9] The RCSB Protein Data Bank, http://www.rcsb.org/pdb/
[10] The UniProt/Swiss-Prot Database, http://www.ebi.ac.uk/swissprot/
[11] S. Altschul, W. Gish, W. Miller, E. W. Meyers and D. J. Lipman,
     Basic Local Alignment Search Tool, Journal of Molecular Biology,
     vol. 215, no. 3, pages 403-410, 1990.
[12] W. R. Pearson and D. J. Lipman, Improved Tools for Biological
     Sequence Comparison, Proc. Natl. Acad. Sci., vol. 85, pages
     3244-3248, 1988.
[13] J. D. Thompson, D. G. Higgins, and T. J. Gibson, Clustal W:
     Improving the Sensitivity of Progressive Multiple Sequence
     Alignment through Sequence Weighting, Positions-specific Gap
     Penalties and Weight Matrix Choice, Nucleic Acids Research, vol.
     22, no. 22, pages 4673-4680, 1994.
[14] S. R. Eddy, Profile Hidden Markov Models, Bioinformatics Review,
     vol. 14, no. 9, pages 755-763, 1998.
[15] S. Salzberg, A. Delcher, S. Kasif, and O. White, Microbial Gene
     Identification using Interpolated Markov Models, Nucleic Acids
     Research, vol. 26, no. 2, pages 544-548, 1998.
[16] P. Rice, I. Longden, and A. Bleasby, EMBOSS: The European Molecular
     Biology Open Software Suite, Trends in Genetics, vol. 16, no. 6,
     pages 276-277, 2000.
[17] J. Felsenstein, PHYLIP - Phylogeny Inference Package (version 3.2),
     Cladistics, vol. 5, pages 164-166, 1989.
[18] L. Holm and J. Park, DaliLite Workbench for Protein Structure,
     Bioinformatics Applications Note, vol. 16, no. 6, pages 566-567,
     2000.
[19] I. N. Shindyalov and P. E. Bourne, Protein Structure Alignment by
     Incremental Combinatorial Extension (CE) of the Optimal Path,
     Protein Engineering, vol. 11, no. 99, pages 739-747, 1998.
[20] D. Frishman and P. Argos, 75% Accuracy in Protein Secondary
     Structure Prediction, Proteins, vol. 27, pages 329-335, 1997.
[21] M. W. Schmidt et al., General Atomic and Molecular Electronic
     Structure System, Journal of Comput. Chem., vol. 14, pages
     1347-1363, 1993.
[22] G. Hamerly, E. Perelman and B. Calder, How to Use SimPoint to Pick
     Simulation Points, ACM SIGMETRICS Performance Evaluation Review,
     2004.
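
As an illustration of how the entries in Table 2 might be used, the
sketch below converts simulation point indices into instruction ranges.
It assumes the usual SimPoint convention that point k denotes the k-th
fixed-length interval, counted from zero, so that it begins k times the
interval size into the execution. The function name simulation_windows
is hypothetical and is not part of the BioInfoMark package or of
SimPoint. The three points used are the first three listed for fasta34.

```python
def simulation_windows(interval_millions, points):
    """Map simulation point indices to (start, end) instruction counts.

    Assumption: indices are zero-based interval numbers, so point k
    covers instructions [k * interval, (k + 1) * interval).
    """
    interval = interval_millions * 1_000_000
    return [(p * interval, (p + 1) * interval) for p in sorted(points)]


# First three fasta34 simulation points from Table 2 (interval size 400M).
windows = simulation_windows(400, [412, 698, 242])

for start, end in windows:
    # A simulator would skip `start` instructions from program start,
    # then simulate `end - start` instructions in detail.
    print(f"skip {start:,} instructions, simulate {end - start:,}")
```

Sorting the points lets a single pass over the program reach every
window in order, which avoids restarting the benchmark for each point.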




57 bio infomark

  • 1. BioInfoMark: A Bioinformatic Benchmark Suite for Computer Architecture Research Yue Li and Tao Li Intelligent Design of Efficient Architecture Lab(IDEAL) Department of Electrical and Computer Engineering University of Florida yli@ecel.ufl.edu, taoli@ece.ufl.edu ABSTRACT Bioinformatics allows researchers to sift through the massive bioinformatics data (e.g., nucleic acid and protein sequences, The exponential growth in the amount of genomic data has structures, functions, pathways and interactions) and identify spurred growing interest in large scale analysis of genetic information of interest. information. Bioinformatics applications, which explore Today, bioinformatics has become an industry and has computational methods to allow researchers to sift through the gained popularity among numerous markets including massive biological data and extract useful information, are pharmaceutical, (industrial, agricultural and environmental) becoming increasingly important computer workloads. This biotechnology and homeland security. A number of recent paper presents BioInfoMark, a benchmark suite of market research reports estimate the size of the bioinformatics representative bioinformatics applications to facilitate the market as $176 billion and the market is projected to grow to design and evaluation of computer architectures for these $243 billion within the next 5 years [2]. In August 2000, IBM emerging workloads. Currently, the BioInfoMark suite announced an initial $100 million investment to spur business contains 14 highly popular bioinformatics tools and covers the development in the life sciences, assuring its prominence as major fields of study in computational biology such as one of the emerging computing markets. There are over 1,300 sequence comparison, phylogenetic analysis, protein structure biotech companies only in US and some 1,600 companies in analysis, and molecular dynamics simulation. Europe. 
The BioInfoMark package includes benchmark source code, Clearly, computer systems that can cost-effectively deliver input datasets and information for compiling and using the high-performance on computational biology applications play benchmarks. To allow computer architecture researchers to a vita role in the future growth of the bioinformatics market. In run the BioInfoMark suite on several popular execution driven order to apply a quantitative approach in computer architecture simulators, we provide pre-compiled little-endian Alpha ISA design, optimization and performance evaluation, researchers binaries and generated simulation points. The BioInfoMark need to identify representative workloads from this emerging package is freely available and can be downloaded from: application domain first. http://www.ideal.ece.ufl.edu/BioInfoMark. This paper presents BioInfoMark, a benchmark suite of 1. INTRODUCTION representative bioinformatics applications to facilitate the design and evaluation of computer architectures for these The major breakthrough in the field of molecular biology, emerging workloads. Currently, the BioInfoMark suite coupled with advances in genomic technologies has led to an contains 14 highly popular bioinformatics tools and covers the explosive growth in the area of informatics. For example, the major fields of study in computational biology such as National Center for Biotechnology Information (NCBI) sequence comparison, phylogenetic analysis, protein structure GenBank, an annotated collection of all publicly available analysis, and molecular dynamics simulation. Sequence DNA sequences, has been growing at an exponential rate. comparison finds similarities between two or more DNA or protein sequences. Phylogeny explores the ancestral relationship among a set of genes or organisms. 
Protein structure analysis (a) finds the similarities between three- dimensional protein structures and (b) predicts the shape of a protein (e.g., primary, secondary, and tertiary structure) given its amino acid sequence. Molecular dynamics explores the interactions among biomolecules. Compared with a recent study reported in [3], our independent work covers many more bioinformatics tools in terms of quantity and diversity. To allow computer architecture researchers to run the BioInfoMark suite on Figure 1. Growth of GenBank: There are 49,398,852,122 bases several popular execution driven simulators (e.g. Simplescalar, from 45,236,251 sequences in GenBank as of Jun. 15 2005 SimAlpha, and M5), we put additional effort into providing (source: [1]). pre-compiled little-endian Alpha ISA binaries and generating As genomics moves forward, having accessible simulation points. A detailed, quantitative workload computational methods with which to extract, view, and characterization of the BioInfoMark benchmarks on Pentium 4 analyze genomic information, becomes essential. microarchitecture can be found in [4]. 1
  • 2. The rest of this paper is organized as follows. Section 2 provides an introductory background on biology and a brief review of bioinformatics study areas. Section 3 describes the BioInfoMark suite, including benchmark functionality, input datasets, benchmark compilation and execution, the pre- compiled Alpha binaries, and the generated simulation points[5]. Section 4 concludes the paper and outlines our future work. 2. BACKGROUND To help readers to understand the BioInfoMark benchmarks better, we first provide an introductory background on biology Figure 3. Three dimensional structure of human foetal and illustrate the major areas of bioinformatics. deoxyhaemoglobin (PDB id = 1FDH). 2.1 Introduction: DNA, Gene and Proteins 2.2 Bioinformatics Problems One of the fundamental principles of biology is that within In this section, we illustrate the major problems in each cell, DNA that comprises the genes encodes RNA which bioinformatics, including sequence analysis, phylogeny, in turn produces the proteins that regulate all of the biological protein structure analysis/prediction and molecular dynamics. processes within an organism. 2.2.1 Sequence Analysis DNA is a double chain of simpler molecules called nucleotides, tied together in a double helix helical structure Sequence analysis is perhaps the most commonly performed (Figure 2). The nucleotides are distinguished by a nitrogen task in bioinformatics. Sequence analysis can be defined as the base that can be of four kinds: adenine (A), cytosine (C), problem of finding which parts of the sequences (nucleotide or guanine (G) and thymine (T). Adenine (A) always bonds to amino acid sequences) are similar and which parts are different. thymine (T) whereas cytosine (C) always bonds to guanine (G), By comparing sequences, researchers can gain crucial forming base pairs. A DNA can be specified uniquely by understanding of their significance and functionality: high listing its sequence of nucleotides, or base pairs. 
Proteins are sequence similarity usually implies significant functional or molecules that accomplish most of the functions of a living structural similarity while sequence differences hold the key cell, determining its shape and structure. A protein is a linear information regarding diversity and evolution. sequence of molecules called amino acids. Twenty different The most commonly used sequence analysis technique is amino acids are commonly found in proteins. Similar to DNA, pairwise sequence comparison. A sequence can be proteins are conveniently represented as a string of letters transformed to another sequence with the help of three edit expressing their sequence of amino acids. operations. Each edit operation can insert a new letter, delete an existing letter, or replace an existing letter with a new one. The alignment of two sequences is defined by the edit operations that transform one into the other. This is usually represented by writing one on top of the other. Insertions and deletions (i.e., gaps) are represented by the dash symbol (“-”). The following example illustrates an alignment between the sequences A= “GAATTCAGTA” and B= “GGATCGTTA”. The objective is to match identical subsequences as far as possible (or equivalently use as few edit operations as possible). In the example, the aligned sequences match in seven positions. Figure 2. DNA molecule. Sequence A GAATTCAGT-A R D D I Once a protein is produced, it folds into a three-dimensional Sequence B GGA-TC-GTTA shape. The positions of the central atoms, called carbon-alpha (C∝), of the amino acids of a protein define its primary Figure 4. Alignment of two sequences (The aligned sequences structure. If a contiguous subsequence of C∝ atoms follows match in seven positions. One replace, two delete, and one some predefined pattern, they are classified as a secondary insert operations, shown by letters R, D, and I, are used.) structure, such as alpha-helix or beta-sheet. 
The relative Alignment of sequences is considered in two different but positioning of the secondary structures define the tertiary related classes: If the entire sequences are aligned, then it is structure. The overall shape of all chains of a protein then called a global alignment. If subsequences of two sequences defines the quaternary structure. are aligned, then it is called a local alignment. Multiple sequence alignment compares more than two sequences: all sequences are aligned on top of each other. Each column is the alignment of one letter from each sequence. The 2
  • 3. following example illustrates a multiple alignment among the sequences A = “AGGTCAGTCTAGGAC”, B = “GGACTGAGGTC”, and C = “GAGGACTGGCTACGGAC”.

Sequence A  -AGGTCAGTCTA-GGAC
Sequence B  --GGACTGA----GGTC
Sequence C  GAGGACTGGCTACGGAC

Figure 5. Multiple alignment of three DNA sequences A, B, and C.

2.2.2 Molecular Phylogeny Analysis

Molecular phylogeny infers lines of ancestry of genes or organisms. Phylogeny analysis provides crucial understanding about the origins of life and the homology of various species on earth. Phylogenetic trees are composed of nodes and branches. Each leaf node corresponds to a gene or an organism. Internal nodes represent inferred ancestors. The evolutionary distance between two genes or organisms is computed as a function of the length of the branches between their nodes and their common ancestors.

Figure 6. Evolutionary relationships between fish models.

Figure 6 shows evolutionary relationships between fish models. This evolutionary tree (based on data from Fishes of the World by J. S. Nelson [6]) illustrates that the last common ancestor of medaka and zebrafish lived more than 110 million years (Myr) ago.

2.2.3 Protein Structure Analysis

Two protein substructures are called similar if their Cα atoms can be mapped to nearby points after translation and rotation of one of the proteins. This can also be considered a one-to-one mapping of amino acids. Usually, structural similarity requires that the amino acid pairs that are considered similar have the same secondary structure type. Structural similarities among proteins provide insight regarding their functional relationship. Figure 7 presents the structural similarity of two proteins.

Three-dimensional structures are known for only a small subset of proteins, because determining them requires expensive wet-lab experimentation. Computationally determining the structure of proteins is an important problem, as it accelerates the experimentation step and reduces expert analysis. Usually, the relationship among the chemical components of proteins (i.e. their amino acid sequences) is used in determining their unique three-dimensional native structures.

Figure 7. The structural similarity between two proteins (source: http://cl.sdsc.edu/).

2.2.4 Molecular Dynamics

In the broadest sense, molecular dynamics is concerned with molecular motion. Motion is inherent to all chemical processes. Simple vibrations, like bond stretching and angle bending, give rise to IR spectra. Chemical reactions, hormone-receptor binding, and other complex processes are associated with many kinds of intra- and intermolecular motions. Molecular dynamics allows the study of the dynamics of large macromolecules, including biological systems such as proteins, nucleic acids (DNA, RNA), and membranes. Dynamic events may play a key role in controlling processes which affect functional properties of biomolecules. In drug design, molecular dynamics is commonly used in the pharmaceutical industry to test properties of a molecule on the computer without the need to synthesize it (which is far more expensive).

2.3 Bioinformatics Databases

A bioinformatics database is an organized body of persistent data (e.g. nucleotide and amino acid sequences, three-dimensional structures). Thanks to the human genome project, there has been growing interest in both the public and private sectors in creating bioinformatics databases. At the end of 2002, there were more than 300 molecular biology databases available worldwide. This section provides a brief overview of several popular and publicly available bioinformatics databases.

An important class of bioinformatics databases is the sequence database. The largest sequence database is NCBI/GenBank [1], which collects all known nucleotide and protein sequences. Other major data sources are EMBL (European Molecular Biology Lab) [7] and DDBJ (DNA Data Bank of Japan) [8]. Two major sources of protein sequences and structures are PDB (Protein Data Bank) [9] and SWISS-PROT [10]. PDB contains the protein structures determined by NMR and X-ray crystallography techniques. SWISS-PROT is a curated protein sequence database which provides a high level of annotation, such as a description of protein function, its domain structure, post-translational modification and other useful information.

3. THE BIOINFORMATICS BENCHMARK SUITE: BIOINFOMARK

To allow computer architecture researchers to explore and evaluate their designs on these emerging applications, we
  • 4. developed a suite of representative bioinformatics workloads, BioInfoMark. Currently, the BioInfoMark package contains 14 applications, which cover a variety of important bioinformatics tools ranging from sequence comparison to molecular dynamics. This section describes the selected programs, which can be classified using the categories we introduced in Section 2.2.

3.1 Sequence Analysis Benchmarks

Blast: The Blast (Basic Local Alignment Search Tool) programs [11] are a set of heuristic methods that are used to search sequence databases for local alignments to a query sequence. The Blast programs are written in C. BlastP and BlastN are the versions of Blast for protein and nucleotide sequences respectively. All Blast programs can be executed using the following command line: ./blastall -p <option> -i <query file> -d <database file> -o <output file>. The option is “blastp” for searching protein sequences or “blastn” for searching nucleotide sequences. The query file is the file which includes the nucleotide or protein sequence to search for. The database file is the database which will be searched. With the provided dataset, the benchmark can be invoked as follows: ./blastall -p blastp -i target.txt -d nr -o output, where target.txt is the homo sapiens hereditary haemochromatosis protein sequence and nr is the non-redundant protein sequence database from NCBI.

Fasta: Similar to Blast, Fasta [12] is a collection of local similarity search programs for sequence databases. While Fasta and Blast both do pairwise local alignment, their underlying algorithms are different. Fasta is programmed in C and can be invoked using the following command line: ./fasta34 <query file> <database file>. The query file and database file have the same meaning as those of Blast. With the provided dataset, the Fasta benchmark can be invoked as: ./fasta34 ./qrhuld.aa ../database/nr > ./output.txt, where qrhuld.aa is a query file that contains the human LDL receptor precursor protein and nr is the same database mentioned above.

Clustal W: Clustal W [13] is a multiple sequence alignment program for nucleotides or amino acids. It first finds a phylogenetic tree for the underlying sequences. It then progressively aligns them one by one based on their ancestral relationship. Clustal W is programmed in C and can be executed as: ./clustalw -batch -infile=<input file> -outfile=<output file>, where the input file includes multiple DNA or protein sequences and the output file records the results of the alignment. The command line used to invoke Clustal W with the provided dataset is: ./clustalw -batch -infile=./input.ext -outfile=./output.ext, where input.ext is a query file that includes 317 Ureaplasma gene sequences from the NCBI Bacteria genomes database and output.ext stores the alignment results among those 317 sequences.

Hmmer: Hmmer [14] employs profile hidden Markov models (profile HMMs) for aligning multiple sequences. Profile HMMs are statistical models of multiple sequence alignments. They capture position-specific information about how conserved each column of the alignment is, and which residues are likely. Hmmer is programmed in the C language. It includes several applications such as hmmbuild, hmmcalibrate and hmmsearch. Among these applications, hmmsearch is widely used to search a sequence database for matches to an HMM. The syntax for invoking the hmmsearch benchmark is: ./hmmsearch <input file> <database file>. With the provided dataset, this benchmark can be executed as: ./hmmsearch ./globin.hmm ./Artemia.fa, where globin.hmm is the example HMM built from the alignment file of 50 aligned globin sequences and Artemia.fa is a FASTA file of brine shrimp globin, which contains nine tandemly repeated globin domains.

Glimmer: Glimmer (Gene Locator and Interpolated Markov Modeler) [15] finds genes in microbial DNA. It uses interpolated Markov models (IMMs) to identify coding and noncoding regions in the DNA. Glimmer is written in the C++ language and can be executed as ./glimmer2 <input sequence> <model file>. The command to invoke this benchmark on the given dataset is ./glimmer2 ./NC_000907.fna ./glimmer.icm, where NC_000907.fna is the sequence of the bacterium Haemophilus influenzae and glimmer.icm is the collection file of Markov models.

Emboss: Emboss (European Molecular Biology Open Software Suite) [16] is a software package programmed in C, which contains a wide variety of programs ranging from sequence alignment and protein motif identification to domain analysis and codon usage analysis. Diffseq, megamerger and shuffleseq are three representative programs in Emboss. Diffseq takes two overlapping, nearly identical sequences and reports the differences between them, together with any features that overlap with these regions. The syntax of this benchmark execution is ./diffseq <seq1> <seq2> -wordsize <wordsize> <output>, where seq1 and seq2 are the two sequences to compare. The wordsize is the word length that the program uses to match sequence words. The output records the differences found between the two sequences. With the provided dataset, diffseq can be invoked as: ./diffseq tembl:ap000504 tembl:af129756 -wordsize 6 report, where ap000504 and af129756 are two homo sapiens genes in the nucleic acid database tembl. Megamerger takes two overlapping nucleic acid sequences and merges them into one sequence. It has the same syntax and input parameters as diffseq. Shuffleseq takes a sequence as input and outputs one or more sequences whose order has been randomly shuffled. It can be invoked with the following command line: ./shuffleseq -shuffle 1000 tembl:af129756 af129756.fasta. This means that the program will shuffle the example nucleic acid sequence 1000 times and produce the output file af129756.fasta.

3.2 Molecular Phylogeny Analysis Benchmarks

Phylip: Phylip (PHYLogeny Inference Package) [17] is a package of programs for inferring phylogenies (evolutionary trees). Methods that are available in the package include parsimony, distance matrix, maximum likelihood, bootstrapping, and consensus trees. Data types that can be handled include molecular sequences, gene frequencies, restriction sites and fragments, distance matrices, and discrete
  • 5. characters. The Phylip package is programmed in C. Dnapenny and promlk are typical applications in Phylip. Dnapenny is a program that finds all of the most parsimonious trees of the input data. Promlk implements the maximum likelihood method for protein amino acid sequences. Both can run either from the command line or interactively. To provide deterministic execution, we provide an execution script to invoke the two benchmarks.

3.3 Protein Structure Analysis Benchmarks

DALI: Dali [18] performs pairwise structure comparison and also finds the structural neighbors of a protein by comparing it against the proteins in the PDB. By default, Dali is accessible only through the network and is too complex and large to install, so the DaliLite distribution, programmed in Perl and Fortran 77, was developed for local and efficient use. It has the core algorithmic functionality of the Dali server. The input is two sets of atomic coordinates of proteins in PDB format. With the provided dataset, Dali can be invoked as: ./DaliLite -pairwise ./pdb/1DPS.pdb ./pdb/2AV8.pdb, where 1DPS.pdb is a DNA-binding protein and 2AV8.pdb is an oxidoreductase, an enzyme that catalyzes an oxidation-reduction reaction.

CE: CE (Combinatorial Extension) [19] finds structural similarities between the primary structures of pairs of proteins. CE first aligns small fragments from two proteins. Later, these fragments are combined and extended to find larger similar substructures. CE is written in C. It can be invoked as ./CE - ./1hba.pdb - ./4hhb.pdb - ./scratch, where 1hba.pdb and 4hhb.pdb are different types of hemoglobin, the protein used to transport oxygen, and scratch is a directory that stores temporary files generated during execution.

Predator: Predator [20] predicts the secondary structure of a protein sequence or a set of sequences based on their amino acid sequences. Predator is also programmed in C. It can be launched using the following command: ./predator -a -l <seq> -f<output>. With the provided dataset, Predator can be executed as: ./predator -a -l eukaryota_100.seq -feukaryota_100.out, where eukaryota_100.seq includes 100 Eukaryote protein sequences from the NCBI genomes database and eukaryota_100.out is the result of the secondary structure prediction.

3.4 Molecular Dynamics Simulation Benchmarks

Gamess: Gamess (General Atomic and Molecular Electronic Structure System) [21] is a general ab initio quantum chemistry package. Gamess can compute SCF wave functions and a variety of molecular properties, ranging from simple dipole moments to frequency dependent hyperpolarizabilities. Gamess is written in Fortran 77. Gamess can be invoked as ./runall >& ./runall.log. It uses 37 short but diverse examples named EXAM*.INP as the input dataset.

3.5 BioInfoMark Alpha Binaries for Simulation-based Studies

To allow computer architecture researchers to simulate the BioInfoMark benchmarks using execution-driven and cycle-accurate architecture research frameworks (such as SimpleScalar, SimAlpha, and M5), we made an extra effort to produce Alpha binaries for the majority of the BioInfoMark benchmarks. We have tested all pre-compiled Alpha binaries (built with the static link option) using the SimpleScalar sim-outorder simulator. The pre-compiled binaries are available in the BioInfoMark package.

Table 1. Benchmarks with pre-compiled Alpha binaries (all binaries have been successfully tested on the SimpleScalar sim-outorder simulator)

Benchmark    Input Dataset
fasta34      human LDL receptor precursor protein, NCBI nr database
clustalw     317 Ureaplasma gene sequences from the NCBI Bacteria genomes database
hmmsearch    a profile HMM built from the alignment of 50 globin sequences, uniprot_sprot.dat from SWISS-PROT
glimmer2     18 bacteria complete genomes from the NCBI genomes database
diffseq      nucleic acid database EMBL
megamerger   nucleic acid database EMBL
shuffleseq   nucleic acid database EMBL
dnapenny     ribosomal RNAs from bacteria and mitochondria
promlk       protein amino acid sequences of 17 species ranging from a deep branching bacterium to humans
predator     100 Eukaryote protein sequences from the NCBI genomes database

3.6 Simulation Points of BioInfoMark Workloads

Our earlier study [4] shows that bioinformatics applications can execute billions of instructions before completion. Therefore, it is infeasible to simulate entire benchmark executions using detailed cycle-accurate simulators. Recently, the computer architecture research community has widely adopted the SimPoint [5] methodology as an efficient way to simulate the representative execution phases of a workload. We used the SimPoint framework developed by Calder et al. to generate the simulation points of the BioInfoMark benchmarks listed in Table 1. Since the total number of instructions varies significantly between benchmarks, we used the criteria suggested in [22] to determine the interval size for each individual benchmark. Table 2 lists the interval size as well as the simulation points for each benchmark.

4. CONCLUSIONS

Bioinformatics applications represent increasingly important computer workloads. In order to apply a quantitative approach in computer architecture design and performance evaluation, there is a clear need to develop a benchmark suite of representative bioinformatics applications. This paper presents a group of programs representative of bioinformatics software. These programs include popular tools used for sequence alignments, molecular phylogeny analysis, protein structure prediction and molecular dynamics. The benchmark suite
  • 6. BioInfoMark is freely available and can be downloaded from www.ideal.ece.ufl.edu/BioInfoMark. In the future, we will explore integrated software/hardware techniques to optimize the performance of bioinformatics applications.

Table 2. The simulation points of BioInfoMark Alpha binaries

Benchmark    Interval (M)  Simulation Points
fasta34      400           412,698,242,326,810,961,503,487,354,1051,932,8,832,792,459,988,54,107,482,808,136,554,996,588
clustalw     850           24,918,906,701,702,674,793,395,252,845,719,883,857,585,858,651,381,817
hmmsearch    680           340,674,330,695,711,75,619,599,54,677,551,672,794,40,404,682,618,370,457,951
glimmer2     20            581,1,45,70,32,13,2,17,1567,8,404,26,84,1572,6,21,494,1109,40,1240
diffseq      35            705,680,444,597,442,255,662,10,3,977,288,443,1006,827,990,1004,343,927,707,256,98,1054,689,964,780,958,824,942
megamerger   35            703,559,254,596,679,443,377,825,259,1067,3,442,901,255,310,593,94,773,976,758,889,781,1058,818,5,560
shuffleseq   300           318,1,718,981,280,115,18,355,261,365,1019,457,194,1018,196,43,254,406,226,775,842,454,894,986,776
dnapenny     140           95,1048,1043,225,672,1,61,413,185,627,51,1014,1056,2,110,911,200,1003,777,970,639,1053,40,438,576,597
promlk       320           21,911,255,1,320,548,876,713,176,319,813,332,818,932,445,807,909,773,580,854,224,719,700,969,472,715,1008,661,375,210
predator     700           272,247,758,195,1143,536,535,166,585,81,233,296,482,640,429,406,88,343,203,403,479,955,37,971

5. REFERENCES

[1] http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
[2] Bioinformation Market Study for Washington Technology Center, Alta Biomedical Group LLC, www.altabiomedical.com, June 2003.
[3] K. Albayraktaroglu et al., BioBench: A Benchmark Suite of Bioinformatics Applications, International Symposium on Performance Analysis of Software and Systems, 2005.
[4] Y. Li, T. Li, T. Kahveci and J. Fortes, Workload Characterization of Bioinformatics Applications on Pentium 4 Architecture, In Proceedings of the International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, 2005.
[5] E. Perelman, G. Hamerly and B. Calder, Picking Statistically Valid and Early Simulation Points, In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2003.
[6] J. S. Nelson, Fishes of the World, John Wiley & Sons, Inc., New York, 1994.
[7] European Molecular Biology Laboratory, http://www.embl-heidelberg.de
[8] DNA Data Bank of Japan, http://www.ddbj.nig.ac.jp/
[9] The RCSB Protein Data Bank, http://www.rcsb.org/pdb/
[10] The UniProt/Swiss-Prot Database, http://www.ebi.ac.uk/swissprot/
[11] S. Altschul, W. Gish, W. Miller, E. W. Meyers and D. J. Lipman, Basic Local Alignment Search Tool, Journal of Molecular Biology, vol. 215, no. 3, pages 403-410, 1990.
[12] W. R. Pearson and D. J. Lipman, Improved tools for biological sequence comparison, Proc. Natl. Acad. Sci., 85 (1988), 3244-3248.
[13] J. D. Thompson, D. G. Higgins, and T. J. Gibson, Clustal W: Improving the Sensitivity of Progressive Multiple Sequence Alignment through Sequence Weighting, Positions-specific Gap Penalties and Weight Matrix Choice, Nucleic Acids Research, vol. 22, no. 22, pages 4673-4680, 1994.
[14] S. R. Eddy, Profile Hidden Markov Models, Bioinformatics Review, vol. 14, no. 9, pages 755-763, 1998.
[15] S. Salzberg, A. Delcher, S. Kasif, and O. White, Microbial Gene Identification using Interpolated Markov Models, Nucleic Acids Research, vol. 26, no. 2, pages 544-548, 1998.
[16] P. Rice, I. Longden, and A. Bleasby, EMBOSS: The European Molecular Biology Open Software Suite, Trends in Genetics, vol. 16, no. 6, pages 276-277, 2000.
[17] J. Felsenstein, PHYLIP - Phylogeny Inference Package (version 3.2), Cladistics, 5: 164-166, 1989.
[18] L. Holm and J. Park, DaliLite Workbench for Protein Structure, Bioinformatics Applications Note, vol. 16, no. 6, pages 566-567, 2000.
[19] I. N. Shindyalov and P. E. Bourne, Protein Structure Alignment by Incremental Combinatorial Extension (CE) of the Optimal Path, Protein Engineering, vol. 11, no. 99, pages 739-747, 1998.
[20] D. Frishman and P. Argos, 75% Accuracy in Protein Secondary Structure Prediction, Proteins, vol. 27, pages 329-335, 1997.
[21] M. W. Schmidt et al., General Atomic and Molecular Electronic Structure System, Journal of Comput. Chem., vol. 14, pages 1347-1363, 1993.
[22] G. Hamerly, E. Perelman and B. Calder, How to Use SimPoint to Pick Simulation Points, ACM SIGMETRICS Performance Evaluation Review, 2004.
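
As an illustration of how the entries of Table 2 might be consumed, the following Python sketch converts an interval size and a list of simulation points into instruction ranges that a simulator could fast-forward to and then simulate in detail. It is a minimal sketch under stated assumptions, not part of the BioInfoMark package: the function name simulation_windows is hypothetical, the interval size is taken to be in millions of instructions, and the simulation points are assumed to be zero-based interval indices (the zero_based flag covers the alternative).

```python
"""Convert SimPoint-style simulation points into instruction ranges."""

MILLION = 1_000_000


def simulation_windows(interval_m, points, zero_based=True):
    """Return sorted (start, end) instruction ranges, one per simulation point.

    interval_m : interval size in millions of instructions (assumed unit)
    points     : interval indices selected as simulation points
    zero_based : assumption; pass False if the indices count from 1
    """
    size = interval_m * MILLION
    shift = 0 if zero_based else 1
    windows = []
    for point in sorted(points):
        start = (point - shift) * size  # instructions to skip from program start
        windows.append((start, start + size))
    return windows


if __name__ == "__main__":
    # First few simulation points of the glimmer2 row (interval size 20M).
    glimmer2_points = [581, 1, 45, 70, 32]
    for start, end in simulation_windows(20, glimmer2_points):
        print(f"skip {start:>14,d} instructions, then simulate {end - start:,d}")
```

Values of this kind are what a simulator's fast-forward and instruction-limit options would take as input (in SimpleScalar's sim-outorder, the -fastfwd and -max:inst options); the exact semantics of those options should be checked against the simulator's documentation before relying on them.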