Data Retrieval SystemsText-based DatabaseSearching
The amount of biologically relevant dataaccessible via the WWW is increasing at a veryrapid rate.It is important for Scientists to have easy andefficient ways of wading through the data andfinding what is important for their research.Knowing how to access and search forinformation in the database is essential.
Depending on the type of data at hand, thereare two basic ways of searching:• using descriptive words to search textdatabases• using a nucleotide or protein sequence tosearch sequence databases
Text-based Database SearchingThere are three important data retrieval systemsof particular relevance to molecular biologists:Entrez (at NCBI)Sequence Retrieval System, SRS (at EBI)DBGET/LinkDB (at Japan)The advantage of these retrieval systems is thatthey not only return matches to a query, but alsoprovide handy pointers to additional importantinformation in related databases
Text-based Database SearchingThe three systems differ in the databases theysearch and the links they provide to otherinformation.In using any of these systems, queries can be assimple as entering the accession number of anewly published sequence or as complex assearching multiple database fields for specificterms
Text-based Database SearchingBasic Search Concepts• Boolean Search – An advanced query search using two or moreterms, using Boolean operator AND, OR, NOT, default – AND• Broadening the Search – If the results of a search produce nouseful entries, change or remove terms• Narrowing the Search – If the results of a search produce toomany entries, change or add terms• Proximity Searching – To search with multiword terms orphrases, place quotes around the terms• Wild Card – The character * prepended or appended to a searchterm make a search less specific., e.g., to look for all authors withlast name Zav, search using Zav*
EntrezEntrez - is a molecular biology database andretrieval system developed by the National Centerfor Biotechnology Information (NCBI)It is an entry point for exploring distinct butintegrated databases.Reference:Entrez: Molecular Biology Database and Retrieval System,Schuler GD, Epstein JA, Ohkawa H, Kans JA, MethodsEnzymol. 266, 141-62, 1996.(http://www.ncbi.nlm.nih.gov/Entrez/)
EntrezThe Entrez system provides access to:• Nucleotide sequence databases – GenBank/DDBJ/EBI• Protein sequence databases - Swiss-Prot, PIR, PRF, PDB,and translated protein sequences from DNA sequence databases• Genome and chromosome mapping data• Molecular Modeling 3-D structures Database• Literature database, PubMed - provides excellent andeasy access to MEDLINE and pre-MEDLINE articles.• Taxonomy database - allows retrieval of DNA and proteinsequences for any taxonomic group.• Specialized Databases – OMIM, dbSNP, UniSTS, etc.
EntrezThe most valuable feature of Entrez is- its exploitation of the concept of ‘neighbouring’- which allows related articles in differentdatabases to be linked to each other, whether ornot they are cross-referenced directly.Neighbours and links are listed in the order ofsimilarity to the query.The similarity is based on pre-computed analysesof sequences, structures and the literature.
EntrezOne particularly useful feature in Entrez is –The ability to retrieve large sets of data based onsome criterion and to download them to a localcomputer – Batch Entrez- allowing these sequences to be worked on usinganalytical tools available on local computer.
Entrez Features• Entrez Global Query - Search a subset of Entrez databases• Batch Entrez - Upload a file of GI or accession numbers toretrieve sequences• Making Links Entrez - Linking to PubMed and GenBank• E-Utilities - Entrez programming utilities• LinkOut - External links to related resources• Cubby - provides with a Stored Search feature to store andupdate searches, allows to customize your LinkOut display
SRSThe Sequence Retrieval System (SRS) – a networkbrowser for databases in molecular biology.It is a powerful sequence information indexing, searchand retrieval system (http://srs.ebi.ac.uk/)Reference:1. T. Etzold, A. Ulyanov, and P. Argos, SRS: InformationRetrieval System for Molecular Biology Data Banks.Methods in Enzymology, 266, 114, 1996.2. T. Etzold and P. Argos, SRS - An Indexing and Retrieval Toolfor Flat File Data Libraries. Comp. Appl. Biosciences(CABIOS), 9, 49-57, 1993.
SRSSRS is a homogeneous interface to over 80 biologicaldatabases developed at the European BioinformaticsInstitute (EBI) at Hinxton, UK.The types of databases included are sequence andsequence related, metabolic pathways, transcriptionfactors, application results (e.g., BLAST), protein 3D-structure, genome, mapping, mutations, and locus-specific mutations.One can access and query their contents and navigateamong them.
SRSThe Web page listing all the databases contains alink to a description page about the database andincludes the date of last update.One can select one or more databases to searchbefore entering the query.Over 30 versions of SRS are currently running onthe WWW. Each includes a different subset ofdatabases and associated analytical tools.
SRSSRS Features:• SRS databases are well indexed, thus reducing thesearch time for the large number of potential databases• SRS allows any flat file database to be indexed to anyother. The advantage being the derived indices may berapidly searched, allowing users to retrieve, link andaccess entries from all the interconnected resources• The system has the particular strength that it can bereadily customized to use any defined set of databanks.
SRSSimple SRS queries• By accession number• Query on accession number: J00231• By a simple author or organism name• Query on author and organism: AusubelAND Rhizobium• Boolean relations between keywords: and, or,but not
SRScontd….• Searching by dates: 01-Jan-1995: 31-Dec-1995• Searching by size: 400:600• Using hypertext links in an entry: Medline,Swiss-Prot, and PDB entries can be linked from withinthe EMBL database• Display of molecules via Rasmol plug-in
DBGETDBGET/LinkDB - is an integrated bioinformaticsdatabase retrieval system at GenomeNet, developed bythe Institute for Chemical Research, Kyoto University,and the Human Genome Center of the University ofTokyo.References:1. M. Kanehisa, Linking databases and organisms: GenomeNetresources in Japan. Trends Biochem Sci. 22, 442-444 (1997).2. Fujibuchi, W., Goto, S., Migimatsu, H., Uchiyama, I.,Ogiwara, A., Akiyama, Y., & Kanehisa, M.;DBGET/LinkDB: an integrated database retrieval system.Pacific Symp. Biocomputing 1998, 683-694 (1997).
DBGETDBGET - is used to search and extract entriesfrom a wide range of molecular biology databasesLinkDB - is used to compute links betweenentries in different databases.It is designed to be a network distributed databasesystem with an open architecture, which issuitable for incorporating local databases orestablishing a server environment.http://www.genome.ad.jp/dbget/
DBGETDBGET/LinkDB is integrated with other search tools,such as BLAST, FASTA and MOTIF to conduct furtherretrievals instantly.DBGET provides access to about 20 databases, which arequeried one at a time. After querying one of thesedatabases, DBGET presents links to associatedinformation in addition to the list of resultsA unique feature of DBGET is its connection with theKyoto Encyclopedia of Genes and Genomes (KEGG)database - a database of metabolic and regulatorypathways
DBGETDBGET has simpler, but more limited searchmethods than either SRS or Entrez.The architecture of the DBGET system:
DBGETDBGET has three basic commands (or three basicmodes in the Web version), bfind, bget, andblink, to search and extract database entries.bget – performs the retrieval of database entriesspecified by the combination of dbname:identifierbfind – is used for searching entries by keywordsNotable feature of DBGET, different from othertext search systems, is that no keyword indexingis performed when a database is installed orupdated.
DBGETSelected fields are extracted and stored in separatefiles for bfind searches.- an advantage for rapid database updates, butsometimes a disadvantage for elaborate searchingTo supplement bfind, the full text search STAG isprovidedblink - the LinkDB search. Once entries ofinterest are found, it can be used to retrieverelated entries in a given database or all databasesin GenomeNet
ExampleLet’s consider an example to show how eachsystem can be used to retrieve the SwissProt entryP04391, an ornithine carbamoyltransferase proteinin Escherichia coliIn Entrez, enter the name P04391 in the proteindatabase query form and view the entry andassociated links and neighbours
Example - LinkDBHowever, the fastest way of gathering the relatedinformation for this entry is to search LinkDB.By simply entering swissprot:P04391, a list of alllinks to all the related databases is displayed