Understand the difference between data and information;
What is the purpose of a database system;
How to select a database system;
Database definitions and fundamental building blocks;
Database development: the first steps;
Quality control issues;
Data entry considerations;
What is a database
A database is any organized collection of data. Some examples of databases you may encounter in your daily life are:
a telephone book
airline reservation system
motor vehicle registration records
papers in your filing cabinet
files on your computer hard drive.
Data vs. information: What is the difference?
What is data?
Data can be defined in many ways. Information science defines data as unprocessed information.
What is information?
Information is data that have been organized and communicated in a coherent and meaningful manner.
Data is converted into information, and information is converted into knowledge.
Knowledge: information evaluated and organized so that it can be used purposefully.
Why do we need a database?
Keep records of our:
To keep a record of activities and interventions;
Keep sales records;
What is the ultimate purpose of a database management system? To transform: Data -> Information -> Knowledge -> Action
More about database definition
What is a database?
Quite simply, it’s an organized collection of data. A database management system (DBMS) such as Access, FileMaker, Lotus Notes, Oracle or SQL Server provides you with the software tools you need to organize that data in a flexible manner. It includes tools to add, modify or delete data from the database, ask questions (or queries) about the data stored in the database, and produce reports summarizing selected contents.
For example: Databases in Bioinformatics
Aspira Association MIS
What is a database?
A collection of...
searchable (index) -> table of contents
updated periodically (release) -> new edition
cross-referenced (hyperlinks) -> links with other databases
A database also includes the associated tools (software) needed for database access, updating, information insertion, and information deletion.
Types of Databases
Non-relational databases place information in field categories that we create so that information is available for sorting and disseminating the way we need it. The data in a non-relational database, however, is limited to that program and cannot be extracted and applied to other software programs, or to other database files within a school or administrative system. The data can only be "copied and pasted." Example: a spreadsheet.
In relational databases, fields can be used in a number of ways (and can be of variable length), provided that they are linked in tables. A relational database is built on a database model that provides logical connections among files (known as tables) by including identifying data from one table in another table.
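The idea of connecting tables by including identifying data from one table in another can be sketched with Python's built-in sqlite3 module. The table names, column names, and sample rows below are hypothetical and chosen only for illustration; they do not come from the slides.

```python
import sqlite3

# In-memory database so the sketch leaves nothing on disk.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # ask SQLite to enforce the link

# Hypothetical schema: enrollment.student_id repeats the identifying
# value (primary key) of a row in the student table.
conn.executescript("""
CREATE TABLE student (
    student_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE enrollment (
    enrollment_id INTEGER PRIMARY KEY,
    student_id    INTEGER NOT NULL REFERENCES student(student_id),
    course        TEXT NOT NULL
);
""")

conn.executemany("INSERT INTO student VALUES (?, ?)",
                 [(1, "Ana"), (2, "Luis")])
conn.executemany("INSERT INTO enrollment (student_id, course) VALUES (?, ?)",
                 [(1, "Biology"), (1, "History"), (2, "Biology")])

# The shared student_id lets us combine the two tables in one query.
rows = conn.execute("""
    SELECT student.name, enrollment.course
    FROM enrollment
    JOIN student ON student.student_id = enrollment.student_id
    ORDER BY student.name, enrollment.course
""").fetchall()

for name, course in rows:
    print(name, course)
```

Each student's name is stored once, and every enrollment refers to it through the shared key, which is what distinguishes this layout from a single flat spreadsheet.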
In computer science, a data structure is a particular way of storing and organizing data in a computer so that it can be used efficiently.
Data structures are used in almost every program or software system.
Different kinds of data structures are suited to different kinds of applications, and some are highly specialized to specific tasks. For example, B-trees are particularly well-suited for implementation of databases, while compiler implementations usually use hash tables to look up identifiers.
Data structures are generally based on the ability of a computer to fetch and store data at any place in its memory, specified by an address: a bit string that can itself be stored in memory and manipulated by the program.
The implementation of a data structure usually requires writing a set of procedures that create and manipulate instances of that structure.
Common data structures
Array: a systematic arrangement of objects, usually in rows and columns.
Linked list: a linked list (or more precisely, a "singly-linked list") is a data structure that consists of a sequence of nodes, each of which contains a reference (i.e., a link) to the next node in the sequence.
Hash table: a hash table or hash map is a data structure that uses a hash function to map identifying values, known as keys (e.g., a person's name), to their associated values (e.g., their telephone number).
Heap: a specialized tree-based data structure that satisfies the heap property.
B-tree: a tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic amortized time.
Red-black tree: a type of self-balancing binary search tree, typically used to implement associative arrays. It organizes pieces of comparable data, such as text fragments or numbers.
Trie: a trie, or prefix tree, is an ordered tree data structure that is used to store an associative array where the keys are usually strings.
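A few of the structures listed above can be sketched in Python. This is a minimal illustration, not a reference implementation: the hash table and heap use Python's built-in dict and the standard heapq module, while the Node class and the trie helper functions are hypothetical names written for this example.

```python
import heapq

# Hash table: Python's dict maps keys (names) to values (phone numbers).
phone_book = {"Alice": "555-0100", "Bob": "555-0101"}
alice_number = phone_book["Alice"]

# Heap: heapq maintains the min-heap property on a plain list,
# so the smallest item is always popped first.
heap = []
for priority in [5, 1, 4, 2]:
    heapq.heappush(heap, priority)
smallest = heapq.heappop(heap)

# Singly linked list: each node holds a value and a link to the next node.
class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next_node = next_node

def to_list(head):
    """Walk the links from head to the end and collect the values."""
    values = []
    while head is not None:
        values.append(head.value)
        head = head.next_node
    return values

head = Node(1, Node(2, Node(3)))

# Trie (prefix tree) built from nested dicts, one level per character.
def trie_insert(root, word):
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node["$"] = True  # marks the end of a complete word

def trie_has_prefix(root, prefix):
    node = root
    for ch in prefix:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = {}
for word in ["gene", "genome", "protein"]:
    trie_insert(trie, word)

print(alice_number, smallest, to_list(head), trie_has_prefix(trie, "gen"))
```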
Most assembly languages and some low-level languages (e.g., BCPL) generally lack support for data structures.
Many high-level programming languages, and some higher-level assembly languages (e.g., MASM), on the other hand, have special syntax or other built-in support for certain data structures.
Most programming languages are supported by standard libraries that implement the most common data structures, e.g., the C++ Standard Template Library, the Java Collections Framework, and Microsoft's .NET Framework.
In the field of bioinformatics, a sequence database is a large collection of computerized ("digital") nucleic acid sequences, protein sequences, or other sequences stored on a computer. A database can include sequences from only one organism (e.g., a database for all proteins in Saccharomyces cerevisiae), or it can include sequences from all organisms whose DNA has been sequenced.
Example: protein structure database. In biology, a protein structure database is a database that is modeled around the various experimentally determined protein structures. The aim of most protein structure databases is to organize and annotate the protein structures, providing the biological community access to the experimental data in a useful way.
Examples of protein structure databases include (in alphabetical order):
Database of Macromolecular Movements: describes the motions that occur in proteins and other macromolecules, particularly using movies.
JenaLib: the Jena Library of Biological Macromolecules, aimed at a better dissemination of information on three-dimensional biopolymer structures, with an emphasis on visualization and analysis.
MODBASE: a database of three-dimensional protein models calculated by comparative modeling.
PDBe: the European resource for the collection, organisation and dissemination of data on biological macromolecular structures, and a member of the Worldwide Protein Data Bank.
OCA: a browser-database for protein structure/function. OCA integrates information from KEGG, OMIM, PDBselect, Pfam, PubMed, SCOP, SwissProt, and others.
OPM: provides spatial positions of protein three-dimensional structures with respect to the lipid bilayer.
PDB Lite: derived from OCA, PDB Lite was provided to make it as easy as possible to find and view a macromolecule within the PDB.
PDBsum: provides an overview of macromolecular structures in the PDB, giving schematic diagrams of the molecules in each structure and of the interactions between them.
PDBTM: the Protein Data Bank of Transmembrane Proteins, a selection of the PDB.
PDBWiki: a community-annotated knowledge base of biological molecular structures.
Protein: the NIH protein database, a collection of sequences from several sources, including translations from annotated coding regions in GenBank, RefSeq and TPA, as well as records from SwissProt, PIR, PRF, and PDB.
Proteopedia: the collaborative, 3D encyclopedia of proteins and other molecules. A wiki that contains a page for every entry in the PDB (>50,000 pages), with a Jmol view that highlights functional sites and ligands. It offers an easy-to-use scene-authoring tool so you don't have to learn the Jmol script language to create customized molecular scenes. Custom scenes are easily attached to "green links" in descriptive text that display those scenes in Jmol.
SCOP: the Structural Classification of Proteins, a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known.
SWISS-MODEL Repository: a database of annotated protein models calculated by homology modeling.
TOPSAN: the Open Protein Structure Annotation Network, a wiki designed to collect, share and distribute information about protein three-dimensional structures.
Retrieved from "http://en.wikipedia.org/wiki/Protein_structure_database"
Definition: the term "sequence analysis" in biology implies subjecting a DNA or peptide sequence to sequence alignment, sequence database searches, repeated sequence searches, or other bioinformatics methods on a computer.
Sequence analysis in molecular biology and bioinformatics is an automated, computer-based examination of characteristic fragments, e.g., of a DNA strand. It covers the following topics:
The comparison of sequences in order to find similarity and dissimilarity in compared sequences (sequence alignment)
Identification of gene structures, reading frames, distributions of introns and exons, and regulatory elements
Finding and comparing point mutations or single nucleotide polymorphisms (SNPs) in an organism in order to obtain genetic markers.
Revealing the evolution and genetic diversity of organisms.
Function annotation of genes.
In chemistry, sequence analysis comprises techniques used to determine the sequence of a polymer formed of several monomers. In molecular biology and genetics, the same process is called simply "sequencing".
In marketing, sequence analysis is often used in analytical customer relationship management applications, such as NPTB models (Next Product to Buy).
Sequence Analysis in Molecular Biology:
Sequence alignment is a way of arranging DNA, RNA, or protein sequences to identify regions of similarity. It generally falls into two types:
Pairwise alignment: alignment between two sequences
Multiple alignment: alignment between more than two sequences
Existing methods for pairwise alignment include the Needleman-Wunsch algorithm, the Smith-Waterman algorithm, and BLAST.
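As an illustration of pairwise alignment, here is a minimal sketch of Needleman-Wunsch global alignment in Python. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for the example; real tools use substitution matrices and more elaborate gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        # Score for placing a[i-1] opposite b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # align the two characters
                score[i - 1][j] + gap,             # gap in b
                score[i][j - 1] + gap,             # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    aligned_a, aligned_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            aligned_a.append(a[i - 1])
            aligned_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(b[j - 1])
            j -= 1

    return score[-1][-1], "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


result = needleman_wunsch("ACGT", "AGT")
print(result)
```

Smith-Waterman follows the same table-filling idea but finds the best local alignment, and BLAST is a faster heuristic search rather than an exact dynamic-programming method.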
The tasks that lie in the space of sequence analysis are often non-trivial to resolve and require the use of relatively complex approaches. Of the many types of methods used in practice, the most popular include:
Artificial Neural Network ,
Hidden Markov Model
Support Vector Machine
List of Computational Chemistry Software – Resources
Bioinformatics Software
Cheminformatics Software
LIMS Software
Computer-Assisted Molecular Modeling Software
CADD - Biopolymer Modeling Software
CADD - General Modeling Software
CADD - Conformational Search Software
CADD - General Tools
CADD - Molecular Mechanics/Dynamics Software
CADD - Quantum Chemistry Software
CADD - Display Software
Structural Chemistry Software
Structural Chemistry Software for Xray Analysis
Structural Chemistry Software for IR Analysis
Structural Chemistry Software for MS Analysis
Structural Chemistry Software for NMR Analysis
General Software Tools
Lists of Software for Bioinformatics:
Sequence databases, for example:
AceDB: a genome database.
BioCyc: databases that provide electronic reference sources on the pathways and genomes of different organisms.
Biopendium: brings together information on sequence, structure and function relationships for all gene products in the public domain.
CAMELEON: a set of multiple sequence alignment tools with links to databases of known 3D structural fragments.
ERGO Light: a curated database of public and proprietary genomic DNA, with connected similarities, functions, pathways, functional models, clusters and more.
Expasy: the site contains a 2-D gel data database, a search engine, and links to several gel databases throughout the world.
GAIA 22: a Chromosome 22 specific version of the GAIA database. GAIA is a data analysis and storage system for genomic sequence and its annotation. As a data analysis engine it accepts raw genomic sequence and automatically adds significant annotation.
GeneCards: a database of human genes, their products and their involvement in diseases.
GENESEQ: was a database of protein and nucleic acid sequences extracted from world-wide patent documents.
GeneWorks: formerly provided integrated sequence analysis and database searching.
ISYS(TM): the National Center for Genome Resources' new product that integrates independent bioinformatic software tools and databases.
OligoMaster: a multi-user oligonucleotide cataloguing application designed to help biologists manage and organise their oligonucleotide collections, available in versions for Windows, Macintosh and Linux.
PhyloPat: provides phylogenetic pattern analysis of eukaryotic genes.
ProteinCenter(™): integrates the contents of a large number of public protein sequence databases and your experimental systems biology data.
Relibase: a web-based tool for searching and analysing protein-ligand structures in the PDB.
ResNet: a comprehensive database of molecular networks and protein interactions, derived from automatic analysis of the whole of PubMed.
The Rosetta Resolver System: provides high-capacity data storage, retrieval and analysis of gene expression data. The system is intended for life science research organizations that need to assess compound specificity or toxicity, identify new genes or therapeutic targets, or compare and analyze large databases of expression profiles.
SGD: a scientific database of the molecular biology and genetics of the yeast Saccharomyces cerevisiae, which is commonly known as baker's or budding yeast.
SRS: a database integration and biological information search system. It is capable of querying 400 different molecular biology, bibliographic, compound data, genetic and medical databases via a single interface.
Software Solution for BioMedicine (SSBM): offers high-speed analysis of both public and proprietary genetic databases within the security of the corporate firewall.
Vector NTI: a Macintosh- and Windows-based molecular biology support system.
AAT: Analysis and Annotation Tool, used to identify genes by comparing cDNA and protein sequence databases.
ABI PRISM
AcaClone pDRAW32
AGCT
AlleleID
Antheprot: protein analysis software.
Array Designer 4
arraySCOUT(TM): a gene expression data analysis application.
Artemis: a free genome viewer and annotation tool.
Asterias: a suite of freely accessible web-based genomic data analysis programs.
Bio Image: a life sciences software information company which carries a wide variety of electrophoresis image analysis software for Windows, Powermac, and UNIX.
BioinformatiX: enterprise software that provides an environment for the analysis of microarray data.
BioRainbow Analysis Tools: a collection of software tools for binding site prediction, weight matrix search, regulatory sequences analysis, microarray analysis, footprint
bioSCOUT®: a comprehensive and customizable bioinformatics package.
BioTools: offers three primary bioinformatics products: GeneTool for DNA sequence analysis, PepTool for protein sequence analysis, and ChromaTool for chromatogram analysis.
BlockSearch: a quantitative method for the elucidation of unknown protein functions.
Bosque (http://bosque.udec.cl): a distributed software environment oriented to managing the computational resources involved in typical phylogenetic analyses.
Clann: software for investigating phylogenomic information using supertrees.
CURVES: by Richard Lavery and Heinz Sklenar, a very useful nucleic acid helical analysis program.
DNADynamo: general-purpose software for DNA and protein sequence analysis.
DNASIS: a robust sequence analysis software package that delivers industry standard functionality.
DNPTrapper: a shotgun sequencing assembly editing tool, specifically designed for finishing and analysis of repeated regions.
EuGene and SAm: a menu-based DNA and protein sequence analysis package.
Genchek: developed by Ocimum Biosolutions, a comprehensive, LIMS-based, user-friendly nucleotide and polypeptide sequence analysis tool with a back-end relational database.
Genehound(™): offers a new, innovative, and exciting approach to identifying coding regions in prokaryotic genomes.
GeneInform: an easy-to-operate gene expression management and analysis tool that saves cost and time by facilitating the collection, storage, analysis, and sharing of gene expression data.
Gene Inspector(™) 1.5: a powerful and versatile combination of an electronic laboratory notebook and sequence analysis package for biologists.
GeneLinker: products described as the easiest way for researchers to start analyzing gene expression data.
GeneJockey: a program for editing, manipulation, and analysis of nucleic acid and protein sequences.
GENEMARK: a gene-finding tool available from the Georgia Institute of Technology that uses an algorithm based on non-homogeneous Markov chain models.
GENEPARSER: a coding region recognition program from the University of Colorado that uses potential similarity between the query sequence and known amino acid sequences.
GeneSifter™: a Web-based microarray analysis system that combines data management and analytical functions with integrated, current gene annotation from databases such as Unigene and LocusLink.
GeneSolve: a single-user desktop software package for analyzing nucleic acid sequence information.
GeneStudio Pro: from GeneStudio, Inc. (http://www.genestudio.com), a newly developed suite of molecular biology programs for Windows.
GeneWorks: integrated sequence analysis and database searching on the Macintosh, previously marketed by Oxford Molecular Group.
GenomeBrowser: a powerful software tool that simplifies the process of analysis, annotation, and manipulation of genetic sequences.
Genie: from LBNL, a gene finder based on generalized hidden Markov models to locate multi-exon genes.
Etc.
Data stored in tables that are associated by shared attributes (keys).
Any data element (or entity) can be found in the database through the name of the table, the attribute name, and the value of the primary key.
Relational Database Definitions
Entity: Object, Concept or event (subject)
Attribute: a Characteristic of an entity
Row or Record: the specific characteristics of one entity
Table: a collection of records
Database: a collection of tables
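The statement that any data element can be located through the table name, the attribute name, and the primary key value can be sketched as a small lookup function using Python's sqlite3 module. The employee table, its columns, its rows, and the lookup helper are all hypothetical and exist only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    "employee_id INTEGER PRIMARY KEY, name TEXT, title TEXT)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(1, "Ana Ruiz", "Analyst"), (2, "Luis Vega", "Manager")],
)

def lookup(conn, table, attribute, key_column, key_value):
    """Return one data element identified by table, attribute, and key value."""
    # Table and column names cannot be passed as SQL parameters, so they are
    # checked against the schema before being placed in the query text.
    tables = {row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}
    if table not in tables:
        raise ValueError(f"unknown table: {table}")
    columns = {row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')}
    if attribute not in columns or key_column not in columns:
        raise ValueError("unknown column")

    row = conn.execute(
        f'SELECT "{attribute}" FROM "{table}" WHERE "{key_column}" = ?',
        (key_value,),
    ).fetchone()
    return row[0] if row else None

# Table "employee", attribute "title", primary key value 2.
title = lookup(conn, "employee", "title", "employee_id", 2)
print(title)
```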
Overview of Phylogenetic Analysis
Phylogenetic analysis is the process you use to determine the evolutionary relationships between organisms.
The results of an analysis can be drawn in a hierarchical diagram called a cladogram or phylogram (phylogenetic tree).
The branches in a tree are based on the hypothesized evolutionary relationships (phylogeny) between organisms.
Each member in a branch, also known as a monophyletic group, is assumed to be descended from a common ancestor.
Originally, phylogenetic trees were created using morphology, but now, determining evolutionary relationships includes matching patterns in nucleic acid and protein sequences.
Example: a phylogenetic tree constructed from mitochondrial DNA (mtDNA) sequences for the family Hominidae. This family includes gorillas, chimpanzees, orangutans, and humans.
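A tree like this can be represented in code as nested groups. The sketch below is only an illustration: the branching order used (humans and chimpanzees grouped together, then gorillas, then orangutans) is the commonly reported topology and is an assumption here, since the slides do not reproduce the tree itself. The function names are hypothetical.

```python
# Each tuple is a branch (clade); each string is an organism at a tip.
tree = ((("human", "chimpanzee"), "gorilla"), "orangutan")

def leaves(clade):
    """List every organism descended from the common ancestor of a clade."""
    if isinstance(clade, str):
        return [clade]
    members = []
    for child in clade:
        members.extend(leaves(child))
    return members

def to_newick(clade):
    """Write the clade in Newick notation, a common text format for trees."""
    if isinstance(clade, str):
        return clade
    return "(" + ",".join(to_newick(child) for child in clade) + ")"

newick = to_newick(tree) + ";"
african_apes = leaves(tree[0])  # one monophyletic group within the tree

print(newick)
print(african_apes)
```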
Searching NCBI for Phylogenetic Data
The NCBI taxonomy Web site includes phylogenetic and taxonomic information from many sources. These sources include the published literature, Web databases, and taxonomy experts. And while the NCBI taxonomy database is not a phylogenetic or taxonomic authority, it can be useful as a gateway to the NCBI biological sequence databases
Principles of data organization
Database --a collection of related structured information about entities
File -- a collection of records
Record--a set of fields
Field --a single characteristic of an entity
Character--a symbol used in data field
Selecting a Database Management System
Database management systems (or DBMSs) can be divided into two categories -- desktop databases and server databases.
Generally speaking, desktop databases are oriented toward single-user applications and reside on standard personal computers (hence the term desktop).
Server databases contain mechanisms to ensure the reliability and consistency of data and are geared toward multi-user applications.
Selecting a database system: Need Analysis
The needs analysis process will be specific to your organization but, at a minimum, should answer the following questions:
How many records will we warehouse, and for how long?
Who will be using the database and what tasks will they perform?
How often will the data be modified? Who will make these modifications?
Who will be providing IT support for the database?
What hardware is available? Is there a budget for purchasing additional hardware?
Who will be responsible for maintaining the data?
Will data access be offered over the Internet? If so, what level of access should be supported?
A File: A group or collection of similar records, like INST6031 Fall Student File, American History 1850-1866 file, Basic Food Group Nutrition File
A record book: a "rolodex" of data records, like address lists, inventory lists, classes or thematic units, or groupings of other unique records that are combined into one list (found in AppleWorks, FileMaker Pro software).
A field: one category of information, e.g., Name, Address, Semester Grade, Academic topic
A record: one piece of data, e.g., one student's information, a recipe, a test question
A layout : a design for a database that contains field names and possibly graphics.
Tables are the fundamental building blocks of any database. If you're familiar with spreadsheets, you'll find database tables extremely similar. Take a look at this example of a table from a sample database:
The table above contains the employee information for our organization -- characteristics like name, date of birth and title. Examine the construction of the table and you'll find that each column of the table corresponds to a specific employee characteristic (or attribute in database terms). Each row corresponds to one particular employee and contains his or her information. That's all there is to it! If it helps, think of each one of these tables as a spreadsheet-style listing of information.
Fundamental building blocks
Where do we start?
Let’s explore your “paper system”
Client intake forms
Job application form
Define required fields from “forms” or required reports
Keep it simple
Identify a unique identifier or primary key
Some Quality Control Considerations
Remember “garbage in – garbage out”. Some examples and how to prevent this.
Quality management encompasses three distinct processes: quality planning, quality control, and quality improvement.
Quality Planning in relation to database systems design:
Who will perform data entry?
Training? On-line help?
How will data entry be performed?
Data entry considerations
Define “must” enter fields – no record is complete unless: such and such is entered;
Make data entry foolproof. Example: grade level can be entered in several ways (8, 8th, or eight). By using a pull-down menu with the correct data format, these mistakes can be avoided.
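The two considerations above, required fields and a fixed set of allowed values, can be sketched as a small validation routine. The field names, the 1 to 12 grade range, and the validate_record function are assumptions made for this example, not part of any particular database product.

```python
# Hypothetical "must enter" fields: a record is incomplete without them.
REQUIRED_FIELDS = ("student_id", "name", "grade_level")

# The options a pull-down menu would offer, so only one format is possible.
GRADE_LEVEL_CHOICES = tuple(range(1, 13))

def validate_record(record):
    """Return a list of problems; an empty list means the record is accepted."""
    errors = []
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            errors.append(f"{field} is required")

    grade = record.get("grade_level")
    if grade not in (None, "") and grade not in GRADE_LEVEL_CHOICES:
        errors.append(
            f"grade_level must be one of "
            f"{GRADE_LEVEL_CHOICES[0]}-{GRADE_LEVEL_CHOICES[-1]}"
        )
    return errors

good = validate_record({"student_id": 1, "name": "Ana", "grade_level": 8})
free_text = validate_record({"student_id": 2, "name": "Luis", "grade_level": "8th"})
incomplete = validate_record({"student_id": 3, "name": "", "grade_level": 7})

print(good, free_text, incomplete)
```

Rejecting free-text values such as "8th" at entry time is cheaper than cleaning them up later, which is the point of "garbage in, garbage out".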
Data Entry – additional considerations
Wireless attached to a Palm or Pocket PC
WiFi 802.11g, Bluetooth
Wireless networks (real-time on demand systems)
PEOPLE THAT WORK WITH DATABASES
communicate with each prospective database user group in order to understand its
develop a specification of each user group’s information and processing needs
develop a specification integrating the information and processing needs of the user groups
document the specification
choose appropriate structures to represent the information specified by the system analysts
choose appropriate structures to store the information in a normalized manner in order to guarantee integrity and consistency of data
choose appropriate structures to guarantee an efficient system
document the database design
implement the database design
implement the application programs to meet the program specifications
test and debug the database implementation and the application programs
document the database implementation and the application programs
Manage the database structure
Manage data activity
Manage the database management system
generate database application performance reports
investigate user performance complaints
assess need for changes in database structure or application design
modify database structure
evaluate and implement new DBMS features
tune the database
Establish the database data dictionary
data names, formats, relationships
cross-references between data and application programs
Parametric end users constantly query and update the database. They use canned transactions to support standard queries and updates.
Casual end users occasionally access the database, but may need different information each time. They use sophisticated query languages and browsers.
Sophisticated end users have complex requirements and need different information each time. They are thoroughly familiar with the capabilities of the DBMS.