1. Genome Data Management
Shabeer Ismaeel
MSc IT, Semester II
Department of Information Technology
2. Contents
• Biological Sciences
• Genetics
• Characteristics of Biological Data
• What is Bioinformatics?
• Human Genome and Availability of Information
• Existing Biological Databases
• Various Branches Benefited
3. Biological Sciences
– The biological sciences encompass an enormous variety of information.
• Environmental science gives us a view of how species live and interact in a world filled with natural phenomena.
• Biology and ecology study particular species.
• Anatomy focuses on the overall structure of an organism, documenting the physical aspects of individual bodies.
• Traditional medicine and physiology break the organism into systems and tissues and strive to collect information on the workings of these systems and the organism as a whole.
4. • Histology and cell biology delve into the tissue and cellular levels and provide knowledge about the inner structure and function of the cell.
- This wealth of information, generated, classified, and stored for centuries, has only recently become a major application of database technology.
5. Genetics
• Genetics has emerged as an ideal field for the application of information technology.
– In a broad sense, it can be thought of as the construction of models based on information about genes and populations, and the seeking out of relationships in that information.
• Genes can be defined as units of heredity.
6. - The study of genetics can be divided into three branches:
Mendelian genetics is the study of the transmission of traits between generations.
Molecular genetics is the study of the chemical structure and function of genes at the molecular level.
Population genetics is the study of how genetic information varies across populations of organisms.
7. The origins of molecular genetics can be traced to two important discoveries:
- The first came in 1869, when Friedrich Miescher discovered nuclein and its primary component, deoxyribonucleic acid (DNA). In subsequent research, DNA and a related compound, ribonucleic acid (RNA), were found to be composed of nucleotides (a sugar, a phosphate, and a base combining to form a nucleic acid) linked into long polymers via the sugar and phosphate.
- The second discovery was the demonstration in 1944 by Oswald Avery that DNA was indeed the molecular substance carrying genetic information.
8. Genes were shown to be composed of chains of nucleic acids arranged linearly on chromosomes and to serve three primary functions:
- Replicating genetic information between generations,
- Providing blueprints for the creation of polypeptides, and
- Accumulating changes, thereby allowing evolution to occur.
Watson and Crick found the double-helix structure of DNA in 1953, which gave molecular biology a new direction.
9. Characteristics of Biological Data
• Biological data exhibits many special characteristics that make the management of biological information a particularly challenging problem.
• The field concerned with managing and analyzing biological information is called bioinformatics.
10. What is Bioinformatics?
• Bioinformatics is the field of science in which
biology, computer science, and information
technology merge into a single discipline.
• The ultimate goal of the field is to enable the
discovery of new biological insights as well as to
create a global perspective from which unifying
principles in biology can be detected.
• Three important sub-disciplines within
bioinformatics are:
11. 1. The development of new algorithms and statistics with which to assess relationships among members of large data sets.
2. The analysis and interpretation of various types of data, including nucleotide and amino acid sequences, protein domains, and protein structures.
3. The development and implementation of tools that enable efficient access and management of different types of information.
14. Various characteristics
Biological data is highly complex when compared with most other domains or applications.
The amount and range of variability in data is high.
Schemas in biological databases change at a rapid pace.
Representations of the same data by different biologists will likely be different (even using the same system).
Most users of biological data do not require write access to the database; read-only access is adequate.
15. Most biologists are not likely to have knowledge of the internal structure of the database or about schema design.
The context of data gives added meaning for its use in biological applications.
Defining and representing complex queries is extremely important to the biologist.
Users of biological information often require access to “old” values of the data, particularly when verifying previously reported results.
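One way to picture the requirement for access to "old" values is a record that never overwrites its history. The following is a minimal sketch only: the `VersionedRecord` class, its method names, and the sample annotation strings are hypothetical and are not taken from any of the databases discussed here.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class VersionedRecord:
    """Hypothetical record that keeps every value ever assigned, oldest first."""
    history: List[Tuple[int, str]] = field(default_factory=list)

    def update(self, value: str) -> int:
        """Append a new value and return its version number (starting at 1)."""
        version = len(self.history) + 1
        self.history.append((version, value))
        return version

    def current(self) -> Optional[str]:
        """Return the latest value, or None if nothing has been stored yet."""
        return self.history[-1][1] if self.history else None

    def as_of(self, version: int) -> str:
        """Return the value that was current at the given version."""
        if not 1 <= version <= len(self.history):
            raise KeyError(f"no such version: {version}")
        return self.history[version - 1][1]


# Made-up annotation values, for illustration only.
record = VersionedRecord()
v1 = record.update("function unknown")
v2 = record.update("putative kinase")

print(record.current())   # latest annotation
print(record.as_of(v1))   # the value a previously reported result relied on
```

Because updates only append, a result published against version 1 can still be checked after the annotation has changed.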
16. What is the Human Genome?
- The term genome is defined as the total genetic information that can be obtained about an entity. For example, the human genome generally refers to the complete set of genes required to create a human being.
- The number of genes is estimated to be more than 30,000, spread over 23 pairs of chromosomes, with an estimated 3 to 4 billion nucleotides.
- The goal of the Human Genome Project (HGP), which began in 1990, is to obtain the complete sequence (the ordering of the bases) of those nucleotides.
17. Existing Biological Databases
• Some of the existing database systems that support or have grown out of the Human Genome Project include:
• GenBank
– The most notable DNA sequence database in the world today is GenBank, maintained by the National Center for Biotechnology Information (NCBI) of the National Library of Medicine (NLM).
– Established in 1978 as a central repository for DNA sequence data.
– Since then it has expanded to include sequence tag data, protein sequence data, three-dimensional protein structure, taxonomy, and links to the medical literature (MEDLINE).
18. - GenBank contains over 31 billion nucleotide bases of
more than 24 million sequences from over 100,000
species with roughly 1400 new organisms being added
each month.
-The database size in flat file format is over 100 GB
uncompressed and has been doubling every 15 months.
-The system is maintained as a combination of flat files,
relational databases, and files containing Abstract Syntax
Notation One (ASN.1, a standard notation for defining data
structures, with associated rules for encoding and decoding data).
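The growth figure above can be turned into a rough projection. The sketch below assumes the doubling period stays constant at 15 months, which the slide does not claim will hold indefinitely; the function name `projected_size_gb` is made up for illustration.

```python
def projected_size_gb(initial_gb: float, months: float,
                      doubling_months: float = 15.0) -> float:
    """Size after `months`, assuming steady doubling every `doubling_months`."""
    return initial_gb * 2 ** (months / doubling_months)


# Starting from the 100 GB flat-file size quoted above.
for months in (15, 30, 60):
    print(months, "months:", projected_size_gb(100.0, months), "GB")
```

Under that assumption the flat files would reach 400 GB after 30 months and 1,600 GB after five years, which is why storage and indexing strategy matter for a sequence database.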
19. • The Genome Database (GDB)
- Created in 1989, GDB is a catalog of human gene mapping data. Gene mapping is a process that associates a piece of information with a particular location on the human genome.
- The GDB system is built around Sybase, a commercial relational DBMS, and its data are modeled using standard Entity-Relationship techniques.
- GDB distributes a Database Access Toolkit.
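To illustrate what Entity-Relationship modeling of mapping data can look like in a relational system, here is a small sketch. It is not GDB's actual schema: the `locus` and `map_position` tables, their columns, and the sample row are all invented, and Python's built-in `sqlite3` module stands in for Sybase.

```python
import sqlite3

# In-memory database; sqlite3 is used here only as a stand-in for Sybase.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE locus (                -- entity: a named gene or marker
    locus_id INTEGER PRIMARY KEY,
    symbol   TEXT NOT NULL UNIQUE
);
CREATE TABLE map_position (         -- relationship: locus mapped to a location
    position_id INTEGER PRIMARY KEY,
    locus_id    INTEGER NOT NULL REFERENCES locus(locus_id),
    chromosome  TEXT NOT NULL,
    band        TEXT NOT NULL
);
""")

# Made-up example data.
conn.execute("INSERT INTO locus (locus_id, symbol) VALUES (1, 'GENE_A')")
conn.execute(
    "INSERT INTO map_position (locus_id, chromosome, band) VALUES (1, '7', 'q31')"
)

# Associate each locus with its location on the genome.
rows = conn.execute("""
    SELECT l.symbol, m.chromosome, m.band
    FROM locus AS l
    JOIN map_position AS m ON m.locus_id = l.locus_id
""").fetchall()
print(rows)
```

Keeping the location in its own table lets one locus carry several mapped positions, for example from different mapping methods, without duplicating the locus itself.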
20. Online Mendelian Inheritance in Man
• Online Mendelian Inheritance in Man (OMIM) is an electronic collection of information on the genetic basis of human disease.
• In 1991 its administration was transferred from Johns Hopkins University to the NCBI (National Center for Biotechnology Information), and the entire database was converted to NCBI's GenBank format. Today it contains more than 14,000 entries.
21. EcoCyc
– The Encyclopedia of Escherichia coli Genes and Metabolism (EcoCyc) is a recent experiment in combining information about the genome and the metabolism of E. coli K-12 (a bacterium).
– The database was created in 1996 as a collaboration between Stanford Research Institute and Marine Biological Laboratory.
22. Gene Ontology
– The Gene Ontology (GO) Consortium was formed in 1998 as a collaboration among three model organism databases: FlyBase, Mouse Genome Informatics (MGI), and the Saccharomyces (yeast) Genome Database (SGD).
• The goal is to produce a structured, precisely defined, common, controlled vocabulary for describing the roles of genes and gene products in any organism.
• The latest release of the GO database has over 13,000 terms and more than 18,000 relationships between terms.
• GO was implemented using MySQL, an open source relational database, and a monthly database release is available in SQL and XML (Extensible Markup Language) formats.
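A controlled vocabulary with relationships between terms maps naturally onto two relational tables. The sketch below is an assumption-laden illustration, not the real GO schema: the `term` and `relationship` tables, the `T1`..`T3` identifiers, and the helper `ancestors` are hypothetical, and `sqlite3` is used in place of MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE term (
    term_id TEXT PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE relationship (
    child_id  TEXT NOT NULL REFERENCES term(term_id),
    parent_id TEXT NOT NULL REFERENCES term(term_id),
    rel_type  TEXT NOT NULL
);
""")

# Illustrative terms and links only; identifiers are placeholders.
conn.executemany("INSERT INTO term VALUES (?, ?)", [
    ("T1", "process"),
    ("T2", "metabolic process"),
    ("T3", "carbohydrate metabolic process"),
])
conn.executemany("INSERT INTO relationship VALUES (?, ?, ?)", [
    ("T2", "T1", "is_a"),
    ("T3", "T2", "is_a"),
])


def ancestors(conn, term_id):
    """Return the names of all terms reachable by following parent links."""
    rows = conn.execute("""
        WITH RECURSIVE up(term_id) AS (
            SELECT parent_id FROM relationship WHERE child_id = ?
            UNION
            SELECT r.parent_id
            FROM relationship AS r
            JOIN up ON r.child_id = up.term_id
        )
        SELECT t.name
        FROM up JOIN term AS t ON t.term_id = up.term_id
        ORDER BY t.term_id
    """, (term_id,)).fetchall()
    return [name for (name,) in rows]


print(ancestors(conn, "T3"))
```

Walking the parent links is what lets a query for a general term also find gene products annotated with any of its more specific descendants.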
23. Summary of the Major Genome-Related Databases
24. Various Branches Benefited
• Medicine
• Pharmacogenomics
• Biotechnology
• Bioinformatics
• Proteomics