Lecture 1 Introduction to Bioinformatics BCH 433.ppt

Introduction to Bioinformatics
BCH 433
Lecture 1
Dr. J. Ikwebe

Molecular Bioinformatics
Molecular Bioinformatics involves the use
of computational tools to discover new
information in complex data sets (from the
one-dimensional information of DNA through
the two-dimensional information of RNA and
the three-dimensional information of proteins,
to the four-dimensional information of
evolving living systems).

Bioinformatics (Oxford English Dictionary):
The branch of science concerned with
information and information flow in biological
systems, especially the use of computational
methods in genetics and genomics.

What is bioinformatics?
• The application of computational tools to
molecular data, including the means to
acquire, analyse, or visualize such data.
• Key tools to handle and analyze the large
amount of data generated by large-scale
DNA, RNA and protein characterization
projects (genomics, transcriptomics,
proteomics).

Biologists
collect molecular data:
DNA and protein sequences,
gene expression, etc.
Computer scientists
(+ mathematicians, statisticians, etc.)
develop tools, software and algorithms
to store and analyze the data.
Bioinformaticians
study biological questions by
analyzing molecular data.
Bioinformatics: the field of science in which biology, computer science and
information technology merge into a single discipline.

....
• Bioinformatics uses computers, computing technology
and software to manage large amounts of biological data
and enable their analysis.
• At the end of this course students will be expected to:
– understand biological data and data management and
integration
– have a broad knowledge of computing and biological methods in
bioinformatics
– understand genomes, genome sequencing, genomic structure
and comparison
– know about the technology used in modern post-genomic
biology, the data produced and the software to manage it.

Introduction
Large databases that can be accessed and analyzed with
sophisticated tools have become central to biological
research and education.
The information content in the genomes of organisms,
in the molecular dynamics of proteins, and in population
dynamics, to name but a few areas, is enormous.
 Biologists are increasingly finding that the management
of complex data sets is becoming a bottleneck for
scientific advances.
Therefore, bioinformatics is rapidly becoming a key
technology in all fields of biology.

Bottlenecks
The present bottlenecks in bioinformatics include:
• the education of biologists in the use of advanced computing
tools,
• the recruitment of computer scientists into this evolving field,
• the limited availability of developed databases of biological
information,
• the need for more efficient and intelligent search engines for
complex databases.

The hereditary information of all living organisms, with
the exception of some viruses, is carried by
deoxyribonucleic acid (DNA) molecules.
2 purines (two rings): adenine (A), guanine (G)
2 pyrimidines (one ring): cytosine (C), thymine (T)

Eukaryotes may have up to 3
subcellular genomes:
1. Nuclear
2. Mitochondrial
3. Plastid
Bacteria have either circular
or linear genomes and may
also carry plasmids
The entire complement of genetic material carried by
an individual is called the genome.
Human chromosomes
Circular genome

Central dogma: DNA makes RNA makes Protein
Modified dogma: DNA makes DNA and RNA, RNA
makes DNA, RNA and Protein

Amino acids - The protein building blocks

Any region of the DNA sequence can, in principle,
code for six different amino acid sequences, because
any one of three different reading frames can be used
to interpret each of the two strands.
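The six reading frames can be made concrete with a short sketch. The Python below is illustrative only: the function names (reverse_complement, translate, six_frames) are not from the lecture, and it assumes the standard genetic code and an input made up of A, C, G and T. Unrecognised codons are shown as X and stop codons as *.

```python
from itertools import product

# Standard genetic code, written in the conventional TCAG codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate consecutive codons; a trailing partial codon is ignored."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Translate both strands in each of their three reading frames."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


if __name__ == "__main__":
    for name, peptide in six_frames("ATGGTGCTGTCT").items():
        print(name, peptide)
```

Only one of the six translations normally corresponds to a real protein; the others are what a program has to rule out when it looks for genes.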

Protein folding
A human Haemoglobin

Some basic definitions
• Genomics - Genome: the total genetic content contained in a
haploid set of chromosomes in eukaryotes, in a single
chromosome in bacteria, or in the DNA or RNA of viruses.
• Transcriptomics - Transcriptome: the complete set of RNA
transcripts produced from a genome.
• Proteomics - Proteome: the complete set of proteins encoded
on a genome that can be expressed and modified by a cell,
tissue, or organism (Etymology: Protein + genome).
– Sub-cellular proteome: the complete set of proteins for a given
membrane or organelle (e.g. mitochondrial proteome).
– Membranome: the complete set of membranes from a cell.
– Metabolome: the metabolic products of the cell, that is, all the
metabolites.
– Secretome: the secreted proteins of a cell.
– Phosphome: the total phosphorylated proteins of a cell.

What does it all look like on a computer monitor?

A cDNA sequence
>gi|14456711|ref|NM_000558.3| Homo sapiens hemoglobin, alpha 1 (HBA1), mRNA
ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCG
CCTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCACC
ACCAAGACCTACTTCCCGCACTTCGACCTGAGCCACGGCTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCGA
CGCGCTGACCAACGCCGTGGCGCACGTGGACGACATGCCCAACGCGCTGTCCGCCCTGAGCGACCTGCACGCGCACA
AGCTTCGGGTGGACCCGGTCAACTTCAAGCTCCTAAGCCACTGCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGCC
GAGTTCACCCCTGCGGTGCACGCCTCCCTGGACAAGTTCCTGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACCG
TTAAGCTGGAGCCTCGGTGGCCATGCTTCTTGCCCCTTGGGCCTCCCCCCAGCCCCTCCTCCCCTTCCTGCACCCGT
ACCCCCGTGGTCTTTGAATAAAGTCTGAGTGGGCGGC
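The record above is in FASTA format: a header line starting with ">" followed by lines of sequence. A minimal sketch of how such a record might be read is given below. The function name parse_fasta is hypothetical, and the example sequence is deliberately truncated to the first bases of the record above.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Truncated example: only the start of the HBA1 mRNA is included here.
EXAMPLE = """>gi|14456711|ref|NM_000558.3| Homo sapiens hemoglobin, alpha 1 (HBA1), mRNA
ACTCTTCTGGTCCCCACAGAC
TCAGAGAGAACCCACCATGG
"""

if __name__ == "__main__":
    header, sequence = parse_fasta(EXAMPLE)[0]
    fields = header.split("|")
    print("accession:", fields[3])
    print("length read:", len(sequence))
```

In this older header style the fields are separated by "|", so the fourth field holds the RefSeq accession with its version number.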

A cDNA sequence (reading frame) and the protein sequence it encodes
>gi|14456711|ref|NM_000558.3| Homo sapiens hemoglobin, alpha 1 (HBA1), mRNA
ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCC
GCCTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCAC
CACCAAGACCTACTTCCCGCACTTCGACCTGAGCCACGGCTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCG
ACGCGCTGACCAACGCCGTGGCGCACGTGGACGACATGCCCAACGCGCTGTCCGCCCTGAGCGACCTGCACGCGCAC
AAGCTTCGGGTGGACCCGGTCAACTTCAAGCTCCTAAGCCACTGCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGC
CGAGTTCACCCCTGCGGTGCACGCCTCCCTGGACAAGTTCCTGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACC
GTTAAGCTGGAGCCTCGGTGGCCATGCTTCTTGCCCCTTGGGCCTCCCCCCAGCCCCTCCTCCCCTTCCTGCACCC
GTACCCCCGTGGTCTTTGAATAAAGTCTGAGTGGGCGGC
>gi|4504347|ref|NP_000549.1| alpha 1 globin [Homo sapiens]
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAH
VDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
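The relationship between the mRNA and the protein above can be checked with a few lines of code. The sketch below assumes the standard genetic code and simply starts at the first ATG it finds, which is a simplification (real start-codon selection is more involved). The name translate_from_first_atg is hypothetical, and FRAGMENT is a short stretch copied from the HBA1 sequence above.

```python
from itertools import product

# Standard genetic code, written in the conventional TCAG codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate_from_first_atg(mrna):
    """Translate from the first ATG up to the first stop codon (or the end)."""
    mrna = mrna.upper().replace("U", "T")
    start = mrna.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "X")
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


# Bases surrounding the start codon of the HBA1 mRNA shown above.
FRAGMENT = "CACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCTGGGGTAAG"

if __name__ == "__main__":
    print(translate_from_first_atg(FRAGMENT))
```

The result for this fragment matches the first 17 residues of the alpha 1 globin protein record (MVLSPADKTNVKAAWGK).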

ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCTGG
GGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCACCACCAAGACCT
ACTTCCCGCACTTCGACCTGAGCCACGGCTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCGACGCGCTGACCAACGC
CGTGGCGCACGTGGACGACATGCCCAACGCGCTGTCCGCCCTGAGCGACCTGCACGCGCACAAGCTTCGGGTGGACCCGGTC
AACTTCAAGCTCCTAAGCCACTGCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGCCGAGTTCACCCCTGCGGTGCACGCCT
CCCTGGACAAGTTCCTGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACCGTTAAGCTGGAGCCTCGGTGGCCATGCTTCT
TGCCCCTTGGGCCTCCCCCCAGCCCCTCCTCCCCTTCCTGCACCCGTACCCCCGTGGTCTTTGAATAAAGTCTGAGTGGGCG
GCACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCT
GGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCACCACCAAGAC
CTACTTCCCGCACTTCGACCTGAGCCACGGCTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCGACGCGCTGACCAAC
GCCGTGGCGCACGTGGACGACATGCCCAACGCGCTGTCCGCCCTGAGCGACCTGCACGCGCACAAGCTTCGGGTGGACCCGG
TCAACTTCAAGCTCCTAAGCCACTGCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGCCGAGTTCACCCCTGCGGTGCACGC
CTCCCTGGACAAGTTCCTGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACCGTTAAGCTGGAGCCTCGGTGGCCATGCTT
CTTGCCCCTTGGGCCTCCCCCCAGCCCCTCCTCCCCTTCCTGCACCCGTACCCCCGTGGTCTTTGAATAAAGTCTGAGTGGG
CGGCACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGC
CTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCACCACCAAG
ACCTACTTCCCGCACTTCGACCTGAGCCACGGCTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCGACGCGCTGACCA
ACGCCGTGGCGCACGTGGACGACATGCCCAACGCGCTGTCCGCCCTGAGCGACCTGCACGCGCACAAGCTTCGGGTGGACCC
GGTCAACTTCAAGCTCCTAAGCCACTGCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGCCGAGTTCACCCCTGCGGTGCAC
GCCTCCCTGGACAAGTTCCTGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACCGTTAAGCTGGAGCCTCGGTGGCCATGC
TTCTTGCCCCTTGGGCCTCCCCCCAGCCCCTCCTCCCCTTCCTGCACCCGTACCCCCGTGGTCTTTGAATAAAGTCTGAGTG
GGCGGCGCCGTGGCGCACGTGGACGACATGCCCAACGCGCTGTCCGCCCTGAGCGACCTGCACGCGCACAAGCTTCGGGTGG
ACCCGGTCAACTTCAAGCTCCTAAGCCACTGCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGCCGAGTTCACCCCTGCGGT
GCACGCCTCCCTGGACAAGTTCCTGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACCGTTAAGCTGGAGCCTCGGTGGCC
ATGCTTCTTGCCCCTTGGGCCTCCCCCCAGCCCCTCCTCCCCTTCCTGCACCCGTACCCCCGTGGTCTTTGAATAAAGTCTG
AGTGGGCGGCACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAA
GGCCGCCTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCACC
ACCAAGACCTACTTCCCGCACTTCGACCTGAGCCACGGCTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCG...
And, a whole genome…

How big are whole genomes?
E. coli: 4.6 × 10^6 nucleotides
– Approx. 4,000 genes
Yeast: 15 × 10^6 nucleotides
Human: 3 × 10^9 nucleotides
Smallest human chromosome: 50 × 10^6 nucleotides

What do we actually do with bioinformatics?

From DNA to Genome
[Timeline figure, 1955 to 1985. Milestones shown: Watson and Crick DNA model; Sanger sequences insulin protein; Dayhoff’s Atlas; ARPANET (early Internet); sequence alignment; PDB (Protein Data Bank); Sanger dideoxy DNA sequencing; GenBank database; PCR (Polymerase Chain Reaction).]

[Timeline figure, 1990 to 2000. Milestones shown: SWISS-PROT database; NCBI; Human Genome Initiative; FASTA; BLAST; World Wide Web; EBI; first bacterial genome; yeast genome; first human genome draft.]

Origin of bioinformatics and
biological databases:
The first protein sequence reported was that of
bovine insulin in 1956, consisting of 51
residues.
Nearly a decade later, the first nucleic acid
sequence was reported, that of yeast
alanine tRNA with 77 bases.

In 1965, Dayhoff gathered all the available
sequence data to create the first bioinformatic
database (Atlas of Protein Sequence and
Structure).
The Protein Data Bank followed in 1972 with a
collection of ten X-ray crystallographic protein
structures. The SWISS-PROT protein sequence
database began in 1987.

Complete Genomes (as of August 2011)
Eukaryotes: 37
Prokaryotes: 1708
Total: 1745

CCTGACAAATTCGACGTGCGGCATTGCATGCAGACGTGCATG
CGTGCAAATAATCAATGTGGACTTTTCTGCGATTATGGAAGAA
CTTTGTTACGCGTTTTTGTCATGGCTTTGGTCCCGCTTTGTTC
AGAATGCTTTTAATAAGCGGGGTTACCGGTTTGGTTAGCGAGA
AGAGCCAGTAAAAGACGCAGTGACGGAGATGTCTGATG CAA
TAT GGA CAA TTG GTT TCT TCT CTG AAT ......
.............. TGAAAAACGTA

CCTGACAAATTCGACGTGCGGCATTGCATGCAGACGTGCATG
CGTGCAAATAATCAATGTGGACTTTTCTGCGATTATGGAAGAA
CTTTGTTACGCGTTTTTGTCATGGCTTTGGTCCCGCTTTGTTC
AGAATGCTTTTAATAAGCGGGGTTACCGGTTTGGTTAGCGAGA
AGAGCCAGTAAAAGACGCAGTGACGGAGATGTCTGATG CAA
TAT GGA CAA TTG GTT TCT TCT CTG AAT .................................
.............. TGAAAAACGTA
Features annotated on the sequence: TF binding site, promoter,
transcription start site, ribosome binding site
ORF = Open Reading Frame
CDS = Coding Sequence

What is a Database?
A structured collection of data held in computer storage; esp. one
that incorporates software to make it accessible in a variety of ways;
transf., any large collection of information.
database management: the organization and manipulation of data in
a database.
database management system (DBMS): a software package that
provides all the functions required for database management.
database system: a database together with a database
management system.
Oxford Dictionary

What is a database?
• A collection of data
– structured
– searchable (index) -> table of contents
– updated periodically (release) -> new edition
– cross-referenced (hyperlinks) -> links with other db
• Includes also associated tools (software) necessary
for access, updating, information insertion,
information deletion….
• Data storage management: flat files, relational
databases…

Database or databank?
Initially
• Databank (in UK)
• Database (in the USA)
Solution
• The abbreviation db

Why biological databases?
• Exponential growth in biological data.
• Data (genomic sequences, 3D structures, 2D
gel analysis, MS analysis, Microarrays….) are
no longer published in a conventional
manner, but directly submitted to databases.
• Essential tools for biological research. The
only way to publish massive amounts of data
without using all the paper in the world.

Distribution of sequences
• Books, articles 1968 -> 1985
• Computer tapes 1982 -> 1992
• Floppy disks 1984 -> 1990
• CD-ROM 1989 ->
• FTP 1989 ->
• On-line services 1982 -> 1994
• WWW 1993 ->
• DVD 2001 ->

Some statistics
• More than 1000 different ‘biological’ databases
• Variable size: <100Kb to >20Gb
– DNA: > 20 Gb
– Protein: 1 Gb
– 3D structure: 5 Gb
– Other: smaller
• Update frequency: daily to annually to seldom to forget
about it.
• Usually accessible through the web (some free, some not)

International nucleotide data banks
• EMBL (Europe): EMBL, EBI
• GenBank (USA): NLM, NCBI
• DDBJ (Japan): NIG, CIB
• Coordinated through the International Advisory Meeting and the
Collaborative Meeting
• Also shown: TrEMBL, NRDB

Databases
• NCBI (National Center for Biotechnology Information):
http://www.ncbi.nlm.nih.gov/
• EBI: http://www.ebi.ac.uk/
• DDBJ: http://www.ddbj.nig.ac.jp/
• InterPro: http://www.ebi.ac.uk/interpro/
– InterPro is a database of protein families, domains and functional sites in
which identifiable features found in known proteins can be applied to
unknown protein sequences.
Search and analytical tools
• ORF Finder: http://www.ncbi.nlm.nih.gov/gorf/gorf.html
– An analysis tool that finds all open reading frames in a user's
sequence or in a sequence already in the database.
• InterProScan server: http://www.ebi.ac.uk/InterProScan/
– InterProScan is used to search various databases of protein domains, motifs
and functional sites, and can combine other analyses such as the identification
of potential transmembrane domains and signal peptides.
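To give a feel for what an ORF-finding tool does, here is a minimal sketch. It is not the algorithm used by NCBI's ORF Finder: the function name find_orfs and the min_codons parameter are hypothetical, only the three forward frames are scanned (a real tool would also scan the reverse complement), and an ORF is taken to run from an ATG to the next in-frame stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    start and end are 0-based, end-exclusive, and include the stop codon;
    frame is 1, 2 or 3. min_codons counts codons before the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame + 1))
                start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    example = "CCATGAAATTTTAGGG"  # made-up sequence for illustration
    for start, end, frame in find_orfs(example):
        print(frame, start, end, example[start:end])
```

Raising min_codons is the usual way to suppress the many short ORFs that occur by chance.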

……
• PSORT: http://www.psort.org/
– This site provides links to the PSORT family of programs for
subcellular localization prediction as well as other datasets
and resources relevant to localization prediction.
• SignalP v3.0 Server:
http://www.cbs.dtu.dk/services/SignalP/
– SignalP aims at identifying signal peptides in eukaryotic
and bacterial query proteins.
• TMHMM v2.0 server:
http://www.cbs.dtu.dk/services/TMHMM/
– TMHMM aims at identifying transmembrane domains in
proteins (eukaryotic or prokaryotic).

 Some databases in the field of molecular biology…
AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb,
ARR, AsDb, BBDB, BCGD, Beanref, BioImage,
BioMagResBank, BIOMDB, BLOCKS, BovGBASE,
BOVMAP, BSORF, BTKbase, CANSITE, CarbBank,
CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP,
ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG,
CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb,
Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract, ECDC,
ECGC, ECO2DBASE, EcoCyc, EcoGene, EMBL, EMD db,
ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView,
GCRDB, GDB, GENATLAS, GenBank, GeneCards,
Genline, GenLink, GENOTK, GenProtEC, GIFTS,
GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB,
HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD,
HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN, HPDB,
HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat,
KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB,
Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP,
Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us,
MPDB, MRR, MutBase, MycDB, NDB, NRSub, O-GlycBase,
OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB,
PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD,
PPDB, PRESAGE, PRINTS, ProDom, Prolysis, PROSITE,
PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE,
SCOP, SeqAnalRef, SGD, SGP, SheepMap, Soybase,
SPAD, SRNA db, SRPDB, STACK, StyGene, Sub2D,
SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE,
SWISS-MODEL Repository, SWISS-PROT, TelDB, TGN, tmRDB,
TOPS, TRANSFAC, TRR, UniGene, URNADB, V BASE,
VDRR, VectorDB, WDCM, WIT, WormPep, YEPD, YPD,
YPM, etc.

Categories of databases for Life
Sciences
• Sequences (DNA, protein)
• Genomics
• Mutation/polymorphism
• Protein domain/family
• Proteomics (2D gel, Mass Spectrometry)
• 3D structure
• Metabolic networks
• Regulatory networks
• Bibliography
• Expression (Microarrays,…)
• Specialized

Literature Databases:
Bookshelf: A collection of searchable biomedical books linked to
PubMed.
PubMed: Allows searching by author names, journal titles, and a
new Preview/Index option. PubMed database provides access to
over 12 million MEDLINE citations back to the mid-1960s. It
includes History and Clipboard options which may enhance your
search session.
PubMed Central: The U.S. National Library of Medicine digital
archive of life science journal literature.
OMIM: Online Mendelian Inheritance in Man is a database of
human genes and genetic disorders (also OMIA).

.....
• BLAST is…
Basic Local Alignment Search Tool
• NCBI's sequence similarity search tool
• supports analysis of DNA and protein
databases
• 80,000 searches per day

Why use BLAST?
• BLAST searching is fundamental to understanding
the relatedness of any favourite query sequence to
other known proteins or DNA sequences.
• Applications include:
– identifying orthologs and paralogs
– discovering new genes or proteins
– discovering variants of genes or proteins
– investigating expressed sequence tags (ESTs)
– exploring protein structure and function

....
• TaxBrowser is…
• browser for the major divisions of living
organisms (archaea, bacteria, eukaryota,
viruses).
• taxonomy information such as genetic
codes.
• molecular data on extinct organisms.

What is an accession number?
• An accession number is a label that is used to identify a
sequence. It is a unique string of letters and/or numbers
that corresponds to a given molecular sequence.
• Examples:
 DNA
AF492453 GenBank genomic sequence (same at EBI)
 Protein
AAM97590 GenBank protein
Q8MV55 SwissProt protein
Non Protein Data Bank structure record
 Publication
12192407 PubMed ID - Williams et al. Nature 418: 865-9 (2002).
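Because each database uses its own style of identifier, the type of record can often be guessed from the shape of the accession number alone. The sketch below uses simplified patterns that are assumptions for illustration, not the databases' full official rules: digits only for a PubMed ID, the published UniProt accession pattern for Swiss-Prot, three letters plus five or seven digits for a GenBank protein, and one letter plus five digits or two letters plus six or eight digits for a GenBank nucleotide record. The function name classify_accession is hypothetical.

```python
import re

# Simplified patterns, checked in order; real accession rules have more cases.
PATTERNS = [
    ("PubMed ID", re.compile(r"^\d+$")),
    ("UniProt/Swiss-Prot protein", re.compile(
        r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$")),
    ("GenBank protein", re.compile(r"^[A-Z]{3}\d{5}(?:\d{2})?$")),
    ("GenBank nucleotide", re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6}(?:\d{2})?)$")),
]


def classify_accession(text):
    """Guess the record type from the form of an accession number."""
    accession = text.strip().upper().split(".")[0]  # drop any version suffix
    for label, pattern in PATTERNS:
        if pattern.match(accession):
            return label
    return "unknown"


if __name__ == "__main__":
    for example in ("AF492453", "AAM97590", "Q8MV55", "12192407"):
        print(example, "->", classify_accession(example))
```

A guess of this kind is only a convenience; the authoritative answer is whatever the database returns when the accession is looked up.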

PubMed (Medline)
• MEDLINE covers the fields of medicine, nursing, dentistry,
veterinary medicine, public health, and preclinical sciences
• Contains citations from approximately 5,200 worldwide journals in
37 languages; 60 languages for older journals.
• Contains over 20 million citations since 1948
• Contains links to biological db and to some journals
• New records are added to PreMEDLINE daily!

Type in a Query term
• Enter your search words in the
query box and hit the “Go” button
http://www.ncbi.nlm.nih.gov/entrez/query/static/help/helpdoc.html#Searching

The Syntax …
1. Boolean operators: AND, OR, NOT must be entered in
UPPERCASE (e.g., promoters OR response elements). The default
is AND.
2. Entrez processes all Boolean operators in a left-to-right sequence.
The order in which Entrez processes a search statement can be
changed by enclosing individual concepts in parentheses. The terms
inside the parentheses are processed first. For example, in the search
statement g1p3 OR (response AND element AND promoter), the three
terms in parentheses are combined before the OR is applied.
3. Quotation marks: The term inside the quotation marks is read as one
phrase (e.g. “public health” is different from public health, which will
also include articles on public latrines and their effect on health
workers).
4. Asterisk: Extends the search to all terms that start with the letters
before the asterisk. For example, dia* will include such terms as
diaphragm, dial, and diameter.
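The syntax rules above can be assembled programmatically when many queries have to be built. The helper names below (phrase, prefix, all_of, any_of, esearch_url) are hypothetical. The URL is an assumption that is not part of the lecture: it points at NCBI's E-utilities ESearch service, and the sketch only builds the address without sending any request.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities ESearch (not mentioned in the lecture).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def phrase(text):
    """Rule 3: quotation marks make the words one phrase."""
    return f'"{text}"'


def prefix(stem):
    """Rule 4: an asterisk matches every term starting with the stem."""
    return f"{stem}*"


def all_of(*terms):
    """Rules 1 and 2: uppercase AND, grouped in parentheses."""
    return "(" + " AND ".join(terms) + ")"


def any_of(*terms):
    """Rules 1 and 2: uppercase OR, grouped in parentheses."""
    return "(" + " OR ".join(terms) + ")"


def esearch_url(term, db="pubmed"):
    """Build (but do not fetch) a search URL for the given query."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


query = any_of("g1p3", all_of("response", "element", "promoter"))

if __name__ == "__main__":
    print(query)
    print(esearch_url(query))
    print(all_of(phrase("public health"), prefix("dia")))
```

Wrapping every group in parentheses makes the evaluation order explicit, so the result does not depend on the left-to-right default.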

Refine the Query
• Often a search finds too many (or too few) sequences, so you
can go back and try again with more (or fewer) keywords in
your query
• The “History” feature allows you to combine any of your past
queries.
• The “Limits” feature allows you to limit a query to specific
organisms, sequences submitted during a specific period of
time, etc.
• [Many other features are designed to search for literature in
MEDLINE]

The OMIM (Online Mendelian
Inheritance in Man)
– Genes and genetic disorders
– Edited by team at Johns Hopkins
– Updated daily

MIM Number Prefixes
* : gene with known sequence
+ : gene with known sequence and phenotype
# : phenotype description, molecular basis known
% : Mendelian phenotype or locus, molecular basis unknown
no prefix: other, mainly phenotypes with suspected Mendelian basis

Searching OMIM
• Search Fields
– Name of trait, e.g., hypertension
– Cytogenetic location, e.g., 1p31.6
– Inheritance, e.g., autosomal dominant
– Gene, e.g., coagulation factor VIII

OMIM search tags
All Fields [ALL]
Allelic Variant [AV] or [VAR]
Chromosome [CH] or [CHR]
Clinical Synopsis [CS] or [CLIN]
Gene Map [GM] or [MAP]
Gene Name [GN] or [GENE]
Reference [RE] or [REF]

Online Literature databases
1. Google Scholar
2. Google Books
3. Web of Science

1. Google Scholar
http://www.scholar.google.com/

What is Google Scholar?
Enables you to search specifically for scholarly
literature, including peer-reviewed papers,
theses, books, preprints, abstracts and technical
reports from all broad areas of research.

Use Google Scholar to find articles from a
wide variety of academic publishers,
professional societies, preprint repositories
and universities, as well as scholarly articles
available across the web.

Google Scholar orders your search results by how relevant they
are to your query, so the most useful references should appear at
the top of the page.
This relevance ranking takes into account the full text of each article,
the article's author, the publication in which the article appeared and
how often it has been cited in scholarly literature.

What other DATA can we retrieve from the record?

3. Web of Science
http://apps.webofknowledge.com.ezproxy.lib.uh.edu/WOS_GeneralSearch_input.do?product=WOS&search_mode=GeneralSearch&SID=4FB7LbbLgDMhG9fDiLh&preferencesSaved=

Genomics
• In a multicellular organism, each cell type expresses its
genes in a different way, although every cell has the same
genetic content.
• For example, all the information needed for a liver cell to be a liver
cell is also present in a nose cell, so differences in gene expression
are what distinguish the two.

Genomics - Finding Genes
• Finding a gene in sequence data is like finding a needle in a haystack
• However, whereas a needle looks different from the
haystack, genes do not look different from the rest of the
sequence data
• Within the whole array of nucleotides (nt), we try to find
and mark the borders of a set of nt as a gene
• This is one of the challenges of bioinformatics
• Neural networks and dynamic programming are
being employed

Organism        Genome size (Mb, i.e. bp × 1,000,000)    Gene number    Web site
Yeast           13.5                                      6,241          http://genome-www.stanford.edu/Saccharomyces
Fruit flies     180                                       13,601         http://flybase.bio.indiana.edu
Homo sapiens    3,000                                     45,000         http://www.ncbi.nlm.nih.gov/genome/guide

Proteomics
• The proteome is the sum total of an organism's
proteins
• More difficult than genomics (genomics vs. proteomics):
– 4 nucleotides vs. 20 amino acids
– Simple chemical makeup vs. complex chemical makeup
– DNA can be duplicated; proteins cannot
• We are entering the ‘post-genome era’
• This means that much has been done with genes,
not that the work on them is over
Proteomics…..
• The relationship between an RNA and the
protein it codes for is usually not simple
• After translation, proteins change
– So the amino acid (aa) sequence does not tell us anything
about post-translational changes
• Proteins are often not active until they are combined
into a larger complex or moved to a relevant
location inside or outside the cell
• So the aa sequence only hints at these things
• Also, proteins must be handled more carefully in
the lab, as they tend to change when in contact with
an inappropriate material

Protein Structure Prediction
• One of the biggest challenges in
bioinformatics and especially biochemistry
• No algorithm currently exists that consistently
predicts the structure of proteins

Structure Prediction methods
• Comparative Modeling
– The target protein's structure is modelled by comparison
with related proteins
– Proteins with similar sequences are searched
for known structures

Phylogenetics
• The taxonomic system reflects
evolutionary relationships
• Phylogenetic trees depict evolutionary
relationships as a picture or graph
• Rooted trees have a single common
ancestor
• Unrooted trees show only the relationships,
without specifying an ancestor
• Phylogenetic tree reconstruction
algorithms are also an area of research

Medical Implications
• Pharmacogenomics
– Not all drugs work on all patients; some otherwise good drugs
cause death in some patients
– By doing a gene analysis before treatment, the
harmful drugs can be avoided
– Also, drugs that cause death in most patients can be used
on the minority whose genes are well suited to the drug
– volunteers wanted!
– Customized treatment
• Gene Therapy
– Replace or supply the defective or missing gene
– E.g. insulin, or Factor VIII for haemophilia
• BioWeapons (??)

Diagnosis of Disease
• Identification of the genes that cause a
disease helps to detect the disease at an early stage,
e.g. Huntington disease:
– Symptoms: uncontrollable dance-like
movements, mental disturbance, personality
changes and intellectual impairment
– Death in 10-15 years
– The gene responsible for the disease has been
identified
– It contains excessively repeated sections of CAG
– So once a couple has been tested, they can be counselled
Drug Design
• Can take up to 15 years and cost $700 million
• One of the goals of bioinformatics is to
reduce the time and cost involved.
• The process
– Discovery
• Computational methods can improve this
– Testing

Discovery
Target identification
– Identifying the molecule on which the
germ relies for its survival
– Then we develop another molecule i.e.
drug which will bind to the target
– So the germ will not be able to interact
with the target.
– Proteins are the most common targets

Discovery…
• For example, HIV produces HIV protease,
which is a protein that in turn cleaves
other proteins
• This HIV protease has an active site
where it binds to other molecules
• So an HIV drug will go and bind to that
active site
– Easier said than done!

Discovery…
• Lead compounds are the molecules that
go and bind to the target protein’s active
site
• Traditionally this has been a trial and error
method
• Now this is being moved into the realm of
computers

Restriction Analysis of DNA
• Special enzymes termed restriction enzymes have been discovered in
many different bacteria and other single-celled organisms. These
enzymes act as chemical scissors to cut λ DNA into pieces.
• They are able to scan along a length of DNA looking for a particular
sequence of bases that they recognize.
• This recognition site or sequence is generally from 4 to 6 base pairs in
length. Once it is located, the enzyme will attach to the DNA molecule
and cut each strand of the double helix. This is the first step in a process
called restriction mapping.
• The restriction enzyme will continue to do this along the full length of the
DNA molecule which will then break into fragments. The size of these
fragments is measured in base pairs or kilobase pairs (1000 base pairs).
• Since the recognition site or sequence of base pairs is known for each
restriction enzyme, we can use this to form a detailed analysis of the
sequence of bases in specific regions of the DNA in which we are
interested.
• This procedure is one of the most important in modern biology.

.... Restriction analysis
• In the presence of specific DNA repair enzymes, DNA
fragments will re-anneal or stick themselves to other fragments
with cut ends that are complementary to their own end
sequence.
• It doesn’t matter if the fragment that matches the cut end
comes from the same organism or from a different one.
• This ability of DNA to repair itself has been utilized by scientists
to introduce foreign DNA into an organism.
• This DNA may contain genes that allow the organism to exhibit
a new function or process. This would include transferring
genes that will result in a change in the nutritional quality of a
crop or perhaps allow a plant to grow in a region that is colder
than its usual preferred area.

Example: Restriction Digestion and
Analysis of DNA from Bacteriophage λ
• The genome of this small virus is 48,502 base pairs in length, which is
very small compared with the human genome of approximately 3
billion base pairs.
• Since the whole sequence of λ is already known we can predict
where each restriction enzyme will cut and thus the expected
size of the fragments that will be produced.
• If the virus DNA is exposed to the restriction enzyme for only a
short time, then not every restriction site will be cut by the
enzyme.
• This will result in fragments ranging in size from the smallest
possible (all sites are cut) to in-between lengths (some of the
sites are cut) to the longest (no sites are cut). This is termed a
partial restriction digestion.
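The difference between a complete and a partial digestion can be worked through numerically. In the sketch below the cut positions are hypothetical (they are not the real sites of any enzyme in λ), the function names are illustrative, and the DNA is treated as a linear molecule of 48,502 bp.

```python
from itertools import combinations

LAMBDA_LENGTH = 48502  # genome length given in the lecture

# Hypothetical cut positions, chosen only to illustrate the arithmetic.
HYPOTHETICAL_CUTS = [10000, 30000]


def complete_digest(length, cut_positions):
    """Fragment sizes when every site is cut (linear molecule)."""
    bounds = [0] + sorted(set(cut_positions)) + [length]
    return [b - a for a, b in zip(bounds, bounds[1:])]


def partial_digest(length, cut_positions):
    """All fragment sizes that can appear when only some sites are cut.

    Any stretch between two boundaries (a cut site or a molecule end)
    can survive as a fragment, so every pair of boundaries is considered.
    """
    bounds = [0] + sorted(set(cut_positions)) + [length]
    return sorted({b - a for a, b in combinations(bounds, 2)})


if __name__ == "__main__":
    print("complete:", complete_digest(LAMBDA_LENGTH, HYPOTHETICAL_CUTS))
    print("partial: ", partial_digest(LAMBDA_LENGTH, HYPOTHETICAL_CUTS))
```

With two sites, a complete digestion gives three fragments, while a partial digestion can show up to six different sizes, including the uncut molecule.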

.....
• After overnight digestion, the reaction is
stopped by addition of a loading buffer.
• The DNA fragments are separated by
electrophoresis, a process that involves
application of an electric field to cause the
DNA fragments to migrate into an agarose
gel.
• The gel is then stained with a methylene
blue stain to visualize the DNA bands and
may be photographed.

.....
• The movement of the fragments during electrophoresis
will always be towards the positive electrode because
DNA is a negatively charged molecule.
• The fragments move through the gel at a rate that is
determined by their size and shape, with the smallest
moving the fastest.
• DNA cannot be seen as it moves through the gel. That is
why a loading dye must be added to each of the samples
before it is pipetted into the wells.
• The progress of the dye can be seen in the gel. It will
initially appear as a blue band, eventually resolving into
two bands of different colours.

......
• Restriction enzymes cut at specific sites along the DNA. These sites
are determined by the sequence of bases which usually form
palindromes.
• Palindromes are groups of letters that read the same in both the
forward and backwards orientation.
• In the case of DNA the letters are found on both the forward and the
reverse strands of the DNA.
• For example, the 5’ to 3’ strand may have the sequence GAATTC.
The complementary bases on the opposite strand will be CTTAAG (written 3’ to 5’),
which is the same as reading the first strand backwards! Read 5’ to 3’, the
opposite strand is therefore also GAATTC.
• Many enzymes recognize these types of sequences and will attach to
the DNA at this site and then cut the strand between two of the
bases. In this example, the DNA was digested with BamHI,
EcoRI and HindIII restriction enzymes, and their sequences are
as follows, with the cut site indicated by the arrow.
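The palindrome property described above is easy to test in code: a site is palindromic in the DNA sense when it equals its own reverse complement. The sketch below uses the well-known recognition sequences of the three enzymes (EcoRI GAATTC, HindIII AAGCTT, BamHI GGATCC). The function names and the short test sequence are made up for illustration, and positions are reported as 0-based indices of where the recognition sequence starts, not the exact cut point.

```python
RECOGNITION_SITES = {
    "EcoRI": "GAATTC",
    "HindIII": "AAGCTT",
    "BamHI": "GGATCC",
}


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def is_palindromic(site):
    """True if both strands read the same in the 5' to 3' direction."""
    return site == reverse_complement(site)


def find_sites(seq, site):
    """Return every 0-based start position of site in seq."""
    seq = seq.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    return positions


if __name__ == "__main__":
    toy = "TTGAATTCAAGGATCCGAATTC"  # made-up sequence, not from λ
    for enzyme, site in RECOGNITION_SITES.items():
        print(enzyme, site, is_palindromic(site), find_sites(toy, site))
```

Because the sites are palindromic, searching one strand is enough to locate them on both.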

λ cut with EcoRI λ cut with HindIII λ cut with BamHI

Assignment: Using the graph in
the next slide, address the following
• Calculate the sizes of the fragments that will result from
digestion and write them on the map.
• How many fragments would you expect to see for each of the
maps A, B and C?
• Draw these fragments onto the graph in the next slide.
• Now compare the size of the fragments that you have
calculated with the bands shown in the photographs of the
gels and determine which of the enzymes, BamHI, EcoRI and
HindIII were used to cut A, B and C.
• How many times does the sequence GAATTC occur in the λ
DNA sequence? What about AAGCTT and GGATCC?
