3. Types of data available
• Enormous amounts of data available publicly
– DNA/RNA Sequence
– SNPs
– Protein Sequence
– Protein Structure
– Protein Function
– Organism‐specific Databases
– Genomes
– Gene Expression
– Biomolecular Interactions
– Molecular Pathways
– Scientific Literature
– Disease Information
4. Basic concepts in Molecular Biology
• Biological macromolecules (DNA, RNA, and proteins) drive the
functioning of the whole organism as well as the evolutionary engine.
• Understanding of the basis of life is fundamental to understanding
how genetic information shapes life and drives its evolution.
• According to the central dogma of molecular biology, DNA can be copied to
DNA (DNA replication), DNA information can be copied into
mRNA (transcription), and proteins can be synthesized using the
information in mRNA as a template (translation).
• The special transfers describe: RNA being copied from RNA (RNA
replication), DNA being synthesised using an RNA template (reverse
transcription), and proteins being synthesised directly from a DNA
template without the use of mRNA.
• The transfer of information from nucleic acid to nucleic acid, or from
nucleic acid to protein may be possible, but transfer from protein to
protein, or from protein to nucleic acid is impossible.
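As a rough illustration of the transfers listed above, the following Python sketch stores the general and special transfers in two lookup tables and treats every other combination as not possible. The names GENERAL_TRANSFERS, SPECIAL_TRANSFERS, and classify_transfer are illustrative assumptions, not part of any library.

```python
# Information transfers named in the text, keyed by (source, target).
GENERAL_TRANSFERS = {
    ("DNA", "DNA"): "DNA replication",
    ("DNA", "RNA"): "transcription",
    ("RNA", "protein"): "translation",
}
SPECIAL_TRANSFERS = {
    ("RNA", "RNA"): "RNA replication",
    ("RNA", "DNA"): "reverse transcription",
    ("DNA", "protein"): "protein synthesis directly from a DNA template",
}


def classify_transfer(source, target):
    """Label a transfer as general, special, or not possible."""
    key = (source, target)
    if key in GENERAL_TRANSFERS:
        return "general: " + GENERAL_TRANSFERS[key]
    if key in SPECIAL_TRANSFERS:
        return "special: " + SPECIAL_TRANSFERS[key]
    # Any transfer that starts from protein ends up here.
    return "not possible"


print(classify_transfer("DNA", "RNA"))
print(classify_transfer("protein", "DNA"))
```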
5. The Nucleic Acids
• There are two kinds of nucleic acids, deoxyribonucleic acid (DNA) and
ribonucleic acid (RNA).
• The nucleic acids (DNA and RNA) are the molecular repositories for
genetic information and are jointly referred to as the ‘molecules of
heredity’.
• The structure of every protein, and ultimately of every cell constituent, is
a product of information programmed into the nucleotide sequence of a
cell’s nucleic acids.
• Deoxyribonucleic acid (DNA) is a macromolecule that carries genetic
information from generation to generation.
• With some exceptions, deoxyribonucleic acid (DNA) is the universal
genetic material.
• In some viruses, termed RNA viruses, RNA is the genetic material.
• Viruses that have single- or double-stranded RNA genomes throughout
their life cycle are called riboviruses.
• Viruses that are RNA-based for only a portion of their life
cycle are called retroviruses.
6. Structural Units of DNA
• DNA is a double-stranded right-handed helix; the two strands are
complementary because of complementary base pairing, and antiparallel
because the two strands have opposite 5'-3' orientation.
• The diameter of the helical DNA molecule is 20 Å (= 2 nm).
• The helical conformation of DNA creates the alternate major groove and
minor groove.
• DNA is composed of structural units called nucleotides
(deoxyribonucleotides). Each nucleotide is composed of a pentose sugar
(2'-deoxy-D-ribose); one of the four nitrogenous bases: (Purines) adenine
(A) or guanine (G), (Pyrimidines) thymine (T) or cytosine (C); and a
phosphate.
• The pentose sugar has five carbon atoms and they are numbered 1'
(1-prime) through 5' (5-prime).
• The base is attached to the 1' carbon atom of the sugar, and the phosphate
is attached to the 5' carbon atom.
• The sugar and base form a nucleoside, whereas nucleoside plus phosphate
makes a nucleotide.
7. Linkage between Nucleotides
• The nucleotides are joined by 5'-3' phosphodiester linkage; that is, the
5'-phosphate of a nucleotide is linked to the 3'-OH of the preceding
nucleotide by a phosphodiester linkage.
• In a linear DNA molecule, the 5'-end has a free phosphate and the 3'-end
has a free OH group.
• Each phosphodiester bond has two sides: a 3'-side that is linked to the
3'-end of the preceding nucleotide, and a 5'-side that is linked to the 5'-end of
the following nucleotide.
• The 3'-side is called the A side by convention and its cleavage generates a
5'-PO4 product.
• The 5'-side is called the B side by convention and its cleavage generates a
3'-PO4 product.
8. Base-Pairing Rules in DNA
• In the double-stranded DNA, A pairs with T by two hydrogen bonds
and G pairs with C by three hydrogen bonds.
• Thus GC-rich regions of DNA have more hydrogen bonds and
consequently are more resistant to thermal denaturation.
• In the helical double-stranded DNA molecule, the sugar-phosphate
backbone lies outside and the bases are inside.
• In double-stranded DNA, a purine pairs with a pyrimidine (A with
T and G with C). Therefore, total amount of purine should equal
total amount of pyrimidine (Chargaff’s rule).
• In the bases, the side with the N1 position of the heterocyclic ring is
the “front,” also called the Watson-Crick edge.
• The opposite side is the “back,” also called the Hoogsteen edge.
• The Hoogsteen edge of the bases is located towards the edge
(outside) of the DNA double helix, whereas the Watson-Crick edge
is internal.
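The pairing rules above can be sketched in a few lines of Python. This is a minimal illustration, assuming an uppercase sequence made only of A, C, G, and T; the function names are hypothetical, and the example strand AAAGTCTGAC is chosen only for demonstration.

```python
# Watson-Crick partners and hydrogen bonds per base pair, as stated above.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
H_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}


def reverse_complement(strand):
    """Return the antiparallel partner strand, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def hydrogen_bonds(strand):
    """Count hydrogen bonds in the duplex formed by the strand and its partner."""
    return sum(H_BONDS[base] for base in strand)


def gc_fraction(strand):
    """Fraction of positions that are G or C."""
    return (strand.count("G") + strand.count("C")) / len(strand)


def purine_pyrimidine_counts(strand):
    """Count purines and pyrimidines over both strands of the duplex."""
    duplex = strand + reverse_complement(strand)
    purines = sum(duplex.count(base) for base in "AG")
    pyrimidines = sum(duplex.count(base) for base in "CT")
    return purines, pyrimidines


strand = "AAAGTCTGAC"
print(reverse_complement(strand))        # GTCAGACTTT
print(hydrogen_bonds(strand))            # 24
print(purine_pyrimidine_counts(strand))  # (10, 10)
```

The equal purine and pyrimidine counts follow directly from the pairing table, which is the reasoning behind Chargaff's rule; a GC-rich strand of the same length scores more hydrogen bonds than an AT-rich one.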
12. Genes and Genetics
• Gene: The basic unit of heredity. A sequence of DNA nucleotides on a
chromosome that codes for a polypeptide or RNA molecule and thus determines
an individual’s inherited traits.
• In physical terms, the gene is defined as the coding region of DNA that
determines a protein product.
• Marker (locus): A specific position in chromosome. It may be 1 bp or several
hundred bps in length.
• Alleles: Alternative DNA sequences that can occur at a marker or locus.
• Gene expression: Transcription and, in the case of proteins, translation to yield
the product of a gene; a gene is expressed when its biological product is present
and active.
• Genetic code: The “language” of the genes. The set of triplet code words in
DNA (or mRNA) that code for the amino acids of proteins.
• Of the 64 possible codons, 61 code for amino acids, and the remaining
three are termination codons that are not translated.
• With a few minor exceptions, all living beings use the same code, i.e., the
genetic code is universal.
• Genome: Total genetic information encoded in a cell or an organism, or a virus.
• Genetics: The study of genes, genetic variations and heredity in living
organisms.
• Genetic map: A diagram showing the relative sequence and position of specific
genes along a chromosome.
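The codon counts given above can be checked with a short Python sketch. It builds the standard genetic code from the compact two-string layout (bases in the order T, C, A, G; one-letter amino acid codes, with * marking termination). The variable names are illustrative assumptions, and only the standard code is covered, not the minor exceptions mentioned in the text.

```python
from itertools import product

# Standard genetic code in DNA letters, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Termination codons are marked with "*"; all other codons specify an amino acid.
stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
sense_codons = [c for c, aa in CODON_TABLE.items() if aa != "*"]
amino_acids = set(AMINO_ACIDS) - {"*"}

print(len(CODON_TABLE))   # 64
print(len(sense_codons))  # 61
print(stop_codons)        # ['TAA', 'TAG', 'TGA']
print(len(amino_acids))   # 20
```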
14. Typical Eukaryotic Gene Structure
Transcribed Region
• For any given gene, one of the two strands of DNA is transcribed,
the other is not.
• The DNA strand that is NOT transcribed is called the sense or plus
(+), or coding strand because it has the same sequence as that of the
mRNA (except for U in RNA and T in DNA).
• The strand that is transcribed is called the template or antisense or
minus (-) or noncoding strand because its sequence is
complementary to the coding sequence.
• A typical mRNA-coding eukaryotic gene has three major parts: a
transcribed region, a 5'-flanking region, and a 3'-flanking region.
• In eukaryotes, different types of RNAs are transcribed from the
DNA by different RNA polymerases: RNA polymerase I (pol I)
transcribes ribosomal RNA (rRNA), RNA polymerase II (pol II)
transcribes messenger RNA (mRNA), RNA polymerase III (pol III)
transcribes transfer RNA (tRNA).
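The relationship between the coding strand, the template strand, and the mRNA can be sketched as follows. This is a minimal Python illustration, assuming uppercase DNA written 5' to 3'; the function names and the example sequence are hypothetical.

```python
# Translation tables for base pairing: DNA to DNA, and DNA template to RNA.
DNA_TO_DNA = str.maketrans("ACGT", "TGCA")
DNA_TO_RNA = str.maketrans("ACGT", "UGCA")


def template_strand(coding):
    """Return the template (antisense) strand, written 5' to 3'."""
    return coding.translate(DNA_TO_DNA)[::-1]


def transcribe(template):
    """Return the mRNA (5' to 3') made by base pairing against the template."""
    return template.translate(DNA_TO_RNA)[::-1]


coding = "ATGGCCTAA"
template = template_strand(coding)
mrna = transcribe(template)

print(template)  # TTAGGCCAT
print(mrna)      # AUGGCCUAA
```

The mRNA matches the coding strand with U in place of T, which is why the non-transcribed strand is called the sense strand.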
18. Typical Eukaryotic Gene Structure
Splicing and Post Transcriptional Modification
• For mRNA, the primary transcript that contains both exons and introns is
called the heterogeneous nuclear RNA (hnRNA) or pre-mRNA.
• The hnRNA is processed to remove the introns (splicing), add a 7-methyl
guanine cap at the 5'-end by 5'-5' linkage, and add a poly(A) tail at the
3'-end, which is about 200 nucleotides long in mammals.
• Most introns in genes have GT at the 5'-splice site (in the DNA sense
strand; hence GU in the hnRNA), called the splice donor site, and AG at
the 3'-splice site, called the splice acceptor site.
• These introns are referred to as GT-AG introns.
• However, introns may also have GC or AT at the splice donor site, with
AC at the splice acceptor site in the latter case (hence, GC-AG introns and AT-AC introns).
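As a small illustration of the splice-site classes above, the Python sketch below labels an intron by its first and last two nucleotides. It assumes the intron is given as sense-strand DNA from the first to the last intronic base; classify_intron and the example sequences are hypothetical.

```python
# Intron classes keyed by (donor dinucleotide, acceptor dinucleotide).
INTRON_CLASSES = {
    ("GT", "AG"): "GT-AG",
    ("GC", "AG"): "GC-AG",
    ("AT", "AC"): "AT-AC",
}


def classify_intron(intron):
    """Classify an intron (sense-strand DNA) by its terminal dinucleotides."""
    intron = intron.upper()
    if len(intron) < 4:
        raise ValueError("intron too short to hold both splice sites")
    donor, acceptor = intron[:2], intron[-2:]
    return INTRON_CLASSES.get((donor, acceptor), "other")


print(classify_intron("GTAAGTCCCTTTCAG"))  # GT-AG
print(classify_intron("ATATCCTTTTAC"))     # AT-AC
```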
20. 5'-Flanking Region of Transcribed Genes
• A region of DNA which is NOT transcribed into RNA, but rather is adjacent to the 5'
end of the gene and contains the promoter, enhancers, or other protein-binding sites.
• The promoter: specific sequences for binding the proteins necessary for
transcription by RNA pol. (RNA Pol II in eukaryotes)
• TATA box (consensus 5'-TATAAA-3'): located 25-30 bp upstream of the
transcription start site (-25 to -30 bp position).
• TATA-less promoters: Mediated by the initiator element (Inr; consensus Y-Y-A(+1)-N-T/A-Y-Y,
where Y is a pyrimidine, A(+1) is the TSS, and N is any nucleotide) and the downstream
promoter element (DPE).
• DPE: consensus (A/G)(+28)-G-(A/T)-(C/T)-(G/A/C)(+32), located downstream of the TSS.
• GC box: Enhancer sequences (GGGCGG) occurring upstream of the TSS.
• Based on their distance from the transcription start site (TSS), the regions of the
promoter have been termed the core promoter (-35 to +35), proximal promoter, and
distal promoter.
• Enhancer: DNA sequences that can be bound by proteins (activators) to increase
the rate of transcription.
• Silencers: DNA sequences that can be bound by proteins (repressors) to
stop or decrease the rate of transcription.
• Insulators: Cis-regulatory elements that act as long-range regulatory elements,
found at a distance from the promoter element. Insulators contain clustered
binding sites for sequence specific DNA-binding proteins and mediate intra- and
inter-chromosomal interactions, thereby functioning either as an enhancer-
blocker or a barrier, or both.
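To make the promoter coordinates concrete, the Python sketch below scans the region upstream of a known TSS for the TATA box consensus. It is a simplification under stated assumptions: the TSS index is supplied by the caller, only exact matches to TATAAA are reported, and the search window of -35 to -20 is an arbitrary choice that brackets the 25-30 bp distance given above. The function name and the example sequence are hypothetical.

```python
def find_tata_box(sequence, tss_index, motif="TATAAA", window=(-35, -20)):
    """Return start positions of the motif relative to the TSS (+1).

    sequence  -- sense-strand DNA, 5' to 3'
    tss_index -- 0-based index of the +1 nucleotide in the sequence
    window    -- upstream range (relative to the TSS) in which a match may start
    """
    hits = []
    first = max(0, tss_index + window[0])
    last = tss_index + window[1]
    for i in range(first, last + 1):
        if sequence[i:i + len(motif)] == motif:
            # Upstream of the TSS, the offset equals the conventional position.
            hits.append(i - tss_index)
    return hits


# Toy promoter: a TATA box starting 30 bp upstream of the +1 nucleotide.
promoter = "G" * 20 + "TATAAA" + "C" * 24 + "A" + "G" * 10
print(find_tata_box(promoter, tss_index=50))  # [-30]
```

Real promoters often deviate from the consensus, so a position weight matrix would normally replace the exact string comparison used here.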
22. 3'-Flanking Region of Transcribed Genes
A region of DNA which is NOT copied into the mature mRNA, but
which is present adjacent to the 3' end of the gene.
It may contain the transcription termination signal.
In Eukaryotes, transcription termination is facilitated by a number of
protein factors (Cleavage and Polyadenylation Specificity Factor
(CPSF), Cleavage Stimulation Factor (CStF), etc.) that become
associated with the pol II as soon as the enzyme leaves the promoter.
23. Ribonucleic acid (RNA)
• RNA is a polymeric molecule essential in various biological roles in
coding, decoding, regulation, and expression of genes.
• Three types of RNA associated with protein synthesis: ribosomal
RNA (rRNA), messenger RNA (mRNA), and transfer RNA (tRNA),
of which rRNA and tRNA are noncoding, whereas mRNA is protein
coding.
• Other noncoding RNAs: snRNA (small nuclear RNA),
snoRNA (small nucleolar RNA), gRNA (guide RNA), and the long noncoding
RNAs (lncRNAs) Xist (X inactive-specific transcript) and Tsix (an antisense regulator of Xist).
• snRNAs are essential for mRNA splicing, snoRNAs are important in
methylation of rRNAs, gRNAs are essential in RNA editing,
whereas Xist, Tsix, are involved in the epigenetic regulation of gene
expression
• Small noncoding RNAs (20-30 nt long) are powerful regulators of
gene expression. Examples include microRNA (miRNA,
abbreviated as miR), small interfering RNA (siRNA), and Piwi-
interacting RNA (piRNA).
24. Nucleic acid sequence
• A nucleic acid sequence is a succession of letters that indicate the
order of nucleotides forming alleles within a DNA (using GACT)
or RNA (GACU) molecule.
• By convention, sequences are usually presented from the 5' end to
the 3' end.
• For DNA, the sense strand is used.
• The possible letters are A, C, G, and T, representing the four
nucleotide bases of a DNA strand — adenine, cytosine, guanine,
thymine — covalently linked to a phosphodiester backbone.
• In the typical case, the sequences are printed abutting one another
without gaps, as in the sequence AAAGTCTGAC, read left to right
in the 5' to 3' direction.
• With regard to transcription, a sequence is on the coding strand if
it has the same order as the transcribed RNA.
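The letter conventions above suggest a simple check of whether a string is a plausible DNA or RNA sequence. The Python sketch below is illustrative only: sequence_type is a hypothetical helper, ambiguity codes are not handled, and a sequence with neither T nor U is reported as DNA by default.

```python
DNA_LETTERS = set("ACGT")
RNA_LETTERS = set("ACGU")


def sequence_type(sequence):
    """Return "DNA", "RNA", or "invalid" based on the letters used."""
    letters = set(sequence.upper())
    if letters <= DNA_LETTERS:
        return "DNA"
    if letters <= RNA_LETTERS:
        return "RNA"
    return "invalid"


print(sequence_type("AAAGTCTGAC"))  # DNA
print(sequence_type("AAAGUCUGAC"))  # RNA
print(sequence_type("AAXG"))        # invalid
```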
26. Translation
• Translation is the final step of the central dogma, i.e., the synthesis of proteins
directed by an mRNA template.
• The information contained in the mRNA is read as three letter words
(triplets), called codons.
• During translation amino acids are linked together to form a polypeptide
chain which will later be folded into a protein.
• The tRNA carries an amino acid at one end and has a triplet of
nucleotides, an anticodon, at the other end.
• The anticodon of a tRNA molecule can base pair, i.e., form chemical
bonds, with the mRNA's three-letter codon.
• Thus the tRNA acts as the translator between mRNA and protein by
bringing the specific amino acid coded for by the mRNA codon.
• Several regions of the mRNA are not translated into protein, including the
5' and 3' UTRs.
• The 5' UTR is called the leader sequence, while the 3' UTR is called the trailer
sequence.
• The 5' UTR contains a sequence that is recognized by the ribosome to bind the
mRNA and initiate translation.
• The 3' UTR is found immediately following the translation stop codon
and plays a critical role in translation termination as well as post-
transcriptional gene expression.
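Codon-by-codon reading of an mRNA can be sketched in Python as follows. This is a simplified model under stated assumptions: translation starts at the first AUG, uses the standard genetic code, and stops at the first in-frame termination codon; everything before the AUG stands in for the 5' UTR and everything after the stop codon for the 3' UTR. The names and the example mRNA are hypothetical.

```python
from itertools import product

# Standard genetic code in RNA letters (codons ordered UUU, UUC, UUA, UUG, ...).
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate from the first AUG to the first in-frame stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, nothing is translated
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break  # termination codon: not translated
        protein.append(amino_acid)
    return "".join(protein)


# "GG" before AUG and "GG" after UAA play the role of untranslated regions.
print(translate("GGAUGGCCUUUUAAGG"))  # MAF
```

The result is written from the N-terminal to the C-terminal residue, matching the direction in which the mRNA is read.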
27. Proteins and Amino Acids
• Proteins (polypeptides) are translated from the mRNA, which
carries the amino acid sequence information for the polypeptide.
• Translation proceeds from the N-terminal to C-terminal direction of
the polypeptide being synthesized. Proteins are made up of
structural units called amino acids. All amino acids are α-amino
acids.
• They are called α-amino acids because the amino group (α-NH2) is
attached to the α-carbon atom—that is, the carbon atom linked to
the carbonyl carbon of the carboxyl group (α-COOH).
29. Protein Structure
1. Primary Structure: The sequence of residues linked
together via peptide bond to make up the protein.
2. Secondary Structure: Representing the local folding
pattern of the polypeptide backbone stabilized by hydrogen
bonds between N-H and C=O groups, e.g., the α helix and the β
sheet.
3. Tertiary Structure: The folding and refolding of secondary
structure in the 3D arrangement of a polypeptide chain,
including the α helices, the β sheets, and any other loops and
folds.
4. Quaternary structure: The number and arrangement of
the individual polypeptide chains in a multi-chain protein.
31. Structural Components of a Protein
• The N-terminus (amino-terminus, NH2-terminus, N-terminal end or
amine-terminus): The start of a protein or polypeptide referring to the free
amine group (-NH2) located at the end of a polypeptide.
• C-terminus (carboxyl-terminus, carboxy-terminus, C-terminal tail, C-
terminal end, or COOH-terminus): The end of an amino acid chain
(protein or polypeptide), terminated by a free carboxyl group (-COOH).
• Loops and turns: Connect α helices and β strands. The most common types
cause a change in direction of the polypeptide chain allowing it to fold back
on itself to create a more compact structure.
• Motif: An element of structure or pattern that recurs in many contexts;
specifically, a small structural domain that can be recognized in a variety of
proteins.
• Domain: A distinct structural unit of a polypeptide, which may be encoded
separately by a specific exon; domains may have separate functions and
may fold as independent, compact units. Large globular proteins often
consist of several domains, which are connected to each other by stretches
of relatively extended polypeptide.
32. Mutations
Heritable changes in the nucleotide sequence of genomic DNA/RNA
that can produce a mutant protein after transcription and translation.
Types of Mutations
1. Point Mutations: These mutations involve a base substitution, which
occurs when a single nucleotide is replaced with a different
nucleotide.
i. Silent Mutation: A nucleotide substitution that causes no change in the activity of the protein.
ii. Missense Mutation: A nucleotide substitution that changes a codon
so that it codes for a different amino acid in the protein.
iii. Nonsense Mutation: A nonsense mutation is the same as a missense
mutation except the resulting codon codes for a STOP signal.
2. Frameshift mutations: Caused by the insertion or deletion of one or more base
pairs, in a number that is not a multiple of three. An inserted or deleted nucleotide alters the triplet grouping of
nucleotides into codons and shifts the reading frame so that all
nucleotides downstream from the mutation will be improperly grouped.
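The mutation types above can be illustrated with a short Python sketch based on the standard genetic code. Two simplifying assumptions apply: a substitution is called silent when the encoded amino acid is unchanged (the usual working proxy for "no change in the activity of the protein"), and a change that converts a stop codon into an amino acid codon is simply reported as missense. The function names and example codons are hypothetical.

```python
from itertools import product

# Standard genetic code in DNA letters; "*" marks a termination codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon, position, new_base):
    """Classify a single-base substitution within one codon."""
    mutant = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutant]
    if after == before:
        return "silent"
    if after == "*":
        return "nonsense"
    return "missense"


def split_codons(sequence):
    """Group a sequence into complete triplets, dropping leftover bases."""
    usable = len(sequence) - len(sequence) % 3
    return [sequence[i:i + 3] for i in range(0, usable, 3)]


print(classify_substitution("GAA", 2, "G"))  # silent   (Glu -> Glu)
print(classify_substitution("GAA", 1, "T"))  # missense (Glu -> Val)
print(classify_substitution("TAC", 2, "A"))  # nonsense (Tyr -> stop)

# Frameshift: one inserted base regroups every downstream codon.
original = "ATGGCCTTTTAA"
shifted = original[:3] + "A" + original[3:]
print(split_codons(original))  # ['ATG', 'GCC', 'TTT', 'TAA']
print(split_codons(shifted))   # ['ATG', 'AGC', 'CTT', 'TTA']
```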
33. Important Terminologies
• Homolog: A gene related to a second gene by descent from a
common ancestral DNA sequence. The term, homolog, may
apply to the relationship between genes separated by the event
of speciation or to the relationship between genes separated by
the event of genetic duplication.
• Ortholog: Orthologs are genes in different species that evolved
from a common ancestral gene by speciation. Normally,
orthologs retain the same function in the course of evolution.
Identification of orthologs is critical for reliable prediction of
gene function in newly sequenced genomes.
• Speciation: Speciation is the origin of a new species capable of
making a living in a new way from the species from which it
arose. As part of this process it has also acquired some barrier
to genetic exchange with the parent species.
• Paralog: Paralogs are genes related by duplication within a
genome. Orthologs retain the same function in the course of
evolution, whereas paralogs evolve new functions, even if
these are related to the original one.
34. Important Terminologies
Homologous proteins: Proteins having sequences and functions
similar in different species, for example, the hemoglobins.
Homology: Similarity in structure of an organ or a molecule, reflecting
a common evolutionary origin, specifically such a similarity in protein
or nucleic acid sequence; structures related by homology are
homologous and are called homologues.
Analog: A similar gene or protein that does not reflect a common
evolutionary origin.
Genotype: At a specific locus there is an allele in each of the two
homologous chromosomes. The two alleles together are called
genotype.
Haplotype: Sequence of alleles along a chromosome
Phenotype: Observable feature/trait, such as height, eye color, etc.