PROTEIN SEQUENCING
Introduction to Protein Sequencing
What is Protein
• Any of a class of nitrogenous organic compounds which
have large molecules composed of one or more long
chains of amino acids and are an essential part of all
living organisms, especially as structural components of
body tissues such as muscle, hair, etc., and as enzymes
and antibodies.
What is sequence
• a particular order in which related things follow each other.
• a set of related events, movements, or items that follow
each other in a particular order.
Protein Sequencing
• Protein sequencing is the practical process of
determining the amino acid sequence of all or part of a
protein or peptide. This may serve to identify the protein
or characterize its post-translational modifications.
• Typically, partial sequencing of a protein provides
sufficient information (one or more sequence tags) to
identify it with reference to databases of protein
sequences derived from the conceptual translation of
genes.
• The two major direct methods of protein sequencing are
mass spectrometry and Edman degradation using a
protein sequenator (sequencer). Mass spectrometry
methods are now the most widely used for protein
sequencing and identification but Edman degradation
remains a valuable tool for characterizing a protein's N-
terminus.
Why do we do protein sequencing?
• Determining amino acid composition.
• It is often desirable to know the unordered amino acid
composition of a protein before attempting to find the
ordered sequence, as this knowledge can be used to
catch errors in the sequencing process or to distinguish
between ambiguous results.
• Knowledge of the frequency of certain amino acids may
also be used to choose which protease to use for
digestion of the protein. The misincorporation of low levels
of non-standard amino acids (e.g. norleucine) into
proteins may also be determined.
• A generalized method often referred to as amino acid
analysis for determining amino acid frequency is as
follows:
• Hydrolyse a known quantity of protein into its constituent
amino acids.
• Separate and quantify the amino acids in some way.
• Hydrolysis
• Hydrolysis is done by heating a sample of the protein in 6
M hydrochloric acid to 100–110 °C for 24 hours or longer.
Proteins with many bulky hydrophobic groups may require
longer heating periods. However, these conditions are so
vigorous that some amino acids (serine, threonine,
tyrosine, tryptophan, glutamine, and cysteine) are
degraded. To circumvent this problem, Biochemistry Online
suggests heating separate samples
for different times, analysing each resulting solution, and
extrapolating back to zero hydrolysis time. Rastall
suggests a variety of reagents to prevent or reduce
degradation, such as thiol reagents or phenol to protect
tryptophan and tyrosine from attack by chlorine, and pre-
oxidising cysteine. He also suggests measuring the
quantity of ammonia evolved to determine the extent of
amide hydrolysis.
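The zero-time extrapolation described above can be sketched as a straight-line fit. This is a minimal illustration that assumes the loss is roughly linear over the sampled times; the function name and the serine values are hypothetical, not measured data.

```python
def extrapolate_to_zero(times_h, recoveries):
    """Fit a least-squares line through (time, recovery) points.

    Returns (intercept, slope); the intercept is the estimated amount
    of the amino acid at zero hydrolysis time.
    """
    n = len(times_h)
    mean_t = sum(times_h) / n
    mean_r = sum(recoveries) / n
    sxx = sum((t - mean_t) ** 2 for t in times_h)
    sxy = sum((t - mean_t) * (r - mean_r) for t, r in zip(times_h, recoveries))
    slope = sxy / sxx
    intercept = mean_r - slope * mean_t
    return intercept, slope


# Hypothetical serine recoveries (nmol) after 24, 48 and 72 h of hydrolysis.
times = [24.0, 48.0, 72.0]
serine = [9.0, 8.0, 7.0]
zero_time, slope = extrapolate_to_zero(times, serine)
print(f"estimated serine at t = 0: {zero_time:.2f} nmol")
```

If degradation follows first-order kinetics instead, the same fit could be applied to the logarithm of the recoveries.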
• Separation and quantitation
• The amino acids can be separated by
ion-exchange chromatography then derivatized to
facilitate their detection. More commonly, the amino acids
are derivatized then resolved by reversed phase HPLC.
• An example of the ion-exchange chromatography is given
by the NTRC using sulfonated polystyrene as a matrix,
adding the amino acids in acid solution and passing a
buffer of steadily increasing pH through the column.
Amino acids are eluted when the pH reaches their
respective isoelectric points. Once the amino acids have
been separated, their respective quantities are
determined by adding a reagent that will form a coloured
derivative.
HISTORY OF PROTEIN SEQUENCING
• The advent of protein sequencing can be traced to two
almost parallel discoveries by Frederick Sanger and
Pehr Edman.
• In 1950, Pehr Edman published a paper demonstrating a
label-cleavage method for protein sequencing which was
later termed “Edman degradation”.
• Pehr Edman began his work in the Northrop-Kunitz
laboratory at the Princeton branch of the Rockefeller
Institute for Medical Research in 1947, where he attempted
to find a method to decode the amino acid sequence of a
protein using chemicals; specifically, he had early success
with fluorodinitrobenzene (FDNB) and phenylisothiocyanate
(PITC).
• Throughout his year at Princeton, Edman was able to
conduct enough experiments to understand that it was
feasible to use reagents like FDNB and PITC to determine
amino acid sequence.
• Edman returned to Sweden in 1947 and after two more
years of work he was able to publish his paper that would
describe the first successful method to sequence proteins [1]
• This groundbreaking paper described a method to
determine the amino acid sequence of a protein and would
come to be known as the Edman Degradation.
F.SANGER
• Around the same time Fred Sanger was developing his
own labeling and separation method which led to the
sequencing of insulin.
• For this work, Sanger was awarded the 1958 Nobel Prize
for Chemistry.
Plus and minus in the 1970s
• Fast-forward to the 1970s and we find Fred
Sanger at the forefront of nucleic acid sequencing.
• In 1975, whilst at the Laboratory of Molecular Biology in
Cambridge, Fred Sanger developed the “plus and minus”
method for DNA sequencing (Sanger and Coulson, 1975).
• Again there was competition in the field, with Maxam and
Gilbert working on degradation sequencing (Maxam and
Gilbert, 1977); however, their method was ultimately
displaced by the ease and quality of the Sanger method.
Plus and minus method
• A primer is extended by a polymerase to generate a population
of newly synthesized polydeoxyribonucleotides of assorted lengths;
the unused dNTPs are removed, and polymerization continues
in four pairs of plus and minus reaction mixtures; each minus
mixture has three dNTPs and each plus mixture has only one.
• After the second polymerization, the mixtures are fractionated by
gel electrophoresis, and each plus and minus pair is compared
to indicate the length of the new polydeoxyribonucleotide (by
the mobilities of the bands) and the position at which
polymerization had terminated as a result of the absence of the
missing dNTP.
• In 1945, five years before Edman's paper, Frederick Sanger
had demonstrated a method to determine the amino acid
residue located at the N-terminal end of a polypeptide chain
by using the reagent fluorodinitrobenzene.
• While it was thought that this method could, at most,
provide information about the N-terminus, Sanger was able
to take the method one step further.
• By using several proteolytic enzymes, partial hydrolysis
and an early version of chromatography, Sanger was able to
cleave the protein into fragments and piece together the
residues like a jigsaw puzzle.
• It wasn’t until 1955 that Sanger was able to present the
complete sequence of insulin which led to him being
awarded a Nobel Prize in Chemistry in 1958.
Other scientists
• Emile Zuckerkandl and Linus Pauling, whose work in the
mid-1960s advanced the use of nucleotide and protein
sequences to explore evolution.
• In the 1970s, Carl Woese used ribosomal RNA sequences
to define archaebacteria as a group of living organisms
distinct from other bacteria and eukaryotes.
Methods Of Protein Sequencing
Protein sequencing
• Technique to find out the sequence of amino acids in a
protein
Sequencing methods
1. N-terminal sequencing (Edman degradation)
2. C-terminal sequencing
3. Prediction from DNA sequence
EDMAN DEGRADATION
N-terminal sequencing
STEPS
• Protein purification
• Protein denaturation
• Protein digestion
• N-terminal labeling
• Separation of labeled amino acid by chromatography
• Detection through mass spectrometry
• Data analysis
Protein isolation (purification)
• 1. SDS-PAGE (sodium dodecyl sulfate-polyacrylamide gel
electrophoresis)
• 2. Two-dimensional gels
• The protein of interest is immobilized by being adsorbed
onto chemically modified glass or by electroblotting onto
a porous polyvinylidene fluoride (PVDF) membrane.
Protein hydrolysis (denaturation)
• Done by heating a sample of the protein in 6 M HCl at
100-110 °C for 24 hours or longer.
• This may degrade some amino acids.
• To avoid this, thiol reagents or phenol are used.
• Performic acid is used to oxidise intrachain or interchain
S-S (disulfide) bonds.
Protein digestion
• Use endoproteinase Lys-C, pepsin or trypsin (or the chemical
reagent CNBr) to digest proteins into a population of peptides
• Other enzymes include Glu-C and chymotrypsin
• Add enzyme at a 1:20 enzyme:protein ratio
• Incubate at room temperature for 6-9 h
• For better results, use a mixture of enzymes
N-terminal labeling
• The Edman reagent, phenylisothiocyanate (PITC), is
added to the adsorbed peptide, together with a mildly
basic buffer solution of 12% trimethylamine
• This reacts with the amine group of the N-terminal amino
acid
• The terminal amino acid can then be selectively detached
by the addition of anhydrous acid
• The derivative then isomerises to give a substituted
phenylthiohydantoin which can be washed off and
identified by chromatography, and the cycle can be
repeated
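The repeated label, cleave and identify cycle can be sketched as a toy simulation. This is only an illustration of the bookkeeping, not of the chemistry; the function name and the `n_blocked` flag are assumptions made for the example.

```python
def edman_sequence(peptide, cycles, n_blocked=False):
    """Return the residues released as PTH derivatives, one per cycle.

    peptide   : one-letter amino acid string, N-terminus first
    cycles    : number of Edman cycles to run
    n_blocked : True if the N-terminal amine is modified, so PITC cannot react
    """
    if n_blocked:
        # No free N-terminal amine, so no residue is ever released.
        return []
    released = []
    remaining = peptide
    for _ in range(cycles):
        if not remaining:
            break
        released.append(remaining[0])  # cleave and identify the N-terminal residue
        remaining = remaining[1:]      # the shortened peptide enters the next cycle
    return released


print(edman_sequence("GIVEQ", 3))                  # ['G', 'I', 'V']
print(edman_sequence("GIVEQ", 3, n_blocked=True))  # []
```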
CHROMATOGRAPHY
• Chromatography is a technique in which molecules are
separated by how they partition between a mobile phase
(the carrier) and a stationary phase, which depends on
properties such as volatility and bonding characteristics
• Derivatives of amino acids can be separated by
• 1. HPLC
• 2. Gas chromatography
• In gas chromatography (GC), the mobile phase is an inert
gas such as helium
MASS SPECTROMETRY
• Mass spectrometry (MS) is an analytical technique that
measures the mass-to-charge ratio of charged particles
• The MS principle consists of ionizing chemical
compounds to generate charged molecules or molecule
fragments and measuring their mass-to-charge ratios
• Separated amino acid derivatives are analyzed by mass
spectrometer
MS procedure
• A sample is loaded onto the MS instrument, and
undergoes vaporization
• The components of the sample are ionized by one of a
variety of methods (e.g., by impacting them with an
electron beam), which results in the formation of charged
particles (ions)
• The ions are separated according to their mass-to-charge
ratio in an analyzer by electromagnetic fields
• The ions are detected, usually by a quantitative method
• The ion signal is processed into mass spectra
Mass spectrometer
MS data analysis
• The first strategy for
identifying an unknown
compound is to compare
its experimental mass
spectrum against a
library of mass spectra
• Standard solutions of
amino acids are also
used and the resulting
pattern is compared with
standard spectrum.
Limitations of Edman degradation
• Needs pure samples of peptides
• Requires 40-60 min per amino acid
• Cannot analyze N-terminally modified peptides
Advantages
• Most reliable sequencing technique
C terminal sequence
C terminal
Definition:
The C-terminus (also known as the carboxyl-terminus,
carboxy-terminus, C-terminal tail, C-terminal end,
or COOH-terminus) is the end of an
amino acid chain (protein or polypeptide),
terminated by a free carboxyl group (-COOH).
C-terminal retention signals
• Proteins are naturally synthesized starting from
the N-terminus and ending at the C-terminus.
• While the N-terminus of a protein often contains
targeting signals, the C-terminus can contain
retention signals for protein sorting.
• The most common ER retention signal is the
amino acid sequence -KDEL (Lys-Asp-Glu-Leu)
or -HDEL (His-Asp-Glu-Leu) at the C-terminus.
This keeps the protein in the
endoplasmic reticulum and prevents it from
entering the secretory pathway.
C-terminal modifications
• The C-terminus of proteins can be modified
post-translationally, most commonly by the addition of a
lipid anchor to the C-terminus that allows the protein
to be inserted into a membrane without having a
transmembrane domain.
• Another form of C-terminal modification is the
addition of a phosphoglycan,
glycosylphosphatidylinositol (GPI), as a membrane
anchor. The GPI anchor is attached to the C-terminus
after proteolytic cleavage of a C-terminal propeptide.
The most prominent example of this type of
modification is the prion protein.
C-terminal domain:
• The C-terminal domain of some proteins has
specialized functions. In humans, the CTD
of RNA polymerase II typically consists of up
to 52 repeats of the sequence Tyr-Ser-Pro-
Thr-Ser-Pro-Ser [1]. This allows other proteins
to bind to the C-terminal domain of RNA
polymerase in order to activate polymerase
activity. These domains are then involved in the
initiation of DNA transcription.
C terminal sequencing technique
• Top-down sequencing by MALDI-ISD is used to
sequence the C-terminus of an amino acid chain.
• MALDI-MS: “matrix-assisted laser
desorption/ionization mass spectrometry”, through
which the C-terminus can be analyzed.
• This method is used when the N-terminus is
blocked and only the C-terminus is available.
• The technique can fragment and sequence both
the N- and C-termini in the same mass spectrum.
• Edman degradation is used only for N-terminal
sequencing.
• The most common method is to add
carboxypeptidases to a solution of the protein.
• Take samples at regular intervals and
determine the terminal amino acid by analyzing a plot
of amino acid concentration against time.
Use of peptidase:
• A peptide mixture is generated by cleavage of the
protein with cyanogen bromide and is incubated with
carboxypeptidase Y.
• The enzyme is only able to act on the C-terminal
fragment, because this is the only peptide without a
homoserine lactone residue at its C terminus.
• The resulting fragments, forming a peptide ladder, are
analyzed by matrix-assisted laser desorption/ionization
mass spectrometry (MALDI-MS).
• The entire protocol, including the CNBr cleavage, takes
21 h and can be applied to proteins purified either by
SDS-PAGE or by 2D PAGE or in solution.
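Reading a peptide ladder comes down to matching the mass difference between neighbouring ladder peaks to a residue mass. The sketch below assumes neutral monoisotopic masses and a hypothetical tolerance of 0.01 Da; the function name and the example peptide are illustrative only.

```python
# Monoisotopic residue masses (Da).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def read_ladder(masses, tol=0.01):
    """Call residues from a C-terminal ladder, in order of removal.

    Each step from a heavier to the next lighter peak corresponds to one
    residue removed from the C-terminus. Leu and Ile share a mass, so
    they are reported together as "L/I".
    """
    ordered = sorted(masses, reverse=True)
    calls = []
    for heavy, light in zip(ordered, ordered[1:]):
        delta = heavy - light
        hits = [aa for aa, m in MONO.items() if abs(m - delta) <= tol]
        calls.append("/".join(hits) if hits else "?")
    return calls


# Hypothetical ladder built from the peptide GAKDEL and three truncations.
ladder = [sum(MONO[a] for a in p) + WATER for p in ("GAKDEL", "GAKDE", "GAKD", "GAK")]
print(read_ladder(ladder))  # ['L/I', 'E', 'D']
```

The calls come out last residue first, so the C-terminal sequence here reads ...D-E-(L/I).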
Top down sequencing:
• Top-down proteomics is a method of protein
identification that uses an ion trapping
mass spectrometer to store an isolated protein ion for
mass measurement and tandem mass spectrometry
analysis.
• Top-down proteomics is capable of identifying and
quantitating unique proteoforms through the analysis of
intact protein.
• Top-down proteomics interrogates protein structure
through measurement of an intact mass followed by direct
ion dissociation in the gas phase.
• Fragmentation for tandem mass spectrometry
is accomplished by
electron-capture dissociation or
electron-transfer dissociation. Effective
fractionation is critical for sample handling
before mass-spectrometry-based proteomics.
Proteome analysis routinely involves digesting
intact proteins followed by inferred protein
identification using mass spectrometry.
• The main advantages of the top-down approach
include the ability to detect degradation
products, sequence variants, and combinations
of post-translational modifications.
MALDI MS top down sequencing:
• 0.5-1 µl of salt-free protein solution is placed on a MALDI
plate, covered with the MALDI matrix solution, and analyzed
in the in-source decay mode on an UltrafleXtreme mass
spectrometer. The generated mass spectrum (a complex
mass spectrum, exhibiting mainly c- and y-ions) is further
analyzed with the BioTools software or is processed via a
Mascot search.
• The output is a .pdf result file showing the sequence
coverage of the target sequence.
UltrafleXtreme mass spectrometer
• The UTX is used for a variety of MALDI applications,
including mass spectrometry imaging (MSI), protein
identification, peptide fingerprinting, and structure
identification for a wide spectrum of biomolecules
(including lipids, polymers, glycans).
MALDI ISD
PEPTIDE SEQUENCING BY
MASS SPECTROMETRY
Introduction
• MS/MS plays an important role in protein identification (fast and
sensitive)
• Deriving the peptide sequence is an important task in
proteomics
• Derivation without help from a protein database (“de novo
sequencing”) is especially important in the identification of
unknown proteins
Basic lab experimental steps
1. Proteins digested with an enzyme to produce peptides
2. Peptides charged (ionized) and separated according to their different m/z
ratios
3. Each peptide fragmented into ions and m/z values of fragment ions are
measured
• Steps 2 and 3 performed within a tandem mass spectrometer.
Mass spectrum
• Proteins consist of 20 different types of amino acids with different masses
(except for one pair, Leu and Ile, which have the same mass)
• Different peptides produce different spectra
• Use the spectrum of a peptide to determine its sequence
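As a small worked example of why masses carry sequence information, the sketch below sums monoisotopic residue masses to get a peptide mass and the m/z of its protonated ion. The function names are illustrative; the test peptide PEPTIDE is just a convenient example string.

```python
# Monoisotopic residue masses (Da); Leu and Ile are identical.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # one water is added for the free N- and C-termini
PROTON = 1.00728


def peptide_mass(seq):
    """Neutral monoisotopic mass of a linear, unmodified peptide."""
    return sum(MONO[a] for a in seq) + WATER


def mz(seq, charge):
    """m/z of the peptide carrying `charge` protons."""
    return (peptide_mass(seq) + charge * PROTON) / charge


print(f"{peptide_mass('PEPTIDE'):.2f}")  # 799.36
print(f"{mz('PEPTIDE', 2):.2f}")
# Swapping Ile for Leu leaves the mass unchanged, so MS alone cannot tell them apart.
print(peptide_mass("PEPTIDE") == peptide_mass("PEPTLDE"))
```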
Objectives
• Describe the steps of a typical peptide analysis by MS (proteomic
experiment)
• Explain peptide ionization, fragmentation, identification
Why are peptides, and not proteins, sequenced?
• Solubility under the same conditions
• Sensitivity of MS much higher for peptides
• MS efficiency
MS Peptide Experiment
Choice of Enzyme
Cleaving agent/Protease      Specificity
A. HIGHLY SPECIFIC
Trypsin                      Arg-X, Lys-X
Endoproteinase Glu-C         Glu-X
Endoproteinase Lys-C         Lys-X
Endoproteinase Arg-C         Arg-X
Endoproteinase Asp-N         X-Asp
B. NONSPECIFIC
Chymotrypsin                 Phe-X, Tyr-X, Trp-X, Leu-X
Thermolysin                  X-Phe, X-Leu, X-Ile, X-Met, X-Val, X-Ala
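The specificities of the highly specific proteases can be turned into a simple in silico digestion. This is a minimal sketch: "Arg-X" is read as cleavage after Arg and "X-Asp" as cleavage before Asp, and refinements such as the proline effect or missed cleavages are deliberately ignored. The `RULES` table and `digest` function are names made up for the example.

```python
# Enzyme -> (residues recognised, side of the residue that is cut).
RULES = {
    "trypsin": ("KR", "after"),
    "lys-c": ("K", "after"),
    "arg-c": ("R", "after"),
    "glu-c": ("E", "after"),
    "asp-n": ("D", "before"),
}


def digest(seq, enzyme):
    """Split a one-letter protein sequence at every site the enzyme recognises."""
    residues, side = RULES[enzyme]
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        if aa in residues:
            cut = i + 1 if side == "after" else i
            # Skip cuts that would produce an empty peptide at either end.
            if start < cut < len(seq):
                peptides.append(seq[start:cut])
                start = cut
    peptides.append(seq[start:])
    return peptides


print(digest("AKGRDEL", "trypsin"))  # ['AK', 'GR', 'DEL']
print(digest("AKGRDEL", "asp-n"))    # ['AKGR', 'DEL']
```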
ESI
[Figure: electrospray ionization. A liquid flow is sprayed into charged droplets and the resulting ions pass to a quadrupole (Q) or ion trap analyzer.]
ESI is a solution technique that gives a continuous stream of ions,
best for quadrupoles, ion traps, etc.
MALDI
[Figure: MALDI source. A 3 ns laser pulse strikes the solid sample on a target held at high voltage under high vacuum; ions pass to a TOF analyzer. Pressure regions labelled in the source figures: atmosphere, low vacuum, high vacuum.]
MALDI is a solid-state technique that gives ions in pulses,
best suited to time-of-flight MS.
MALDI or Electrospray?
• MALDI is limited to the solid state, ESI to liquids
• ESI is better for the analysis of complex mixtures, as it is directly interfaced to
a separation technique (e.g. HPLC or CE)
• MALDI is more “flexible” (MW from 200 to 400,000 Da)
Protein Identification Strategy
[Figure: a protein mixture is digested to peptides, which undergo 1D, 2D or 3D peptide separation (chromatogram axis: Time (min)). Selected peptides pass through a tandem mass spectrometer (Q1, Q2 collision cell, Q3) to give a tandem mass spectrum (axes: m/z and relative intensity in %). Acquired spectra are compared with theoretical spectra by correlative sequence database searching, leading to protein identification. The example spectrum, repeated three times on the slide, is a TOF MS/MS spectrum of precursor m/z 785.60 (ES+), acquired 10-Mar-2005.]
Breaking Protein into Peptides and Peptides into Fragment Ions
• Proteases, e.g. trypsin, break proteins into
peptides
• MS/MS breaks the peptides down into fragment
ions and measures the mass of each piece
• MS measures the m/z ratio of an ion
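To make the idea of fragment ions concrete, the sketch below lists the singly charged b ions (N-terminal pieces) and y ions (C-terminal pieces) produced by breaking each peptide bond once. It assumes unmodified residues and monoisotopic masses; the function name and the example peptide GAK are illustrative.

```python
# Monoisotopic residue masses (Da).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(seq):
    """Return (b, y) lists of singly charged fragment m/z values.

    b[i] and y[i] come from the same backbone break: b[i] holds the
    first i + 1 residues and y[i] holds the remaining residues.
    """
    b, y = [], []
    for i in range(1, len(seq)):
        b.append(sum(MONO[a] for a in seq[:i]) + PROTON)
        y.append(sum(MONO[a] for a in seq[i:]) + WATER + PROTON)
    return b, y


b_ions, y_ions = fragment_ions("GAK")
print([round(m, 4) for m in b_ions])
print([round(m, 4) for m in y_ions])
```

Differences between consecutive b ions (or consecutive y ions) equal residue masses, which is what makes it possible to read a sequence from the spectrum.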
Peptide fragmentation
[Figure: peptide backbone fragmentation. Amino acids differ in their side chains; predominant fragmentation occurs at the weakest bonds. Peptides tend to fragment at the C-terminal side of Asp (D).]
Source: Ruedi Aebersold and David R. Goodlett, "Mass Spectrometry in Proteomics", Chem. Rev. 2001, 101, 269-295.
Protein Identification by MS
[Figure: two paths converge on a MATCH. Experimental path: spot removed from gel, fragmented using trypsin, spectrum of fragments generated. Theoretical path: database of sequences (e.g. SwissProt), artificially trypsinated, artificial spectra built into a library.]
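The matching step can be sketched as a toy peptide-mass comparison: digest every database sequence in silico, compute peptide masses, and count how many observed masses find a partner within a tolerance. The two database entries, their names and the 0.01 Da tolerance are hypothetical; real search engines score matches far more carefully.

```python
# Monoisotopic residue masses (Da).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_peptides(seq):
    """Cleave after every Lys or Arg (proline effect ignored)."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        if aa in "KR":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])
    return peptides


def mass(peptide):
    return sum(MONO[a] for a in peptide) + WATER


def score(observed, seq, tol=0.01):
    """Number of observed masses matched by a theoretical tryptic peptide."""
    theoretical = [mass(p) for p in tryptic_peptides(seq)]
    return sum(any(abs(o - t) <= tol for t in theoretical) for o in observed)


def identify(observed, database):
    """Name of the database entry with the most matched masses."""
    return max(database, key=lambda name: score(observed, database[name]))


# Hypothetical two-entry database and "observed" masses taken from protB.
DATABASE = {"protA": "GASKLLDER", "protB": "MTWKYFHR"}
observed = [mass(p) for p in tryptic_peptides(DATABASE["protB"])]
print(identify(observed, DATABASE))  # protB
```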
Conclusions
• MS of peptides enables high throughput identification and
characterization of proteins in biological systems
• “de novo sequencing” can be used to identify unknown proteins not
found in protein databases
Prediction From DNA Sequence
• The rapid increase of publicly available sequences and protein
structures means that an increasing amount of information can
be obtained for any protein sequence through its relatedness to
others.
• If a set of homologous proteins can be found and aligned, the
information content at each position in the alignment profile is far
greater than in any single member of the family, and any
structural or functional prediction algorithm should utilize this
collective information. Profile information of this type is extremely
sensitive to the quality of the multiple alignment, and distant
homologues should only be included in the alignment if they can
be aligned with confidence.
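The information content at one alignment position can be estimated, in a simplified form, as the maximum possible entropy minus the observed Shannon entropy of the column. The sketch below assumes a 20-letter alphabet, ignores gaps and applies no sequence weighting or background correction; the alignment is a made-up example.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Information content of an alignment column in bits.

    A fully conserved column scores log2(alphabet_size); a column with
    many different residues scores lower. Gap characters are ignored.
    """
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


# Hypothetical alignment of three homologous fragments.
alignment = ["MKLV", "MRLI", "MKIV"]
for position, column in enumerate(zip(*alignment), start=1):
    print(position, "".join(column), round(column_information(column), 2))
```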
[Figure 17.3b-3: overview of gene expression. Inside the nuclear envelope, DNA is transcribed to pre-mRNA, which is processed to mRNA; at the ribosome, mRNA is translated into a polypeptide. A detail panel shows a DNA molecule with Gene 1, Gene 2 and Gene 3, a DNA template strand transcribed to mRNA, and mRNA codons translated to amino acids (Trp, Phe, Gly, Ser).]
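Predicting a protein sequence from DNA amounts to conceptual translation of the coding sequence, three bases at a time. The sketch below uses the standard genetic code and assumes the input is the coding (non-template) strand, already in frame; the example sequence is chosen to encode Trp-Phe-Gly-Ser, matching the amino acids labelled in the figure.

```python
# Standard genetic code, with codons ordered by base T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(coding_dna):
    """Translate an in-frame coding strand (5' to 3') until a stop codon."""
    protein = []
    for pos in range(0, len(coding_dna) - 2, 3):
        residue = CODON_TABLE[coding_dna[pos:pos + 3].upper()]
        if residue == "*":  # stop codon
            break
        protein.append(residue)
    return "".join(protein)


print(translate("TGGTTTGGCTCA"))  # WFGS (Trp-Phe-Gly-Ser)
```

In practice a gene prediction also has to find the correct reading frame and, in eukaryotes, remove introns before this step.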
• Protein structure prediction is the inference of the
three-dimensional structure of a protein from its
amino acid sequence—that is, the prediction of its folding
and its secondary and tertiary structure from its
primary structure.
• Structure prediction is fundamentally different from the
inverse problem of protein design.
• Protein structure prediction is one of the most
important goals pursued by bioinformatics and
theoretical chemistry; it is highly important in
medicine (for example, in drug design) and
biotechnology (for example, in the design of
novel enzymes).
• Protein structure and terminology
• Proteins are chains of amino acids joined together by peptide bonds.
Many conformations of this chain are possible due to the rotation of the
chain about each Cα atom. It is these conformational changes that are
responsible for differences in the three dimensional structure of proteins.
• Each amino acid in the chain is polar, i.e. it has separated positive and
negative charged regions with a free C=O group, which can act as
hydrogen bond acceptor and an NH group, which can act as hydrogen
bond donor. These groups can therefore interact in the protein structure.
• The 20 amino acids can be classified according to the chemistry of the
side chain which also plays an important structural role. Glycine takes
on a special position, as it has the smallest side chain, only one
Hydrogen atom, and therefore can increase the local flexibility in the
protein structure. Cysteine on the other hand can react with another
cysteine residue and thereby form a cross link stabilizing the whole
structure.
• The protein structure can be considered as a sequence of
secondary structure elements, such as α helices and β sheets,
which together constitute the overall three-dimensional
configuration of the protein chain. In these secondary structures
regular patterns of H bonds are formed between neighboring
amino acids, and the amino acids have similar Φ and Ψ angles.
[Figure: bond angles for ψ and ω]
• The formation of these structures neutralizes the polar groups
on each amino acid. The secondary structures are tightly
packed in the protein core in a hydrophobic environment. Each
amino acid side group has a limited volume to occupy and a
limited number of possible interactions with other nearby side
chains, a situation that must be taken into account in molecular
modeling and alignments.
• α Helix
• The α helix is the most abundant type of secondary structure
in proteins. The α helix has 3.6 amino acids per turn, with an H
bond formed between every fourth residue; the average length
is 10 amino acids (3 turns), or about 15 Å, but varies from 5 to 40
amino acids (1.5 to 11 turns).
• The alignment of the H bonds creates a dipole moment for the
helix with a resulting partial positive charge at the amino end
of the helix. Because this region has free NH groups, it will
interact with negatively charged groups such as phosphates.
• The most common location of α helices is at the surface of
protein cores, where they provide an interface with the
aqueous environment. The inner-facing side of the helix tends
to have hydrophobic amino acids and the outer-facing side
hydrophilic amino acids.
• Thus, every third or fourth amino acid along the chain will
tend to be hydrophobic, a pattern that can be quite readily
detected. In the leucine zipper motif, a repeating pattern
of leucines on the facing sides of two adjacent helices is
highly predictive of the motif.
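One common way to detect this periodic hydrophobic pattern is a hydrophobic moment: each residue's hydropathy is treated as a vector pointing out from the helix axis, rotated 100° per residue (360° / 3.6), and the vectors are summed. The sketch below uses the Kyte-Doolittle hydropathy scale; the function name and the test peptides are illustrative assumptions.

```python
import math

# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_moment(seq, degrees_per_residue=100.0):
    """Length of the summed hydropathy vectors around the helix axis.

    A large value means hydrophobic residues cluster on one face of the
    helix; a value near zero means they are spread evenly around it.
    """
    step = math.radians(degrees_per_residue)
    x = sum(KD[a] * math.cos(i * step) for i, a in enumerate(seq))
    y = sum(KD[a] * math.sin(i * step) for i, a in enumerate(seq))
    return math.hypot(x, y)


# Hypothetical peptides: a mixed Leu/Lys sequence and a uniform poly-Leu stretch.
print(round(hydrophobic_moment("LKKLLKKLLKKLLKKLLK"), 2))
print(round(hydrophobic_moment("L" * 18), 2))  # 0.0: no face is favoured
```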
• β sheet
• β sheets are formed by H bonds between an average of
5–10 consecutive amino acids in one portion of the chain
with another 5–10 farther down the chain.
• The interacting regions may be adjacent, with a short loop
in between, or far apart, with other structures in between.
Every chain may run in the same direction to form a
parallel sheet, every other chain may run in the reverse
chemical direction to form an antiparallel sheet, or the
chains may be parallel and antiparallel to form a mixed
sheet.
• The pattern of H bonding is different in the parallel and
antiparallel configurations. Each amino acid in the interior
strands of the sheet forms two H bonds with neighboring
amino acids, whereas each amino acid on the outside
strands forms only one bond with an interior strand.
• Looking across the sheet at right angles to the strands,
more distant strands are rotated slightly counterclockwise
to form a left-handed twist. The Cα atoms alternate above
and below the sheet in a pleated structure, and the R side
groups of the amino acids alternate above and below the
pleats.
• Loop
• Loops are regions of a protein chain that are
• (1) between α helices and β sheets,
• (2) of various lengths and three-dimensional
configurations, and
• (3) on the surface of the structure.
• Hairpin loops that represent a complete turn in the
polypeptide chain joining two antiparallel β strands may
be as short as two amino acids in length.
• Loops interact with the surrounding aqueous
environment and other proteins. Because amino
acids in loops are not constrained by space and
environment as are amino acids in the core
region, and do not have an effect on the
arrangement of secondary structures in the core,
more substitutions, insertions, and deletions may
occur. Thus, in a sequence alignment, the
presence of these features may be an indication
of a loop.
• The positions of introns in genomic DNA sometimes
correspond to the locations of loops in the encoded
protein. Loops also tend to have charged and polar amino
acids and are frequently a component of active sites. A
detailed examination of loop structures has shown that
they fall into distinct families.
• Coils
• A region of secondary structure that is not an α helix, a β
sheet, or a recognizable turn is commonly referred to as a
coil.
Applications of Protein Sequencing
• In functional genomics:
Functional genomics is a field of molecular biology that
attempts to make use of the vast wealth of data produced
by genomic and transcriptomic projects (such as genome
sequencing projects and RNA sequencing) to
describe gene (and protein) functions and interactions.
Unlike genomics, functional genomics focuses on the
dynamic aspects such as gene transcription, translation,
regulation of gene expression and protein–protein
interactions, as opposed to the static aspects of the
genomic information such as DNA sequence or structures.
• The goal of functional genomics is to understand the
relationship between an organism's genome and
its phenotype. The term functional genomics is often used
broadly to refer to the many possible approaches to
understanding the properties and function of the entirety
of an organism's genes and gene products.
• The promise of functional genomics is to expand and
synthesize genomic and proteomic knowledge into an
understanding of the dynamic properties of an organism
at cellular and/or organismal levels. This would provide a
more complete picture of how biological function arises
from the information encoded in an organism's genome.
• The possibility of understanding how a particular mutation
leads to a given phenotype has important implications for
human genetic diseases, as answering these questions
could point scientists in the direction of a treatment or
cure.
Prediction of protein function from
protein sequence and structure
The sequence of a genome contains the plans of the
possible life of an organism, but implementation of genetic
information depends on the functions of the proteins and
nucleic acids that it encodes. Many individual proteins of
known sequence and structure present challenges to the
understanding of their function.
• In particular, a number of genes responsible for
diseases have been identified but their specific
functions are unknown. Whole-genome sequencing
projects are a major source of proteins of unknown
function. Annotation of a genome involves
assignment of functions to gene products, in most
cases on the basis of amino-acid sequence alone.
3D structure can aid the assignment of function, motivating
the challenge of structural genomics projects to make
structural information available for novel uncharacterized
proteins. Structure-based identification of homologues often
succeeds where sequence-alone-based methods fail,
because in many cases evolution retains the folding pattern
long after sequence similarity becomes undetectable.
• Nevertheless, prediction of protein function from sequence and
structure is a difficult problem, because homologous proteins often
have different functions. Many methods of function prediction rely
on identifying similarity in sequence and/or structure between a
protein of unknown function and one or more well-understood
proteins. Alternative methods include inferring conservation
patterns in members of a functionally uncharacterized family for
which many sequences and structures are known.
In Proteomics
Proteomics is the large-scale study of proteomes. A
proteome is a set of proteins produced in an organism,
system, or biological context.
The proteome is not constant; it differs from cell to cell
and changes over time. To some degree, the proteome
reflects the underlying transcriptome. However, protein
activity (often assessed by the reaction rate of the
processes in which the protein is involved) is also
modulated by many factors in addition to the expression
level of the relevant gene.
Protein sequencing denotes the process of finding the
amino acid sequence, or primary structure of a protein.
Sequencing plays a very vital role in Proteomics as the
information obtained can be used to deduce function,
structure, and location which in turn aids in identifying new
or novel proteins as well as understanding of cellular
processes. Better understanding of these processes allows
for creation of drugs that target specific metabolic pathways
among other things.
In Bioinformatics
What is bioinformatics?
In recent years, molecular biology has witnessed
an information revolution as a result of the
development of rapid DNA sequencing techniques
and the corresponding progress in computer-based
technologies, which are allowing us to cope with
this information deluge in increasingly efficient
ways. The term that was coined to encompass
computer applications in biological sciences
is bioinformatics.
The term bioinformatics is now used to
mean rather different things, from artificial
intelligence and robotics to genome
analysis. The term was originally applied to
the computational manipulation and
analysis of biological sequence data (DNA
and/or protein), but now tends also to be
used to embrace the manipulation and
analysis of 3D structural data.
Identifying protein-coding genes in
genomic sequences
The vast majority of the biology of a newly sequenced
genome is inferred from the set of encoded proteins.
Predicting this set is therefore invariably the first step after
the completion of the genome DNA sequence.
The genome sequence is an organism's blueprint: the set of
instructions dictating its biological traits. The unfolding of
these instructions is initiated by the transcription of the DNA
into RNA sequences. According to the standard model, the
majority of RNA sequences originate from protein-coding
genes; that is, they are processed into messenger RNAs
(mRNAs) which, after their export to the cytosol, are
translated into proteins.
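The idea of predicting the encoded protein set from a DNA sequence can be sketched with a naive open reading frame (ORF) scan. This is a simplified illustration only: it assumes the standard genetic code, scans only the three forward frames, and ignores introns, the reverse strand, and every statistical signal that real gene finders use. The names `translate` and `find_orfs` are chosen for this sketch.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order; "*" marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate a coding DNA sequence codon by codon until a stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # "X" for unrecognised codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def find_orfs(dna: str, min_codons: int = 2):
    """Return (start, end, protein) for ATG...stop ORFs in the three forward frames."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(dna):
            if dna[i:i + 3] == "ATG":
                # Walk in-frame to the first stop codon.
                for j in range(i, len(dna) - 2, 3):
                    if CODON_TABLE.get(dna[j:j + 3]) == "*":
                        protein = translate(dna[i:j])
                        if len(protein) >= min_codons:
                            orfs.append((i, j + 3, protein))
                        i = j  # resume scanning after this stop codon
                        break
            i += 3
    return orfs


print(find_orfs("CCATGTTTGGCTGGTCATAACC"))
```

Eukaryotic genomes need much more than this, since coding regions are interrupted by introns and the mRNA is processed before translation.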
• To determine protein folding
• Protein folding is the process by which a protein
assumes its functional shape or conformation.
• It is the physical process by which a protein chain
acquires its native three-dimensional structure, a
conformation that is usually biologically functional, in
an expeditious and reproducible manner.
• The polypeptide folds into its characteristic and
functional three-dimensional structure from a random coil.
Each protein exists as an unfolded polypeptide, or random
coil, immediately after being translated from an mRNA
sequence into a linear chain of amino acids.
• All protein molecules are heterogeneous unbranched
chains of amino acids.
• By coiling and folding into a specific three-dimensional
shape they are able to perform their biological function.
• Proteins are formed from long chains of amino acids;
they exist in an array of different structures, which often
dictate their functions. Proteins follow energetically
favorable pathways to form stable, ordered structures; the
result is known as the protein's native structure.
• Most proteins can only perform their various functions
when they are folded. Scientists believe that the
instructions for folding a protein are encoded in its
amino acid sequence. Researchers can readily determine
the sequence of a protein, but have not fully cracked the
code that governs folding.
In Drug Production
What is a Protein Drug?
A type of drug made of protein. These drugs usually have a large molecular
weight and protein characteristics.
structure of an unusual class of proteins called beta-peptides. Eventually, these
peptides could become the basis for drugs that are cheaper to manufacture than
existing protein-based pharmaceuticals and last longer in the body.
A drug's efficiency may be affected by the degree to which it binds to
the proteins within blood plasma. The less bound a drug is, the more efficiently
it can traverse cell membranes or diffuse. Common blood proteins that drugs
bind to are human serum albumin, lipoprotein, glycoprotein, and α, β, and γ globulins.
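The effect of plasma protein binding can be illustrated with a small calculation. It assumes the usual simplification that only the unbound fraction of a drug is free to cross membranes, so the free concentration is the total concentration times (1 − fraction bound). The function name and the numbers are hypothetical and chosen only for illustration.

```python
def free_concentration(total_conc: float, fraction_bound: float) -> float:
    """Unbound drug concentration, in the same units as total_conc."""
    if not 0.0 <= fraction_bound <= 1.0:
        raise ValueError("fraction_bound must be between 0 and 1")
    return total_conc * (1.0 - fraction_bound)


# Two hypothetical drugs at the same total plasma concentration (10 units):
print(free_concentration(10.0, 0.50))  # moderately bound drug
print(free_concentration(10.0, 0.99))  # highly bound drug, far less free drug
```

For a highly bound drug, a small change in binding produces a large relative change in the free concentration.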
proteinsequencing powerppint presentation

proteinsequencing powerppint presentation

  • 2.
  • 3.
  • 4.
    What is Protein •Any of a class of nitrogenous organic compounds which have large molecules composed of one or more long chains of amino acids and are an essential part of all living organisms, especially as structural components of body tissues such as muscle, hair, etc., and as enzymes and antibodies."a protein found in wheat"
  • 5.
    What is sequence •a particular order in which related things follow each other. • a set of related events, movements, or items that follow each other in a particular order.
  • 6.
    Protein Sequencing • Proteinsequencing is the practical process of determining the amino acid sequence of all or part of a protein or peptide. This may serve to identify the protein or characterize its post-translational modifications.
  • 7.
    • Typically, partialsequencing of a protein provides sufficient information (one or more sequence tags) to identify it with reference to databases of protein sequences derived from the conceptual translation of genes.
  • 8.
    • The twomajor direct methods of protein sequencing are mass spectrometry and Edman degradation using a protein sequenator (sequencer). Mass spectrometry methods are now the most widely used for protein sequencing and identification but Edman degradation remains a valuable tool for characterizing a protein's N- terminus.
  • 9.
    Why we doProtein sequencing?? • Determining amino acid composition. • It is often desirable to know the unordered amino acid composition of a protein prior to attempting to find the ordered sequence, as this knowledge can be used to facilitate the discovery of errors in the sequencing process or to distinguish between ambiguous results
  • 10.
    • . Knowledgeof the frequency of certain amino acids may also be used to choose which protease to use for digestion of the protein. The disincorporation of low levels of non-standard amino acids (e.g. norleucine) into proteins may also be determined. • A generalized method often referred to as amino acid analysis for determining amino acid frequency is as follows: • Hydrolyse a known quantity of protein into its constituent amino acids. • Separate and quantify the amino acids in some way.
  • 11.
    • Hydrolysis • Hydrolysisis done by heating a sample of the protein in 6 M hydrochloric acid to 100–110 °C for 24 hours or longer. Proteins with many bulky hydrophobic groups may require longer heating periods. However, these conditions are so vigorous that some amino acids (serine, threonine, tyrosine, tryptophan, glutamine, and cysteine) are degraded. To circumvent this problem,
  • 12.
    • Biochemistry Onlinesuggests heating separate samples for different times, analysing each resulting solution, and extrapolating back to zero hydrolysis time. Rastall suggests a variety of reagents to prevent or reduce degradation, such as thiol reagents or phenol to protect tryptophan and tyrosine from attack by chlorine, and pre- oxidising cysteine. He also suggests measuring the quantity of ammonia evolved to determine the extent of amide hydrolysis.
  • 13.
    • Separation andquantitation • The amino acids can be separated by ion-exchange chromatography then derivatized to facilitate their detection. More commonly, the amino acids are derivatized then resolved by reversed phase HPLC.
  • 14.
    • An exampleof the ion-exchange chromatography is given by the NTRC using sulfonated polystyrene as a matrix, adding the amino acids in acid solution and passing a buffer of steadily increasing pH through the column. Amino acids are eluted when the pH reaches their respective isoelectric points. Once the amino acids have been separated, their respective quantities are determined by adding a reagent that will form a coloured derivative.
  • 15.
  • 16.
    • The adventof protein sequencing can be traced to two almost parallel discoveries by Frederick Sanger and Pehr Edman. • In 1950, Pehr Edman published a paper demonstrating a label-cleavage method for protein sequencing which was later termed “Edman degradation”.
  • 17.
    • Pehr Edmanbegan his work in the Northrop-Kunitz laboratory at the Princeton branch of the Rockefeller Institute of Medical Research in 1947 • where he attempted to find a method to decode the amino acid sequence of a protein using chemicals; specifically he had early success with • fluorodinitrobenzene (FDNB) and phenylisothiocyanate (PITC).
  • 19.
    • Throughout hisyear at Princeton, Edman was able to conduct enough experiments to understand that it was feasible to use reagents like FDNB and PITC to determine amino acid sequence. • Edman returned to Sweden in 1947 and after two more years of work he was able to publish his paper that would describe the first successful method to sequence proteins [1] • This ground breaking paper described a method to determine the amino acid sequence of a protein and would come to be known as the Edman Degradation.
  • 20.
    F.SANGER • Around thesame time Fred Sanger was developing his own labeling and separation method which led to the sequencing of insulin. • For this work, Sanger was awarded the 1958 Nobel Prize for Chemistry.
  • 22.
    Plus and minusin the 1970’s • Fast-forward once again to the 1970’s and we find Fred Sanger still at the forefront of nucleic acid sequencing. • In 1975 whilst at the Laboratory of Molecular Biology in Cambridge, Fred Sanger developed the “plus and minus” method for DNA sequencing (Sanger and Coulson, 1975). • Again there was competition in the field with Maxam and Gilbert working on degradation sequencing (Maxam and Glibert, 1977) however, their method was ultimately to falter due to the ease and quality of the Sanger method.
  • 23.
    plus and minusmethod • A primer is extended by a polymerase to generate a population of newly synthesized deoxyribonucleotides of assorted lengths; the unused dNTPs are removed, and polymerization continues in four pairs of plus and minus reaction mixtures; the minus mixtures have three NTPs and the plus mixtures have only one. • After a second polymerization, the mixtures are fractionated by gel electrophoresis, and each plus and minus pair is compared to indicate the length of the new polydeoxyribonucleotide (by the mobilities of the bands) and the position at which polymerization had terminated as a result of the absence of the missing dNTP
  • 24.
    • Five yearsearlier, Frederick Sanger had demonstrated a method to determine the amino acid residue located on the N-terminal end of a polypeptide chain by using the reagent fluorodinitrobenzene. • While it was thought, that at most, this method could only provide the sequences found on the N-terminal, • Sanger was able to take the method one step further.
  • 25.
    • By usingseveral proteolytic enzymes, partial hydrolysis and early version of chromatography, Sanger was able to cleave the protein into fragments and piece together the residues like a jigsaw puzzle. • It wasn’t until 1955 that Sanger was able to present the complete sequence of insulin which led to him being awarded a Nobel Prize in Chemistry in 1958.
  • 26.
    Other scientist • EmileZuckerkandl and Linus Pauling, whose work in the mid1960s advanced the use of nucleotide and protein sequences to explore evolution • In the 1970s,Carl Woese used ribosomal RNA sequences to define archaebacteria as a group of living organisms distinct from other bacteria and eukaryotes
  • 27.
  • 28.
    Protein sequencing • Techniqueto find out the sequence of amino acids in a protein Sequencing methods 1-N-terminal sequencing (Edman degradation) 2-C-terminal sequencing 3-Prediction from DNA sequence
  • 29.
  • 30.
    STEPS • Protein purification •Protein denaturation • Protein digestion • N-terminal labeling • Separation of labeled amino acid by chromatography • Detection through mass spectrometry • Data analysis
  • 31.
    Protein isolation(purification) • 1-SDS-PAGE (sodiumdodecyl sulfate-poly acryl amide gel) 2-Two dimensional gels Protein of interest is immobilized by being absorbed onto a chemically modified glass or by electro blotting onto a porous polyvinylidene fluoride (PVDF) membrane.
  • 32.
    Protein hydrolysis(denaturation) by heatinga sample of the protein in 6 Molar HCL up to 100-110 degrees Celsius for 24 hours or longer It may degrade some amino acids To avoid this Thiol reagents or phenol are used - Performic acid for intra chain or inter chain S-S bonds
  • 33.
    Protein digestion • UseEndoproteinase Lys-C, CNBr, Pepsin or trypsin to digest proteins into a population of peptides • Other enzymes include Glu-C and chymotrypsin • Add enzyme at 1:20 enzyme: protein ratio • incubate at room temperature for 6-9hrs • For better results use mixture of enzymes
  • 34.
    N-terminal labeling • TheEdman reagent, phenylisothiocyanate (PTC), is added to the adsorbed peptide, together with a mildly basic buffer solution of 12% trimethylamine • This reacts with the amine group of the N-terminal amino acid • The terminal amino acid can then be selectively detached by the addition of anhydrous acid • The derivative then isomerises to give a substituted phenylthiohydantoin which can be washed off and identified by chromatography, and the cycle can be repeated
  • 36.
    CHROMATOGRAPHY • Chromatography isa technique in which molecules are separated based on volatility and bond characteristics when subjected to a carrier • Derivatives of amino acid can be separated by • 1-HPLC • 2-Gas chromatography • In gas chromatography (GC), the mobile phase is an inert gas such as helium
  • 37.
    MASS SPECTROMETERY • Massspectrometry (MS) is an analytical technique that measures the mass-to-charge ratio of charged particles • The MS principle consists of ionizing chemical compounds to generate charged molecules or molecule fragments and measuring their mass-to-charge ratios • Separated amino acid derivatives are analyzed by mass spectrometer
  • 38.
    MS procedure • Asample is loaded onto the MS instrument, and undergoes vaporization • The components of the sample are ionized by one of a variety of methods (e.g., by impacting them with an electron beam), which results in the formation of charged particles (ions) • The ions are separated according to their mass-to-charge ratio in an analyzer by electromagnetic fields • The ions are detected, usually by a quantitative method • The ion signal is processed into mass spectra
  • 39.
  • 40.
    MS data analysis •first strategy for identifying an unknown compound is to compare its experimental mass spectrum against a library of mass spectra • Standard solutions of amino acids are also used and the resulting pattern is compared with standard spectrum.
  • 41.
    Limitations of Edmandegradation • Need Pure Samples of Peptides • Requires 40-60 min / Amino Acid • Can’t Analyze N-Terminally Modified Peptides •Advantages •Most Reliable Sequencing Technique
  • 42.
  • 43.
    C terminal Definition: The C-terminus(also known as the carboxyl-terminus, carboxyl- terminus, C-terminal tail, C-terminal end, or COOH-terminus) is the end of an amino acid chain (protein or polypeptide), terminated by a free carboxyl group (- COOH).
  • 45.
    C-terminal retention signals •Proteins are naturally synthesized starting from the N-terminus and ending at the C-terminus. • While the N-terminus of a protein often contains targeting signals, the C-terminus can contain retention signals for protein sorting. • The most common ER retention signal is the amino acid sequence -KDEL (Lys-Asp-Glu-Leu) or -HDEL (His-Asp-Glu-Leu) at the C-terminus. This keeps the protein in the endoplasmic reticulum and prevents it from entering the secretory pathway.
  • 46.
    C-terminal modifications • TheC-terminus of proteins can be modified post translationally, most commonly by the addition of a lipid anchor to the C-terminus that allows the protein to be inserted into a membrane without having a trans membrane domain. • Another form of C-terminal modification is the addition of a phosphoglycan, glycosylphosphatidylinositol (GPI), as a membrane anchor. The GPI anchor is attached to the C-terminus after proteolytic cleavage of a C-terminal propeptied. The most prominent example for this type of modification is the prion protein.
  • 47.
    C-terminal domain: • TheC-terminal domain of some proteins has specialized functions. In humans, the CTD of RNA polymerase II typically consists of up to 52 repeats of the sequence Tyr-Ser-Pro- Thr-Ser-Pro-Ser.[1] This allows other proteins to bind to the C-terminal domain of RNA polymerase in order to activate polymerase activity. These domains then involved in the initiation of DNA transcription.
  • 48.
    C terminal sequencingtechnique • Top Down sequencing by MALDI ISD is used to sequence the c terminal of amino acid chain. • MALDI MS: “matrix-assisted laser desorption/ionization mass spectrometry” through which the c-terminal can be analyzed. • This method is used when the N-terminal is blocked and there is only C-terminal available. • The technique can fragment and sequence both the N- and C-terminal in the same mass spectrum.
  • 49.
    • Admen degradationis only used for N-terminal sequencing. • The most common method is to add carboxy peptidases to a solution of the protein. • Take a sample at regular at regular intervals and determine the terminal amino acid by analyzing a plot amino acid concentration and time.
  • 50.
    Use of peptidase: •A peptide mixture is generated by cleavage of the protein with cyanogen bromide and is incubated with carboxy peptidase Y. • The enzyme is only able to act on the C-terminal fragment, because this is the only peptide without a homoserine lactone residue at its C terminus. • The resulting fragments, forming a peptide ladder, are analyzed by matrix-assisted laser desorption/ionization mass spectrometry (MALDI-MS). • The entire protocol, including the CNBr cleavage, takes 21 h and can be applied to proteins purified either by SDS-PAGE or by 2D PAGE or in solution.
  • 51.
    Top down sequencing: •Top-down proteomics is a method of protein identification that uses an ion trapping mass spectrometer to store an isolated protein ion for mass measurement and tandem mass spectrometry analysis. • Top-down proteomics is capable of identifying and quantitating unique proteoforms through the analysis of intact protein. • Top-down proteomics interrogates protein structure through measurement of an intact mass followed by direct ion dissociation in the gas phase.
  • 52.
    • Fragmentation fortandem mass spectrometry is accomplished by electron-capture dissociation or electron-transfer dissociation. Effective fractionation is critical for sample handling before mass-spectrometry-based proteomics. Proteome analysis routinely involves digesting intact proteins followed by inferred protein identification using mass spectrometry. • The main advantages of the top-down approach include the ability to detect degradation products, sequence variants, and combinations of post-translational modifications.
  • 53.
    MALDI MS topdown sequencing: • 0.5-1 ml salt-free protein solution placed on a MALDI- plate, covered with the MALDI Matrix solution, is analyzed in the in-source decay mode on an UltrafleXtreme mass spectrometer. The generated mass spectrum (a complex mass spectrum, exhibiting mainly c- and y ions) is further analyzed with the software Bio. tools or is processed via a Mascot search. • A .pdf result file with sequence coverage of the target sequence would be the result output.
  • 54.
    UltrafleXtreme mass spectrometer •The UTX is used for a variety of MALDI applications, including mass spectrometry imaging (MSI), protein identification, peptide fingerprinting, and structure identification for a wide spectrum of biomoles (including lipids, polymers, glycans).
  • 55.
  • 57.
  • 58.
    Introduction • MS/MS playsimportant role in protein identification (fast and sensitive) • Derivation of peptide sequence an important task in proteomics • Derivation without help from a protein database (“de novo sequencing”), especially important in identification of unknown protein
  • 59.
    Basic lab experimentalsteps 1. Proteins digested w/ an enzyme to produce peptides 2. Peptides charged (ionized) and separated according to their different m/z ratios 3. Each peptide fragmented into ions and m/z values of fragment ions are measured • Steps 2 and 3 performed within a tandem mass spectrometer.
  • 60.
    Mass spectrum • Proteinsconsist of 20 different types of a. a. with different masses (except for one pair Leu and Ile) • Different peptides produce different spectra • Use the spectrum of a peptide to determine its sequence
  • 61.
    Objectives • Describe thesteps of a typical peptide analysis by MS (proteomic experiment) • Explain peptide ionization, fragmentation, identification
  • 62.
    Why are peptides,and not proteins, sequenced? • Solubility under the same conditions • Sensitivity of MS much higher for peptides • MS efficiency
  • 63.
  • 64.
    Choice of Enzyme Cleaving agent/Proteases Specificity A.HIGHLY SPECIFIC Trypsin Arg-X, Lys-X Endoproteinase Glu-C Glu-X Endoproteinase Lys-C Lys-X Endoproteinase Arg-C Arg-X Endoproteinase Asp-N X-Asp B. NONSPECIFIC Chymotrypsin Phe-X, Tyr-X, Trp-X, Leu-X Thermolysin X-Phe, X-Leu, X-Ile, X-Met, X-Val, X-Ala
  • 65.
    ESI Liquid flow Q orIon Trap analyzer ESI is a solution technique that gives a continuous stream of ions, best for quadrupoles, ion traps, etc. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ++ + ++ + ++ + ++ + ++ + ++ + ++ + ++ + ++ + ++ + ++ + ++ + ++ + ++ + ++ + ++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + MALDI 3 nS LASER PULSE Sample (solid) on target at high voltage/ high vacuum MALDI is a solid-state technique that gives ions in pulses, best suited to time-of-flight MS. TOF analyzer Atmosphere Low vac. High vac. High vacuum
  • 67.
    ….MALDI or Electrospray? MALDI is limited to solid state, ESI to liquid ESI is better for the analysis of complex mixture as it is directly interfaced to a separation techniques (i.e. HPLC or CE) MALDI is more “flexible” (MW from 200 to 400,000 Da)
  • 68.
    Q2 Collision Cell Q3 I II III Correlative sequence database searching TheoreticalAcquired Protein identification Peptides 1D, 2D, 3D peptide separation 200 400 600 80010001200 m/z 200 400 600 80010001200 m/z 200 400 600 80010001200 m/z 12 14 16 Time (min) Tandem mass spectrum Protein Identification Strategy Q1 * * Protein mixture 10-Mar-200514:28:10 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 m/z 0 100 % CAL050310A 71 (1.353) Cm (1:96) TOF MSMS785.60ES+ 2.94e3 684.17 333.15 187.07 175.12 169.06 246.13 286.11 480.16 382.11 480.08 497.09 627.17 612.08 498.09 813.16 785.62 685.18 740.09 1285.14 1056.17 942.16 814.17 924.16 943.17 1039.13 1038.17 1171.14 1057.18 1058.17 1172.15 1173.16 1286.14 1287.13 1296.10 10-Mar-200514:28:10 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 m/z 0 100 % CAL050310A 71 (1.353) Cm (1:96) TOF MSMS785.60ES+ 2.94e3 684.17 333.15 187.07 175.12 169.06 246.13 286.11 480.16 382.11 480.08 497.09 627.17 612.08 498.09 813.16 785.62 685.18 740.09 1285.14 1056.17 942.16 814.17 924.16 943.17 1039.13 1038.17 1171.14 1057.18 1058.17 1172.15 1173.16 1286.14 1287.13 1296.10 10-Mar-200514:28:10 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 m/z 0 100 % CAL050310A 71 (1.353) Cm (1:96) TOF MSMS785.60ES+ 2.94e3 684.17 333.15 187.07 175.12 169.06 246.13 286.11 480.16 382.11 480.08 497.09 627.17 612.08 498.09 813.16 785.62 685.18 740.09 1285.14 1056.17 942.16 814.17 924.16 943.17 1039.13 1038.17 1171.14 1057.18 1058.17 1172.15 1173.16 1286.14 1287.13 1296.10
  • 69.
    Breaking Protein intoPeptides and Peptides into Fragment Ions •Proteases, e.g. trypsin, break protein into peptides •MS/MS breaks the peptides down into fragment ions and measures the mass of each piece •MS measure m/z ratio of an ion
  • 70.
    Peptide fragmentation Amino acids differin their side chains Predominant fragmentation Weakest bonds
  • 71.
    Tendency of peptidesto fragment at Asp (D) Mass Spectrometry in Proteomics Ruedi Aebersold* and David R. Goodlett 269 Chem. Rev. 2001, 101, 269-295 C-terminal side of Asp
  • 72.
    Protein Identification byMS Artificial spectra built Artificially trypsinated Database of sequences (i.e. SwissProt) Spot removed from gel Fragmented using trypsin Spectrum of fragments generated MATCH Library
  • 73.
    Conclusions • MS ofpeptides enables high throughput identification and characterization of proteins in biological systems • “de novo sequencing” can be used to identify unknown proteins not found in protein databases
  • 74.
  • 75.
    • The rapidincrease of publicly available sequences and protein structures means that an increasing amount of information can be obtained for any protein sequence through its relatedness to others. • If a set of homologous proteins can be found and aligned, the information content at each position in the alignment profile is far greater than in any single member of the family, and any structural or functional prediction algorithm should utilize this collective information. Profile information of this type is extremely sensitive to the quality of the multiple alignment, and distant homologues should only be included in the alignment if they can be aligned with confidence.
  • 77.
  • 78.
    DNA template strand TRANSCRIPTION mRNA TRANSLATION Protein Amino acid Codon Trp PheGly 5 5 Ser U U U U U 3 3 5 3 G G G G C C T C A A A A A A A T T T T T G G G G C C C G G DNA molecule Gene 1 Gene 2 Gene 3 C C
  • 79.
    • Protein structureprediction is the inference of the three-dimensional structure of a protein from its amino acid sequence—that is, the prediction of its folding and its secondary and tertiary structure from its primary structure. • Structure prediction is fundamentally different from the inverse problem of protein design.
  • 80.
    • Protein structureprediction is one of the most important goals pursued by bioinformatics and theoretical chemistry; it is highly important in medicine (for example, in drug design) and biotechnology (for example, in the design of novel enzymes).
  • 81.
    •Protein structure andterminology • Proteins are chains of amino acids joined together by peptide bonds. Many conformations of this chain are possible due to the rotation of the chain about each Cα atom. It is these conformational changes that are responsible for differences in the three dimensional structure of proteins. • Each amino acid in the chain is polar, i.e. it has separated positive and negative charged regions with a free C=O group, which can act as hydrogen bond acceptor and an NH group, which can act as hydrogen bond donor. These groups can therefore interact in the protein structure. • The 20 amino acids can be classified according to the chemistry of the side chain which also plays an important structural role. Glycine takes on a special position, as it has the smallest side chain, only one Hydrogen atom, and therefore can increase the local flexibility in the protein structure. Cysteine on the other hand can react with another cysteine residue and thereby form a cross link stabilizing the whole structure.
  • 82.
    • The proteinstructure can be considered as a sequence of secondary structure elements, such as α helices and β sheets, which together constitute the overall three-dimensional configuration of the protein chain. In these secondary structures regular patterns of H bonds are formed between neighboring amino acids, and the amino acids have similar Φ and Ψ angles. • Bond angles for ψ and ω • The formation of these structures neutralizes the polar groups on each amino acid. The secondary structures are tightly packed in the protein core in a hydrophobic environment. Each amino acid side group has a limited volume to occupy and a limited number of possible interactions with other nearby side chains, a situation that must be taken into account in molecular modeling and alignments. [
  • 83.
    • α Helix •The α helix is the most abundant type of secondary structure in proteins. The α helix has 3.6 amino acids per turn with an H bond formed between every fourth residue; the average length is 10 amino acids (3 turns) or 10 Å but varies from 5 to 40 (1.5 to 11 turns). • The alignment of the H bonds creates a dipole moment for the helix with a resulting partial positive charge at the amino end of the helix. Because this region has free NH2 groups, it will interact with negatively charged groups such as phosphates. • The most common location of α helices is at the surface of protein cores, where they provide an interface with the aqueous environment. The inner-facing side of the helix tends to have longer helices, forming a bend.
  • 84.
    • hydrophobic aminoacids and the outer-facing side hydrophilic amino acids. • Thus, every third of four amino acids along the chain will tend to be hydrophobic, a pattern that can be quite readily detected. In the leucine zipper motif, a repeating pattern of leucines on the facing sides of two adjacent helices is highly predictive of the motif.
  • 86.
    • β sheet •β sheets are formed by H bonds between an average of 5–10 consecutive amino acids in one portion of the chain with another 5–10 farther down the chain. • The interacting regions may be adjacent, with a short loop in between, or far apart, with other structures in between. Every chain may run in the same direction to form a parallel sheet, every other chain may run in the reverse chemical direction to form an anti parallel sheet, or the chains may be parallel and anti parallel to form a mixed sheet.
  • 87.
    • The patternof H bonding is different in the parallel and anti parallel configurations. Each amino acid in the interior strands of the sheet forms two H bonds with neighboring amino acids, whereas each amino acid on the outside strands forms only one bond with an interior strand • . Looking across the sheet at right angles to the strands, more distant strands are rotated slightly counterclockwise to form a left-handed twist. The Cα atoms alternate above and below the sheet in a pleated structure, and the R side groups of the amino acids alternate above and below the pleats.
  • 88.
    • Loop • Loopsare regions of a protein chain that are • (1) between α helices and β sheets, • (2) of various lengths and three-dimensional configurations, and • (3) on the surface of the structure. • Hairpin loops that represent a complete turn in the polypeptide chain joining two antiparallel β strands may be as short as two amino acids in length.
  • 89.
    • Loops interactwith the surrounding aqueous environment and other proteins. Because amino acids in loops are not constrained by space and environment as are amino acids in the core region, and do not have an effect on the arrangement of secondary structures in the core, more substitutions, insertions, and deletions may occur. Thus, in a sequence alignment, the presence of these features may be an indication of a loop.
  • 90.
    • The positionsof introns in genomic DNA sometimes correspond to the locations of loops in the encoded protein[ . Loops also tend to have charged and polar amino acids and are frequently a component of active sites. A detailed examination of loop structures has shown that they fall into distinct families.
  • 91.
    • Coils • Aregion of secondary structure that is not a α helix, a β sheet, or a recognizable turn is commonly referred to as a coil.
  • 92.
  • 93.
    • In Functionalgenomics: functional genomics is a field of molecular biology that attempts to make use of the vast wealth of data produced by genomic and transcriptomic projects (such as genome sequencing projects and RNA sequencing) to describe gene (and protein) functions and interactions. Unlike genomics, functional genomics focuses on the dynamic aspects such as
  • 94.
    • gene transcription,translation, regulation of gene expression and protein–protein interactions, as opposed to the static aspects of the genomic information such as DNA sequence or structures.
  • 95.
    • The goalof functional genomics is to understand the relationship between an organism's genome and its phenotype. The term functional genomics is often used broadly to refer to the many possible approaches to understanding the properties and function of the entirety of an organism's genes and gene products.
  • 96.
    • The promiseof functional genomics is to expand and synthesize genomic and proteomic knowledge into an understanding of the dynamic properties of an organism at cellular and/or organismal levels. This would provide a more complete picture of how biological function arises from the information encoded in an organism's genome. • The possibility of understanding how a particular mutation leads to a given phenotype has important implications for human genetic diseases, as answering these questions could point scientists in the direction of a treatment or cure.
  • 97.
    Prediction of protein function from protein sequence and structure • The sequence of a genome contains the plans of the possible life of an organism, but implementation of genetic information depends on the functions of the proteins and nucleic acids that it encodes. Many individual proteins of known sequence and structure present challenges to the understanding of their function.
  • 98.
    • In particular, a number of genes responsible for diseases have been identified but their specific functions are unknown. Whole-genome sequencing projects are a major source of proteins of unknown function. Annotation of a genome involves assignment of functions to gene products, in most cases on the basis of amino-acid sequence alone.
  • 99.
    3D structure can aid the assignment of function, motivating the challenge of structural genomics projects to make structural information available for novel uncharacterized proteins. Structure-based identification of homologues often succeeds where sequence-alone-based methods fail, because in many cases evolution retains the folding pattern long after sequence similarity becomes undetectable.
  • 100.
    • Nevertheless, prediction of protein function from sequence and structure is a difficult problem, because homologous proteins often have different functions. Many methods of function prediction rely on identifying similarity in sequence and/or structure between a protein of unknown function and one or more well-understood proteins. Alternative methods include inferring conservation patterns in members of a functionally uncharacterized family for which many sequences and structures are known.
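Similarity-based function prediction can be illustrated with a small Python sketch that scores pre-aligned sequence pairs by percent identity and proposes the label of the best-scoring annotated protein. Everything here is hypothetical: the function names, the toy sequences and labels, and the 40% cutoff are assumptions for illustration, and (as noted above) a high-identity hit yields only a hypothesis, since homologous proteins can differ in function.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gapped columns
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def suggest_function(hits, min_identity=40.0):
    """Return (label, identity) for the best hit at or above the cutoff, else None.

    `hits` is a list of (label, aligned_query, aligned_reference) tuples.
    The 40% cutoff is an arbitrary illustrative value.
    """
    best = None
    for label, aligned_query, aligned_ref in hits:
        identity = percent_identity(aligned_query, aligned_ref)
        if identity >= min_identity and (best is None or identity > best[1]):
            best = (label, identity)
    return best


# Toy pre-aligned pairs with made-up labels.
toy_hits = [
    ("kinase (toy label)", "MKVLAGTEEL", "MKVLSGTEEL"),
    ("transporter (toy label)", "MKVLAGTEEL", "MRALSGQDEL"),
]

print(suggest_function(toy_hits))  # ('kinase (toy label)', 90.0)
```

In practice the alignments would come from a dedicated search tool, and the transferred label would be checked against conservation of known functional residues.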
  • 101.
    In Proteomics • Proteomics is the large-scale study of proteomes. A proteome is a set of proteins produced in an organism, system, or biological context. The proteome is not constant; it differs from cell to cell and changes over time. To some degree, the proteome reflects the underlying transcriptome. However, protein activity (often assessed by the reaction rate of the processes in which the protein is involved) is also modulated by many factors in addition to the expression level of the relevant gene.
  • 102.
    Protein sequencing denotes the process of finding the amino acid sequence, or primary structure, of a protein. Sequencing plays a vital role in proteomics, as the information obtained can be used to deduce function, structure, and location, which in turn aids in identifying novel proteins and in understanding cellular processes. A better understanding of these processes allows, among other things, the creation of drugs that target specific metabolic pathways.
  • 103.
    In Bioinformatics • What is bioinformatics? In recent years, molecular biology has witnessed an information revolution as a result of the development of rapid DNA sequencing techniques and the corresponding progress in computer-based technologies, which are allowing us to cope with this information deluge in increasingly efficient ways. The term that was coined to encompass computer applications in biological sciences is bioinformatics.
  • 104.
    The term bioinformatics is now used to mean rather different things, from artificial intelligence and robotics to genome analysis. The term was originally applied to the computational manipulation and analysis of biological sequence data (DNA and/or protein), but now tends also to be used to embrace the manipulation and analysis of 3D structural data.
  • 105.
    Identifying protein-coding genes in genomic sequences • The vast majority of the biology of a newly sequenced genome is inferred from the set of encoded proteins. Predicting this set is therefore invariably the first step after the completion of the genome DNA sequence. The genome sequence is an organism's blueprint: the set of instructions dictating its biological traits. The unfolding of these instructions is initiated by the transcription of the DNA into RNA sequences. According to the standard model, the majority of RNA sequences originate from protein-coding genes; that is, they are processed into messenger RNAs (mRNAs) which, after their export to the cytosol, are translated into proteins.
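A very rough picture of how a candidate protein set can be read from a DNA sequence is an open reading frame (ORF) scan: look for a start codon, translate codon by codon, and stop at the first stop codon. The Python sketch below is a deliberately naive version under stated assumptions: it uses the standard genetic code, scans only the forward strand, ignores introns and splicing, and the function names, the `min_codons` parameter, and the example sequence are invented for illustration. Real gene finders are far more elaborate.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_orf(dna, start):
    """Translate from `start` up to the first stop codon; None if no stop is found."""
    protein = []
    for j in range(start, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[j:j + 3], "X")  # 'X' marks an unknown codon
        if aa == "*":
            return "".join(protein)
        protein.append(aa)
    return None


def find_orfs(dna, min_codons=3):
    """Return proteins for ATG-to-stop ORFs in the three forward reading frames."""
    dna = dna.upper()
    proteins = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(dna):
            if dna[i:i + 3] == "ATG":
                protein = translate_orf(dna, i)
                if protein is not None:
                    if len(protein) >= min_codons:
                        proteins.append(protein)
                    i += 3 * (len(protein) + 1)  # jump past the stop codon
                    continue
            i += 3
    return proteins


print(find_orfs("CCATGAAAGTTCTGGCTTAAGG"))  # ['MKVLA']
```

The example sequence contains a single ORF (ATG AAA GTT CTG GCT TAA) in the third forward frame, which translates to MKVLA.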
  • 106.
    • To determine protein folding • Protein folding is the process by which a protein structure assumes its functional shape or conformation. • Protein folding is the physical process by which a protein chain acquires its native three-dimensional structure, a conformation that is usually biologically functional, in an expeditious and reproducible manner. • It is the physical process by which a polypeptide folds into its characteristic and functional three-dimensional structure from a random coil. Each protein exists as an unfolded polypeptide or random coil when translated from a sequence of mRNA to a linear chain of amino acids.
  • 107.
    • All protein molecules are heterogeneous unbranched chains of amino acids. • By coiling and folding into a specific three-dimensional shape they are able to perform their biological function. • Proteins are formed from long chains of amino acids; they exist in an array of different structures which often dictate their functions. Proteins follow energetically favorable pathways to form stable, orderly structures; this is known as the protein's native structure. • Most proteins can only perform their various functions when they are folded. Scientists believe that the instructions for folding a protein are encoded in the sequence. Researchers can readily determine the sequence of a protein, but have not cracked the code that governs folding.
  • 108.
    In Drug Production • What is a protein drug? A type of drug made of protein. These drugs usually have a large molecular weight and protein characteristics. structure of an unusual class of proteins called beta-peptides. Eventually, these peptides could become the basis for drugs that are cheaper to manufacture than existing protein-based pharmaceuticals and last longer in the body. A drug's efficiency may be affected by the degree to which it binds to the proteins within blood plasma. The less bound a drug is, the more efficiently it can traverse cell membranes or diffuse. Common blood proteins that drugs bind to are human serum albumin, lipoprotein, glycoprotein, and α, β, and γ globulins.

Editor's Notes

  • #77 Figure 17.3 Overview: the roles of transcription and translation in the flow of genetic information.
  • #78 Figure 17.4 The triplet code.