Protein sequencing

Introduction to Protein
Sequencing

What is a protein?
• Any of a class of nitrogenous organic
compounds which have large molecules
composed of one or more long chains of
amino acids and are an essential part of all
living organisms, especially as structural
components of body tissues such as muscle,
hair, etc., and as enzymes and antibodies.

What is a sequence?
• a particular order in which related things
follow each other.
• a set of related events, movements, or items
that follow each other in a particular order.

Protein Sequencing
• Protein sequencing is the practical process of
determining the amino acid sequence of all or
part of a protein or peptide. This may serve to
identify the protein or characterize its post-
translational modifications.

• Typically, partial sequencing of a protein
provides sufficient information (one or more
sequence tags) to identify it with reference to
databases of protein sequences derived from
the conceptual translation of genes.

• The two major direct methods of protein
sequencing are mass spectrometry and Edman
degradation using a protein
sequenator (sequencer). Mass spectrometry
methods are now the most widely used for
protein sequencing and identification but
Edman degradation remains a valuable tool
for characterizing a protein's N-terminus.

Why do we sequence proteins?
• Determining amino acid composition.
• It is often desirable to know the unordered
amino acid composition of a protein prior to
attempting to find the ordered sequence, as
this knowledge can be used to facilitate the
discovery of errors in the sequencing process
or to distinguish between ambiguous results.

• Knowledge of the frequency of certain amino
acids may also be used to choose
which protease to use for digestion of the
protein. The misincorporation of low levels of
non-standard amino acids (e.g. norleucine) into
proteins may also be determined.
• A generalized method often referred to as amino
acid analysis for determining amino acid
frequency is as follows:
• Hydrolyse a known quantity of protein into its
constituent amino acids.
• Separate and quantify the amino acids in some
way.
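
The composition idea can be illustrated in code. The sketch below is only an illustration: it counts residues in a sequence that is already known, whereas amino acid analysis measures the same fractions experimentally. The function name and the example sequence are assumptions made for this example.

```python
from collections import Counter

def amino_acid_composition(sequence):
    """Return {residue: fraction} for a one-letter protein sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence is empty")
    counts = Counter(seq)
    total = len(seq)
    # Order is discarded: only how often each residue occurs is kept.
    return {aa: n / total for aa, n in sorted(counts.items())}

# Example sequence chosen for illustration only.
composition = amino_acid_composition("GIVEQCCTSICSLYQLENYCN")
for residue, fraction in composition.items():
    print(residue, round(fraction, 3))
```

A measured composition that disagrees with the composition of a proposed sequence is one way to flag a sequencing error.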

• Hydrolysis
• Hydrolysis is done by heating a sample of the
protein in 6 M hydrochloric acid to 100–110 °C
for 24 hours or longer. Proteins with many
bulky hydrophobic groups may require longer
heating periods. However, these conditions
are so vigorous that some amino acids
(serine, threonine, tyrosine, tryptophan, glutamine,
and cysteine) are degraded.

• To circumvent this problem, Biochemistry Online suggests heating separate
samples for different times, analysing each
resulting solution, and extrapolating back to zero
hydrolysis time. Rastall suggests a variety of
reagents to prevent or reduce degradation, such
as thiol reagents or phenol to protect tryptophan
and tyrosine from attack by chlorine, and pre-
oxidising cysteine. He also suggests measuring
the quantity of ammonia evolved to determine
the extent of amide hydrolysis.
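
The extrapolation back to zero hydrolysis time can be sketched as a straight-line fit. This is a minimal sketch that assumes the loss of a labile amino acid is roughly linear over the sampled times; the function name and the numbers are hypothetical.

```python
def extrapolate_to_zero(times_h, amounts):
    """Least-squares line through (time, amount); returns (intercept, slope).

    The intercept estimates the amount present before any degradation.
    """
    n = len(times_h)
    mean_t = sum(times_h) / n
    mean_a = sum(amounts) / n
    sxx = sum((t - mean_t) ** 2 for t in times_h)
    sxy = sum((t - mean_t) * (a - mean_a) for t, a in zip(times_h, amounts))
    slope = sxy / sxx
    intercept = mean_a - slope * mean_t
    return intercept, slope

# Hypothetical serine recoveries (nmol) after 24, 48 and 72 h of hydrolysis.
intercept, slope = extrapolate_to_zero([24, 48, 72], [9.0, 8.0, 7.0])
print("estimated amount at t = 0:", round(intercept, 2))
```

If the loss follows first-order kinetics instead, the same fit could be applied to the logarithm of the amounts.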

• Separation and quantitation
• The amino acids can be separated by ion-
exchange chromatography then derivatized to
facilitate their detection. More commonly, the
amino acids are derivatized then resolved
by reversed phase HPLC.

• An example of the ion-exchange chromatography
is given by the NTRC using sulfonated polystyrene
as a matrix, adding the amino acids in acid
solution and passing a buffer of steadily
increasing pH through the column. Amino acids
are eluted when the pH reaches their
respective isoelectric points. Once the amino
acids have been separated, their respective
quantities are determined by adding a reagent
that will form a coloured derivative.

• The advent of protein sequencing can be
traced to two almost parallel discoveries by
Frederick Sanger and Pehr Edman.
• In 1950, Pehr Edman published a paper
demonstrating a label-cleavage method for
protein sequencing which was later termed
“Edman degradation”.

• Pehr Edman began his work in the Northrop-
Kunitz laboratory at the Princeton branch of the
Rockefeller Institute of Medical Research in 1947,
where he attempted to find a method to decode
the amino acid sequence of a protein using
chemicals; specifically, he had early success with
fluorodinitrobenzene (FDNB) and
phenylisothiocyanate (PITC).

• Throughout his year at Princeton, Edman was able to
conduct enough experiments to understand that it was
feasible to use reagents like FDNB and PITC to determine
amino acid sequence.
• Edman returned to Sweden in 1947 and after two more
years of work he was able to publish his paper that would
describe the first successful method to sequence proteins
[1]
• This groundbreaking paper described a method to
determine the amino acid sequence of a protein and would
come to be known as the Edman Degradation.

F.SANGER
• Around the same time Fred Sanger was
developing his own labeling and separation
method which led to the sequencing of
insulin.
• For this work, Sanger was awarded the 1958
Nobel Prize for Chemistry.

Plus and minus in the 1970s
• Fast-forward to the 1970s and we find Fred
Sanger still at the forefront of sequencing, now of nucleic acids.
• In 1975, whilst at the Laboratory of Molecular Biology in
Cambridge, Fred Sanger developed the “plus and minus”
method for DNA sequencing (Sanger and Coulson, 1975).
• Again there was competition in the field, with Maxam and
Gilbert working on degradation sequencing (Maxam and
Gilbert, 1977); however, their method was ultimately to
falter due to the ease and quality of the Sanger method.

Plus and minus method
• A primer is extended by a polymerase to generate a population of
newly synthesized polydeoxyribonucleotides of assorted lengths; the
unused dNTPs are removed, and polymerization continues in four
pairs of plus and minus reaction mixtures; the minus mixtures have
three dNTPs and the plus mixtures have only one.
• After a second polymerization, the mixtures are fractionated by gel
electrophoresis, and each plus and minus pair is compared to
indicate the length of the new polydeoxyribonucleotide (by the
mobilities of the bands) and the position at which polymerization
had terminated as a result of the absence of the missing dNTP.

• Five years before Edman's 1950 paper, Frederick Sanger had demonstrated a
method to determine the amino acid residue located
on the N-terminal end of a polypeptide chain by using
the reagent fluorodinitrobenzene.
• While it was thought that, at most, this method could
only provide the residues found at the N-terminus,
Sanger was able to take the method one step further.

• By using several proteolytic enzymes, partial
hydrolysis and an early version of chromatography,
Sanger was able to cleave the protein into
fragments and piece together the residues like a
jigsaw puzzle.
• It wasn’t until 1955 that Sanger was able to
present the complete sequence of insulin which
led to him being awarded a Nobel Prize in
Chemistry in 1958.

Other scientists
• Emile Zuckerkandl and Linus Pauling, whose
work in the mid-1960s advanced the use of
nucleotide and protein sequences to explore
evolution.
• In the 1970s, Carl Woese used ribosomal RNA
sequences to define archaebacteria as a group
of living organisms distinct from other
bacteria and eukaryotes.

Protein sequencing
• Technique to find out the sequence of amino
acids in a protein
Sequencing methods
1. N-terminal sequencing (Edman degradation)
2. C-terminal sequencing
3. Prediction from DNA sequence

Edman degradation
N-terminal sequencing

STEPS
• Protein purification
• Protein denaturation
• Protein digestion
• N-terminal labeling
• Separation of labeled amino acid by
chromatography
• Detection through mass spectrometry
• Data analysis

Protein isolation (purification)
• 1. SDS-PAGE (sodium dodecyl sulfate-polyacrylamide
gel electrophoresis)
• 2. Two-dimensional gels
• The protein of interest is immobilized by being
adsorbed onto a chemically modified glass or by
electroblotting onto a porous polyvinylidene fluoride
(PVDF) membrane.

Protein hydrolysis (denaturation)
• Hydrolysis is done by heating a sample of the
protein in 6 M HCl to 100–110 °C
for 24 hours or longer.
• These conditions may degrade some amino acids.
• To avoid this, thiol reagents or phenol are used.
• Performic acid is used to oxidise intra-chain or
inter-chain S-S bonds.

Protein digestion
• Use Endoproteinase Lys-C, CNBr, Pepsin or
trypsin to digest proteins into a population of
peptides
• Other enzymes include Glu-C and
chymotrypsin
• Add enzyme at a 1:20 enzyme:protein ratio
• Incubate at room temperature for 6–9 h
• For better results, use a mixture of enzymes

N-terminal labeling
• The Edman reagent, phenylisothiocyanate (PITC), is
added to the adsorbed peptide, together with a mildly
basic buffer solution of 12% trimethylamine
• This reacts with the amine group of the N-terminal
amino acid
• The terminal amino acid can then be selectively
detached by the addition of anhydrous acid
• The derivative then isomerises to give a substituted
phenylthiohydantoin which can be washed off and
identified by chromatography, and the cycle can be
repeated
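
The cyclic nature of the procedure above can be sketched in code. This is only a toy model of the bookkeeping, not of the chemistry: each cycle labels and removes the current N-terminal residue and leaves a peptide one residue shorter. The function name and the example peptide are assumptions.

```python
def edman_cycles(peptide, n_cycles=None):
    """Yield (cycle, released_residue, remaining_peptide) for each cycle."""
    remaining = peptide
    limit = len(peptide) if n_cycles is None else min(n_cycles, len(peptide))
    for cycle in range(1, limit + 1):
        # The N-terminal residue is released as its PTH derivative and identified.
        released, remaining = remaining[0], remaining[1:]
        yield cycle, released, remaining

for cycle, residue, rest in edman_cycles("GIVEQ"):
    print(cycle, residue, rest)
```

Reading the released residues in cycle order gives the sequence from the N-terminus.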

CHROMATOGRAPHY
• Chromatography is a technique in which
molecules are separated based on volatility and bond
characteristics when subjected to a carrier
• Derivatives of amino acids can be separated by
  1. HPLC
  2. Gas chromatography
• In gas chromatography (GC), the mobile phase is an inert
gas such as helium

MASS SPECTROMETRY
• Mass spectrometry (MS) is an analytical
technique that measures the mass-to-charge
ratio of charged particles
• The MS principle consists of ionizing chemical
compounds to generate charged molecules or
molecule fragments and measuring their
mass-to-charge ratios
• Separated amino acid derivatives are analyzed
by mass spectrometer
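
The mass-to-charge ratio can be made concrete with a small calculation. The sketch below assumes positive-mode ions formed by adding protons, written [M+zH]z+; the function name is hypothetical and other ion types (for example radical cations from electron impact) would need a different formula.

```python
PROTON_MASS = 1.007276  # Da, mass of a proton

def mz(neutral_mass, charge):
    """m/z of an ion formed by adding `charge` protons to a neutral molecule."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge

# The same 1000 Da molecule appears at different m/z for different charges.
for z in (1, 2, 3):
    print(z, round(mz(1000.0, z), 4))
```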

MS procedure
• A sample is loaded onto the MS instrument, and
undergoes vaporization
• The components of the sample are ionized by one of a
variety of methods (e.g., by impacting them with an
electron beam), which results in the formation of
charged particles (ions)
• The ions are separated according to their mass-to-
charge ratio in an analyzer by electromagnetic fields
• The ions are detected, usually by a quantitative
method
• The ion signal is processed into mass spectra

MS data analysis
• The first strategy for identifying an unknown
compound is to compare its experimental mass
spectrum against a library of mass spectra
• Standard solutions of amino acids are also used,
and the resulting pattern is compared with the
standard spectrum.

Limitations of Edman degradation
• Needs pure samples of peptides
• Requires 40–60 min per amino acid
• Cannot analyze N-terminally modified peptides
Advantages
• Most reliable sequencing technique

C terminal
Definition:
The C-terminus (also known as
the carboxyl-terminus, carboxy-terminus, C-
terminal tail, C-terminal end, or COOH-
terminus) is the end of an amino acid chain
(protein or polypeptide), terminated by a
free carboxyl group (-COOH).

C-terminal retention signals
• Proteins are naturally synthesized starting from
the N-terminus and ending at the C-terminus.
• While the N-terminus of a protein often
contains targeting signals, the C-terminus can
contain retention signals for protein sorting.
• The most common ER retention signal is the
amino acid sequence -KDEL (Lys-Asp-Glu-Leu)
or -HDEL (His-Asp-Glu-Leu) at the C-terminus.
This keeps the protein in the endoplasmic
reticulum and prevents it from entering
the secretory pathway.

C-terminal modifications
• The C-terminus of proteins can be modified post-
translationally, most commonly by the addition of
a lipid anchor to the C-terminus that allows the
protein to be inserted into a membrane without
having a transmembrane domain.
• Another form of C-terminal modification is the
addition of a
phosphoglycan, glycosylphosphatidylinositol (GPI),
as a membrane anchor. The GPI anchor is attached
to the C-terminus after proteolytic cleavage of a C-
terminal propeptide. The most prominent example
of this type of modification is the prion protein.

C-terminal domain:
• The C-terminal domain of some proteins has
specialized functions. In humans, the CTD
of RNA polymerase II typically consists of up to
52 repeats of the sequence Tyr-Ser-Pro-Thr-
Ser-Pro-Ser.[1] This allows other proteins to
bind to the C-terminal domain of RNA
polymerase in order to activate polymerase
activity. These domains are then involved in
the initiation of DNA transcription.

C-terminal sequencing technique
• Top-down sequencing by MALDI ISD is used to
sequence the C-terminus of an amino acid chain.
• MALDI MS: “matrix-assisted laser
desorption/ionization mass spectrometry”,
through which the C-terminus can be analyzed.
• This method is used when the N-terminus is
blocked and only the C-terminus is available.
• The technique can fragment and sequence
both the N- and C-terminal in the same mass
spectrum.

• Edman degradation is only used for N-
terminal sequencing.
• The most common method for C-terminal sequencing is to add
carboxypeptidases to a solution of the
protein.
• Take samples at regular intervals
and determine the terminal amino acid by
analyzing a plot of amino acid concentration
against time.
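
The time-course analysis can be sketched as follows. This is a rough illustration, assuming that residues nearer the C-terminus are released earlier and that "time to reach half of the final level" is an acceptable proxy for release order; the function name, the threshold and the data are all hypothetical.

```python
def release_order(time_points, concentrations, threshold=0.5):
    """Order residues by the first time each reaches `threshold` x its final level."""
    reached = []
    for residue, series in concentrations.items():
        final = series[-1]
        t_reach = next(t for t, c in zip(time_points, series)
                       if c >= threshold * final)
        reached.append((t_reach, residue))
    return [residue for _, residue in sorted(reached)]

# Hypothetical free amino acid levels (relative units) sampled over time (min).
times = [0, 5, 10, 20, 40]
levels = {
    "Leu": [0.0, 0.6, 0.9, 1.0, 1.0],
    "Ala": [0.0, 0.1, 0.5, 0.9, 1.0],
    "Gly": [0.0, 0.0, 0.1, 0.5, 1.0],
}
order = release_order(times, levels)
print("release order (C-terminus first):", order)
```

In practice the rates of release differ between residues, so a real analysis would interpret the curves with more care than this ranking does.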

Use of peptidase:
• A peptide mixture is generated by cleavage of the
protein with cyanogen bromide and is incubated
with carboxypeptidase Y.
• The enzyme is only able to act on the C-terminal
fragment, because this is the only peptide without a
homoserine lactone residue at its C-terminus.
• The resulting fragments, forming a peptide ladder,
are analyzed by matrix-assisted laser
desorption/ionization mass spectrometry (MALDI-
MS).
• The entire protocol, including the CNBr cleavage,
takes 21 h and can be applied to proteins purified
either by SDS-PAGE or by 2D PAGE or in solution.
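
Reading a sequence from a peptide ladder can be sketched in code: each mass difference between neighbouring ladder peaks should correspond to one removed residue. The sketch below assumes monoisotopic residue masses and a fixed tolerance; the function name, the tolerance and the example masses are hypothetical.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
MONO = {
    "A": 71.03711, "C": 103.00919, "D": 115.02694, "E": 129.04259,
    "F": 147.06841, "G": 57.02146, "H": 137.05891, "I": 113.08406,
    "K": 128.09496, "L": 113.08406, "M": 131.04049, "N": 114.04293,
    "P": 97.05276, "Q": 128.05858, "R": 156.10111, "S": 87.03203,
    "T": 101.04768, "V": 99.06841, "W": 186.07931, "Y": 163.06333,
}

def read_ladder(masses, tol=0.02):
    """Return residues in order of removal (C-terminus first)."""
    ordered = sorted(masses, reverse=True)
    residues = []
    for heavier, lighter in zip(ordered, ordered[1:]):
        delta = heavier - lighter
        matches = [aa for aa, m in MONO.items() if abs(m - delta) <= tol]
        # Isobaric residues (I and L) cannot be told apart by mass alone.
        residues.append("/".join(matches) if matches else "?")
    return residues

full = 1500.0  # hypothetical mass of the intact C-terminal fragment
ladder = [full, full - MONO["L"], full - MONO["L"] - MONO["A"]]
print(read_ladder(ladder))
```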

Top down sequencing:
• Top-down proteomics is a method
of protein identification that uses an ion trapping mass
spectrometer to store an isolated protein ion for mass
measurement and tandem mass spectrometry analysis.
• Top-down proteomics is capable of identifying and
quantitating unique proteoforms through the analysis
of intact protein.
• Top-down proteomics interrogates protein structure
through measurement of an intact mass followed by
direct ion dissociation in the gas phase.

• Fragmentation for tandem mass spectrometry
is accomplished by electron-capture
dissociation or electron-transfer dissociation.
Effective fractionation is critical for sample
handling before mass-spectrometry-based
proteomics. Proteome analysis routinely
involves digesting intact proteins followed by
inferred protein identification using mass
spectrometry.
• The main advantages of the top-down
approach include the ability to detect
degradation products, sequence variants, and
combinations of post-translational
modifications.

MALDI MS top down sequencing:
• 0.5–1 µl of salt-free protein solution placed on
a MALDI plate, covered with the MALDI matrix
solution, is analyzed in the in-source decay mode
on an UltrafleXtreme mass spectrometer. The
generated mass spectrum (a complex mass
spectrum, exhibiting mainly c- and y-ions) is
further analyzed with the software BioTools or is
processed via a Mascot search.
• The result output is a .pdf file showing the
sequence coverage of the target sequence.

UltrafleXtreme mass spectrometer
• The UTX is used for a
variety of MALDI
applications, including
mass spectrometry imaging
(MSI), protein
identification, peptide
fingerprinting, and
structure identification for
a wide spectrum of
biomolecules (including lipids,
polymers, glycans).

Peptide Sequencing by Mass
Spectrometry

Introduction
• MS/MS plays important role in protein identification (fast
and sensitive)
• Derivation of peptide sequence an important task in
proteomics
• Derivation without help from a protein database (“de novo
sequencing”), especially important in identification of
unknown protein

Basic lab experimental steps
1. Proteins digested w/ an enzyme to produce peptides
2. Peptides charged (ionized) and separated according
to their different m/z ratios
3. Each peptide fragmented into ions and m/z values of
fragment ions are measured
• Steps 2 and 3 performed within a tandem mass
spectrometer.

Mass spectrum
• Proteins consist of 20 different types of amino acids with
different masses (except for one pair, Leu and Ile, which share the same mass)
• Different peptides produce different spectra
• Use the spectrum of a peptide to determine its
sequence

Objectives
• Describe the steps of a typical peptide analysis
by MS (proteomic experiment)
• Explain peptide ionization, fragmentation,
identification

Why are peptides, and not proteins,
sequenced?
• Solubility under the same conditions
• Sensitivity of MS much higher for peptides
• MS efficiency

Choice of Enzyme
Cleaving agent/Protease     Specificity
A. HIGHLY SPECIFIC
Trypsin                     Arg-X, Lys-X
Endoproteinase Glu-C        Glu-X
Endoproteinase Lys-C        Lys-X
Endoproteinase Arg-C        Arg-X
Endoproteinase Asp-N        X-Asp
B. NONSPECIFIC
Chymotrypsin                Phe-X, Tyr-X, Trp-X, Leu-X
Thermolysin                 X-Phe, X-Leu, X-Ile, X-Met, X-Val, X-Ala
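
The specificities in the table can be turned into a small in-silico digestion. This is a minimal sketch that encodes only the rules listed above ("Arg-X" meaning cleavage after Arg, "X-Asp" meaning cleavage before Asp); it ignores missed cleavages and sequence-context exceptions, and the names `RULES` and `digest` are hypothetical.

```python
# enzyme -> (residues recognised, side of the residue that is cut)
RULES = {
    "Trypsin": ("KR", "after"),
    "Glu-C": ("E", "after"),
    "Lys-C": ("K", "after"),
    "Arg-C": ("R", "after"),
    "Asp-N": ("D", "before"),
    "Chymotrypsin": ("FYWL", "after"),
    "Thermolysin": ("FLIMVA", "before"),
}

def digest(sequence, enzyme):
    """Split a one-letter protein sequence at every site the enzyme recognises."""
    residues, side = RULES[enzyme]
    cuts = [i + 1 if side == "after" else i
            for i, aa in enumerate(sequence) if aa in residues]
    pieces, start = [], 0
    for cut in cuts:
        if start < cut < len(sequence):  # ignore cuts at the chain ends
            pieces.append(sequence[start:cut])
            start = cut
    pieces.append(sequence[start:])
    return pieces

print(digest("AKDRGEF", "Trypsin"))
print(digest("AKDRGEF", "Asp-N"))
```

Comparing the peptides produced by two enzymes on the same protein gives overlapping fragments that help to order the pieces.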

ESI
[Figure: liquid flow sprayed as charged droplets into a Q or ion trap analyzer; pressure regions labelled atmosphere, low vacuum, high vacuum]
ESI is a solution technique that gives a continuous stream of ions,
best for quadrupoles, ion traps, etc.

MALDI
[Figure: 3 ns laser pulse striking a solid sample on a target at high voltage/high vacuum, feeding a TOF analyzer under high vacuum]
MALDI is a solid-state technique that gives ions in pulses,
best suited to time-of-flight MS.

MALDI or electrospray?
• MALDI is limited to the solid state, ESI to liquid
• ESI is better for the analysis of complex mixtures, as it is directly interfaced to
separation techniques (e.g. HPLC or CE)
• MALDI is more “flexible” (MW from 200 to 400,000 Da)

Protein Identification Strategy
[Figure: (I) a protein mixture is digested into peptides, which undergo 1D, 2D or 3D peptide separation (time axis in min); (II) the peptides pass through a tandem mass spectrometer (Q1, Q2 collision cell, Q3) to give a tandem mass spectrum (m/z axis); (III) acquired spectra are compared with theoretical spectra by correlative sequence database searching to give the protein identification. Example spectrum shown: TOF MS/MS of precursor 785.60, ES+, m/z 100–1600.]

Breaking Protein into Peptides and Peptides into
Fragment Ions
• Proteases, e.g. trypsin, break protein into
peptides
• MS/MS breaks the peptides down into fragment
ions and measures the mass of each piece
• MS measures the m/z ratio of an ion
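
The fragment ions measured in MS/MS can be predicted for a known peptide. The sketch below computes singly charged b- and y-ion m/z values, assuming monoisotopic masses and the usual convention that b-ions carry the N-terminal residues and y-ions the C-terminal residues plus water; the function name and the example peptide are hypothetical.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
MONO = {
    "A": 71.03711, "C": 103.00919, "D": 115.02694, "E": 129.04259,
    "F": 147.06841, "G": 57.02146, "H": 137.05891, "I": 113.08406,
    "K": 128.09496, "L": 113.08406, "M": 131.04049, "N": 114.04293,
    "P": 97.05276, "Q": 128.05858, "R": 156.10111, "S": 87.03203,
    "T": 101.04768, "V": 99.06841, "W": 186.07931, "Y": 163.06333,
}
WATER = 18.010565   # Da
PROTON = 1.007276   # Da

def fragment_ions(peptide):
    """Return (b, y): singly charged m/z lists, one entry per peptide bond.

    b[i] and y[i] are the two complementary fragments from breaking the
    bond after residue i + 1.
    """
    masses = [MONO[aa] for aa in peptide]
    b, y = [], []
    for i in range(1, len(peptide)):
        b.append(sum(masses[:i]) + PROTON)
        y.append(sum(masses[i:]) + WATER + PROTON)
    return b, y

b_ions, y_ions = fragment_ions("GAK")
print([round(m, 4) for m in b_ions])
print([round(m, 4) for m in y_ions])
```

In de novo sequencing the logic runs the other way: differences between consecutive peaks of one ion series are matched to residue masses.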

Peptide fragmentation
• Amino acids differ in their side chains
• Predominant fragmentation occurs at the weakest bonds

Tendency of peptides to fragment at Asp (D)
• Peptides tend to fragment on the C-terminal side of Asp
• Source: Ruedi Aebersold and David R. Goodlett, “Mass Spectrometry in
Proteomics”, Chem. Rev. 2001, 101, 269-295

Protein Identification by MS
• Experimental: spot removed from gel, fragmented using trypsin,
spectrum of fragments generated
• Theoretical: database of sequences (e.g. SwissProt), artificially
trypsinated, artificial spectra built into a library
• MATCH: the experimental spectrum is compared against the library
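
The matching step can be sketched as a toy peptide-mass search. This is a simplified illustration, not how a production search engine scores spectra: it digests each database entry with a trypsin-like rule, computes peptide masses, and counts how many observed masses are explained. The database entries, names and tolerance are hypothetical.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
MONO = {
    "A": 71.03711, "C": 103.00919, "D": 115.02694, "E": 129.04259,
    "F": 147.06841, "G": 57.02146, "H": 137.05891, "I": 113.08406,
    "K": 128.09496, "L": 113.08406, "M": 131.04049, "N": 114.04293,
    "P": 97.05276, "Q": 128.05858, "R": 156.10111, "S": 87.03203,
    "T": 101.04768, "V": 99.06841, "W": 186.07931, "Y": 163.06333,
}
WATER = 18.010565  # Da

def tryptic_digest(sequence):
    """Cut after every K or R (context exceptions are not modelled)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        if aa in "KR":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides

def peptide_mass(peptide):
    return sum(MONO[aa] for aa in peptide) + WATER

def count_matches(observed, sequence, tol=0.01):
    theoretical = [peptide_mass(p) for p in tryptic_digest(sequence)]
    return sum(any(abs(o - t) <= tol for t in theoretical) for o in observed)

def best_match(observed, database, tol=0.01):
    """Name of the database entry explaining the most observed masses."""
    return max(database, key=lambda name: count_matches(observed, database[name], tol))

database = {"protA": "GASKVLR", "protB": "MKWFR"}           # hypothetical entries
observed = [peptide_mass("GASK"), peptide_mass("VLR")]      # simulated measurement
print(best_match(observed, database))
```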

Conclusions
• MS of peptides enables high throughput
identification and characterization of proteins in
biological systems
• “de novo sequencing” can be used to identify
unknown proteins not found in protein databases

• The rapid increase of publicly available sequences and protein
structures means that an increasing amount of information can be
obtained for any protein sequence through its relatedness to
others.
• If a set of homologous proteins can be found and aligned, the
information content at each position in the alignment profile is far
greater than in any single member of the family, and any structural
or functional prediction algorithm should utilize this collective
information. Profile information of this type is extremely sensitive
to the quality of the multiple alignment, and distant homologues
should only be included in the alignment if they can be aligned with
confidence.
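
The information content of each alignment position can be estimated from residue frequencies. The sketch below uses one common definition, log2(20) minus the Shannon entropy of the column, with no background frequencies or small-sample correction; the function name and the toy alignment are hypothetical.

```python
import math
from collections import Counter

def column_information(alignment):
    """Information content (bits) of each column of an aligned set of sequences."""
    info = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]  # gaps are ignored
        if not residues:
            info.append(0.0)
            continue
        total = len(residues)
        entropy = -sum((n / total) * math.log2(n / total)
                       for n in Counter(residues).values())
        info.append(math.log2(20) - entropy)
    return info

alignment = ["ACD", "ACE", "AGD", "AGE"]  # hypothetical aligned family
info = column_information(alignment)
for position, bits in enumerate(info, start=1):
    print(position, round(bits, 3))
```

A fully conserved column scores the maximum; a misaligned distant homologue lowers the score of every column it disrupts, which is why alignment quality matters.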

Figure 17.3b-3
[Figure: within the nuclear envelope, DNA is transcribed into pre-mRNA and processed into mRNA (TRANSCRIPTION, RNA PROCESSING); the mRNA is translated on the ribosome into a polypeptide (TRANSLATION).]

[Figure: a DNA molecule with genes 1 to 3; the DNA template strand of one gene is transcribed into mRNA, whose codons are translated into the amino acids Trp, Phe, Gly and Ser of the protein.]
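
Prediction of a protein sequence from DNA follows the flow from template strand to mRNA to amino acids, and can be sketched in a few lines. The code below assumes the standard genetic code; the function names are hypothetical, and the example template is an illustrative sequence chosen so that its codons encode Trp-Phe-Gly-Ser.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built by enumerating codons in T, C, A, G order.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def transcribe(template_3_to_5):
    """mRNA (5'->3') complementary to a template strand written 3'->5'."""
    pair = {"A": "U", "T": "A", "G": "C", "C": "G"}
    return "".join(pair[base] for base in template_3_to_5)

def translate(mrna):
    """Translate codon by codon from the first base until a stop codon."""
    dna = mrna.replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)

mrna = transcribe("ACCAAACCGAGT")
print(mrna, translate(mrna))
```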

• Protein structure prediction is the inference
of the three-dimensional structure of
a protein from its amino acid sequence—that
is, the prediction of its folding and
its secondary and tertiary structure from
its primary structure.
• Structure prediction is fundamentally
different from the inverse problem of protein
design.

• Protein structure prediction is one of the most
important goals pursued
by bioinformatics and theoretical chemistry; it is
highly important in medicine (for example, in drug
design) and biotechnology (for example, in the
design of novel enzymes).

• Protein structure and terminology
• Proteins are chains of amino acids joined together by peptide
bonds. Many conformations of this chain are possible due to the
rotation of the chain about each Cα atom. It is these
conformational changes that are responsible for differences in the
three dimensional structure of proteins.
• Each amino acid in the chain is polar, i.e. it has separated positive
and negative charged regions with a free C=O group, which can act
as hydrogen bond acceptor and an NH group, which can act as
hydrogen bond donor. These groups can therefore interact in the
protein structure.
• The 20 amino acids can be classified according to the chemistry of
the side chain which also plays an important structural
role. Glycine takes on a special position, as it has the smallest side
chain, only one Hydrogen atom, and therefore can increase the
local flexibility in the protein structure. Cysteine on the other hand
can react with another cysteine residue and thereby form a cross
link stabilizing the whole structure.

• The protein structure can be considered as a sequence of secondary
structure elements, such as α helices and β sheets, which together
constitute the overall three-dimensional configuration of the
protein chain. In these secondary structures regular patterns of H
bonds are formed between neighboring amino acids, and the amino
acids have similar Φ and Ψ angles.
[Figure: bond angles for ψ and ω]
• The formation of these structures neutralizes the polar groups on
each amino acid. The secondary structures are tightly packed in the
protein core in a hydrophobic environment. Each amino acid side
group has a limited volume to occupy and a limited number of
possible interactions with other nearby side chains, a situation that
must be taken into account in molecular modeling and alignments.

• α Helix
• The α helix is the most abundant type of secondary
structure in proteins. The α helix has 3.6 amino acids per
turn with an H bond formed between every fourth residue;
the average length is 10 amino acids (3 turns) or 15 Å but
varies from 5 to 40 (1.5 to 11 turns).
• The alignment of the H bonds creates a dipole moment for
the helix with a resulting partial positive charge at the
amino end of the helix. Because this region has free
NH2 groups, it will interact with negatively charged groups
such as phosphates.
• The most common location of α helices is at the surface of
protein cores, where they provide an interface with the
aqueous environment. The inner-facing side of the helix
tends to have hydrophobic amino acids and the outer-facing
side hydrophilic amino acids.
• Thus, every third of four amino acids along the
chain will tend to be hydrophobic, a pattern
that can be quite readily detected. In the
leucine zipper motif, a repeating pattern of
leucines on the facing sides of two adjacent
helices is highly predictive of the motif.
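
Detecting such a repeating pattern can be sketched in code. The example below looks for leucines spaced seven residues apart, the heptad spacing characteristic of leucine zippers; the function name, the minimum of four repeats and the test sequence are assumptions for illustration.

```python
def leucine_heptads(sequence, min_repeats=4):
    """Return start positions (0-based) of runs with Leu at every 7th residue."""
    hits = []
    for start in range(len(sequence)):
        positions = range(start, start + 7 * min_repeats, 7)
        if positions[-1] < len(sequence) and all(sequence[p] == "L" for p in positions):
            hits.append(start)
    return hits

candidate = "LAAAAAA" * 4  # hypothetical sequence with Leu at positions 0, 7, 14, 21
print(leucine_heptads(candidate))
```

A real motif search would also check that the intervening positions are compatible with a helix, which this sketch does not attempt.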

• β sheet
• β sheets are formed by H bonds between an
average of 5–10 consecutive amino acids in one
portion of the chain with another 5–10 farther
down the chain.
• The interacting regions may be adjacent, with a
short loop in between, or far apart, with other
structures in between. Every chain may run in the
same direction to form a parallel sheet, every
other chain may run in the reverse chemical
direction to form an antiparallel sheet, or the
chains may be parallel and antiparallel to form a
mixed sheet.

• The pattern of H bonding is different in the parallel and
antiparallel configurations. Each amino acid in the
interior strands of the sheet forms two H bonds with
neighboring amino acids, whereas each amino acid on
the outside strands forms only one bond with an
interior strand.
• Looking across the sheet at right angles to the
strands, more distant strands are rotated slightly
counterclockwise to form a left-handed twist. The Cα
atoms alternate above and below the sheet in a
pleated structure, and the R side groups of the amino
acids alternate above and below the pleats.

• Loop
• Loops are regions of a protein chain that are
• (1) between α helices and β sheets,
• (2) of various lengths and three-dimensional
configurations, and
• (3) on the surface of the structure.
• Hairpin loops that represent a complete turn
in the polypeptide chain joining two
antiparallel β strands may be as short as two
amino acids in length.

• Loops interact with the surrounding aqueous
environment and other proteins. Because amino
acids in loops are not constrained by space and
environment as are amino acids in the core region,
and do not have an effect on the arrangement of
secondary structures in the core, more substitutions,
insertions, and deletions may occur. Thus, in a
sequence alignment, the presence of these features
may be an indication of a loop.

• The positions of introns in genomic DNA sometimes
correspond to the locations of loops in the encoded protein.
Loops also tend to have charged and polar amino acids and
are frequently a component of active sites. A detailed
examination of loop structures has shown that they fall into
distinct families.

• Coils
• A region of secondary structure that is not an α
helix, a β sheet, or a recognizable turn is
commonly referred to as a coil.

Applications of
Protein Sequencing

• In functional genomics:
Functional genomics is a field of molecular
biology that attempts to make use of the vast
wealth of data produced
by genomic and transcriptomic projects (such
as genome sequencing projects and RNA
sequencing) to describe gene (and protein)
functions and interactions. Unlike genomics,
functional genomics focuses on the dynamic
aspects such as gene transcription, translation,
regulation of gene expression and protein–protein
interactions, as opposed to the static aspects
of the genomic information such as DNA
sequence or structures.

• The goal of functional genomics is to
understand the relationship between an
organism's genome and its phenotype. The
term functional genomics is often used
broadly to refer to the many possible
approaches to understanding the properties
and function of the entirety of an organism's
genes and gene products.

• The promise of functional genomics is to expand and
synthesize genomic and proteomic knowledge into an
understanding of the dynamic properties of an
organism at cellular and/or organismal levels. This
would provide a more complete picture of how
biological function arises from the information
encoded in an organism's genome.
• The possibility of understanding how a particular
mutation leads to a given phenotype has important
implications for human genetic diseases, as answering
these questions could point scientists in the direction
of a treatment or cure.

Prediction of protein function from protein
sequence and structure
The sequence of a genome contains the plans of the possible life
of an organism, but implementation of genetic information
depends on the functions of the proteins and nucleic acids that it
encodes. Many individual proteins of known sequence and
structure present challenges to the understanding of their
function.

• In particular, a number of genes responsible for
diseases have been identified but their specific
functions are unknown. Whole-genome sequencing
projects are a major source of proteins of unknown
function. Annotation of a genome involves assignment
of functions to gene products, in most cases on the
basis of amino-acid sequence alone.

3D structure can aid the assignment of function, motivating the
challenge of structural genomics projects to make structural
information available for novel uncharacterized proteins.
Structure-based identification of homologues often succeeds
where sequence-alone-based methods fail, because in many
cases evolution retains the folding pattern long after sequence
similarity becomes undetectable.

• Nevertheless, prediction of protein function from sequence and
structure is a difficult problem, because homologous proteins
often have different functions. Many methods of function
prediction rely on identifying similarity in sequence and/or
structure between a protein of unknown function and one or
more well-understood proteins. Alternative methods include
inferring conservation patterns in members of a functionally
uncharacterized family for which many sequences and
structures are known.
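
Sequence similarity between an uncharacterized protein and a well-understood one is often summarised as percent identity. The sketch below assumes the two sequences have already been aligned to equal length with "-" for gaps, and counts identity over columns where neither sequence has a gap, which is one of several conventions; the function name and sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of gap-free aligned columns in which both residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b)
             if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    identical = sum(x == y for x, y in pairs)
    return 100.0 * identical / len(pairs)

print(percent_identity("ACDE-G", "ACDQFG"))
```

High identity suggests homology, but as noted above it does not guarantee that the two proteins share a function.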

In Proteomics
Proteomics is the large-scale study of proteomes. A proteome is
a set of proteins produced in an organism, system, or biological
context.
The proteome is not constant; it differs from cell to cell and
changes over time. To some degree, the proteome reflects the
underlying transcriptome. However, protein activity (often
assessed by the reaction rate of the processes in which the
protein is involved) is also modulated by many factors in
addition to the expression level of the relevant gene.

Protein sequencing denotes the process of finding the amino acid
sequence, or primary structure of a protein. Sequencing plays a
very vital role in Proteomics as the information obtained can be
used to deduce function, structure, and location which in turn
aids in identifying new or novel proteins as well as understanding
of cellular processes. Better understanding of these processes
allows for creation of drugs that target specific metabolic
pathways among other things.

In Bioinformatics
What is bioinformatics?
In recent years, molecular biology has witnessed an
information revolution as a result of the development of
rapid DNA sequencing techniques and the
corresponding progress in computer-based
technologies, which are allowing us to cope with this
information deluge in increasingly efficient ways. The
term that was coined to encompass computer
applications in biological sciences is bioinformatics.

The term bioinformatics is now used to mean
rather different things, from artificial
intelligence and robotics to genome analysis.
The term was originally applied to the
computational manipulation and analysis of
biological sequence data (DNA and/or protein),
but now tends also to be used to embrace the
manipulation and analysis of 3D structural data.

Identifying protein-coding genes in
genomic sequences
The vast majority of the biology of a newly sequenced genome is
inferred from the set of encoded proteins. Predicting this set is
therefore invariably the first step after the completion of the
genome DNA sequence.
The genome sequence is an organism's blueprint: the set of
instructions dictating its biological traits. The unfolding of these
instructions is initiated by the transcription of the DNA into RNA
sequences. According to the standard model, the majority of RNA
sequences originate from protein-coding genes; that is, they are
processed into messenger RNAs (mRNAs) which, after their export
to the cytosol, are translated into proteins.
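
A first approximation to finding protein-coding regions is an open reading frame (ORF) scan. The sketch below is deliberately simple: it searches only the forward strand for an ATG followed in frame by a stop codon, and ignores introns, the reverse strand and statistical gene models; the function name, the minimum length and the example sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    `end` is exclusive and includes the stop codon; `min_codons` counts the
    codons before the stop.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs

sequence = "CCATGGCTTTTTAACC"  # hypothetical genomic fragment
for start, end, frame in find_orfs(sequence):
    print(frame, sequence[start:end])
```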

• To determine protein folding
• Protein folding is the process by which a protein structure assumes
its functional shape or conformation.
• Protein folding is the physical process by which a protein chain
acquires its native 3-dimensional structure,
a conformation that is usually biologically functional, in an
expeditious and reproducible manner.
• It is the physical process by which a polypeptide folds into its
characteristic and functional three-dimensional
structure from a random coil. Each protein exists as an unfolded
polypeptide or random coil when translated from a sequence
of mRNA to a linear chain of amino acids.

• All protein molecules are heterogeneous unbranched chains
of amino acids.
• By coiling and folding into a specific three-dimensional shape
they are able to perform their biological function.
• Proteins are formed from long chains of amino acids; they
exist in an array of different structures which often dictate
their functions. Proteins follow energetically favorable
pathways to form stable, orderly structures; this is known as
the protein's native structure.
• Most proteins can only perform their various functions when
they are folded. Scientists believe that the instructions for
folding a protein are encoded in the sequence. Researchers
and scientists can easily determine the sequence of a protein,
but have not cracked the code that governs folding.

In drug production
What is a protein drug?
A type of drug made of protein. These drugs usually have a large molecular weight
and protein characteristics.
structure of an unusual class of proteins called beta-peptides. Eventually, these peptides
could become the basis for drugs that are cheaper to manufacture than existing protein-
based pharmaceuticals and last longer in the body. A drug's efficiency may be affected
by the degree to which it binds to the proteins within blood plasma. The
less bound a drug is, the more efficiently it can traverse cell membranes or diffuse.
Common blood proteins that drugs bind to are human serum albumin, lipoprotein,
glycoprotein, and α, β, and γ globulins.

