Systems biology for Medicine: Experimental methods and the big datasets
1. “A cross-border region where rivers connect, not divide” –
Interreg V-A Hungary-Croatia Co-operation Programme 2014-2020
1
The next topic in 'Systems biology for Medicine' is 'Experimental methods and the big datasets'
2
You probably remember from the first lecture that systems biology often uses omics-type
experiments. Also, omics studies are considered to be 'hypothesis-generating studies'. But for a
study to be classified as a systems biology study, the experiments do not necessarily need to be
omics-scale, and the study itself can be based on a hypothesis.
Let's consider an experiment focused on a subsystem, e.g. tracking the mRNAs relevant for energy
metabolism under different feeding regimes, or at multiple time points from the onset of feeding.
What in the design of the experiment would turn the study into a systems biology one?
Two major criteria should be satisfied:
1. The study has to collect quantitative data, e.g. to answer questions such as how much or how fast.
2. It should enable us to build a computational model based on the experimental data or in parallel
with the experimental data.
Still, systems biology studies do not differ from any other scientific studies with respect to
the requirement for negative and positive controls and for reproducibility.
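To make the two criteria concrete, here is a minimal sketch of the feeding example. It assumes that the relative mRNA level was measured at a few time points and that the simplest possible model, an exponential change m(t) = m0 * exp(k * t), is good enough. The numbers are invented for illustration and are not real measurements; the function name `fit_exponential` is also just an assumption of this sketch.

```python
import math

def fit_exponential(times, levels):
    """Least-squares fit of log(level) = log(m0) + k * t.

    Returns (m0, k): the level at t = 0 and the rate constant.
    """
    logs = [math.log(v) for v in levels]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, logs))
    k = sxy / sxx                        # slope = rate constant ("how fast")
    m0 = math.exp(mean_y - k * mean_t)   # intercept = starting level ("how much")
    return m0, k

# Hypothetical time course: hours after onset of feeding, relative mRNA level
times = [0.0, 1.0, 2.0, 4.0]
levels = [1.0, 1.5, 2.25, 5.0625]

m0, k = fit_exponential(times, levels)
print(f"m0 = {m0:.2f}, k = {k:.3f} per hour")
```

The point is only that quantitative data (criterion 1) can be turned into model parameters (criterion 2); a real study would choose the model from the biology and would also estimate the uncertainty of the fit.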
3
Methods in systems biology cover both macromolecules and molecules of small molecular mass. The
range of nucleotide polymers is investigated by two types of techniques: microarray-based or
sequencing-based technologies. The range of proteins is investigated by mass spectrometry, while
lipids, glycans and metabolites require an additional separation step by liquid chromatography.
4
The biological sample is commonly a piece of tissue or a population of cells from a cell culture.
The size of the sample is usually 1 million cells or more. Because cells vary, the result of the
study is an average over many cells, which can generate a large experimental error.
There are techniques that use a single cell or are based on clones of one cell. In that case we
can expect experimental error due to cell-to-cell variability.
5
The microarray technique is based on the Southern blot technique and DNA hybridization: DNA probes
which represent genes or other places in the genome are printed on a solid surface such as a glass,
plastic or silicon biochip.
The technique uses two samples in parallel: one sample is the control while the other one is
experimental. The samples are labelled with different fluorophores (usually red and green) and
applied to the array of printed oligonucleotides. After the hybridization and washing steps, some
labelled molecules remain bound to the printed oligonucleotides and emit light of a certain
wavelength upon excitation by a laser. A laser scanner measures the intensity of the emitted light
at the different spots. Finally, a computer calculates the ratio of red to green intensity at each
spot. The ratio is interpreted as an increase or decrease in expression if the samples were cDNAs
and we were determining differential expression, which is the most common use of arrays. The same
technique can be used for genotyping (e.g. detection of SNPs), DNA mapping (copy number variation),
DNA methylation analysis and so on.
The GEO database is a huge collection of studies done by microarray.
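The red-to-green ratio calculation can be sketched in a few lines. This is only an illustration under stated assumptions: the experimental sample is taken to be red and the control green, the ratio is expressed on a log2 scale, and a two-fold change (log2 ratio of 1) is used as the threshold. Neither the threshold nor the function names come from the lecture, and the spot intensities are invented.

```python
import math

def log2_ratio(red, green, background=0.0):
    """Return log2(red / green) after subtracting a common background.

    Returns None when either signal is not above background,
    because the ratio is then undefined.
    """
    r = red - background
    g = green - background
    if r <= 0 or g <= 0:
        return None
    return math.log2(r / g)

def call_expression(ratio, threshold=1.0):
    """Interpret a log2 ratio; the threshold of 1.0 (two-fold) is an assumption."""
    if ratio is None:
        return "no call"
    if ratio >= threshold:
        return "increased"
    if ratio <= -threshold:
        return "decreased"
    return "unchanged"

# Hypothetical spots: (red intensity, green intensity)
spots = {"spot_1": (800, 200), "spot_2": (150, 600), "spot_3": (500, 480)}

for name, (red, green) in spots.items():
    ratio = log2_ratio(red, green)
    print(name, call_expression(ratio))
```

The log scale is used so that a two-fold increase and a two-fold decrease are symmetric around zero.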
6
Comparative genomic hybridization (CGH) is a version of the array technique developed especially
for cytogenetic use; it is also known as molecular cytogenetics.
The technique is based on the assumption that two samples from closely related individuals (a
healthy one and a sick one) differ in the gain or loss of a chromosome or of a chromosomal region.
The technique is particularly suitable for the detection of aneuploidy and for large-scale analysis
of tumour-specific genome rearrangements.
The OMIM database contains results from a number of CGH studies and gives an overview of human
disease genes.
7
Epigenetic regulation of gene expression has the power to change the fate of a cell and depends on
the influence of the environment. CpG methylation is one of many forms of epigenetic modification.
The microarray methylation assay is a technique for measuring DNA methylation patterns on a large
scale and gives useful information about differentiation, tumorigenesis, disease progression or
the efficiency of therapy.
The first step in the method is bisulfite conversion of the sample, in which the C at all
unmethylated CpG sites is converted into U, while methylated C is protected and stays C. The next
step is amplification, in which all U sites are exchanged for T, followed by enzymatic
fragmentation, denaturing and hybridization with the two types of allele-specific beads for each
locus. Annealing of the denatured oligonucleotides and the bead-attached oligonucleotide probes is
allele specific. After successful annealing, the oligonucleotide probe is extended with a single
fluorescently labelled ddNTP. Finally, software calculates the relative fluorescence of each locus
and grades it as 0, 0.5 or 1 (homozygous unmethylated, heterozygous, or homozygous methylated).
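The final grading step can be sketched as follows. The sketch assumes that the relative fluorescence is computed as methylated signal divided by total signal, with a small constant offset added to the denominator to stabilise loci with weak signal; the offset of 100 and the function names are assumptions of this example, and the intensities are invented.

```python
def beta_value(methylated, unmethylated, offset=100.0):
    """Relative methylation signal between 0 and 1.

    The offset (an assumed value) keeps the result stable
    when both intensities are low.
    """
    return methylated / (methylated + unmethylated + offset)

def grade(beta):
    """Assign the nearest of the three grades 0, 0.5 and 1."""
    return min((0.0, 0.5, 1.0), key=lambda level: abs(beta - level))

# Hypothetical loci: (methylated intensity, unmethylated intensity)
loci = {"locus_1": (9000, 100), "locus_2": (5000, 5000), "locus_3": (100, 9000)}

for name, (meth, unmeth) in loci.items():
    b = beta_value(meth, unmeth)
    print(name, round(b, 3), grade(b))
```

In a mixed cell population the relative signal can take any value between 0 and 1, so grading into three classes is a simplification that fits best when the sample is homogeneous.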
8
Microarray methods rely on successful hybridization between oligonucleotides from the sample and
oligonucleotides attached to the microchip. The method is limited by the number of oligonucleotides
that represent certain loci in a genome, that is, by the microchip design.
Sequencing-based methods do not have this limitation and can test any number of previously
unknown DNA or RNA sequences. The method starts with DNA or RNA isolation followed by either DNA
fragmentation or cDNA synthesis. The next step is amplification, which can be done in many
different ways. The following step is parallel sequencing of many short DNA fragments (which can
also be done in many different ways), and everything finishes with the assembly of contiguous
fragments, which is performed computationally.
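The idea behind computational assembly can be illustrated with a toy example that repeatedly merges the two fragments with the longest suffix-prefix overlap. This greedy approach is a stand-in chosen for brevity; it is not how production assemblers work (they use graph-based methods and must handle sequencing errors, repeats and both strands). The fragments and function names are invented for the sketch.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Overlaps shorter than min_len are ignored and reported as 0.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(fragments, min_len=3):
    """Merge fragments pairwise, best overlap first, until none overlap."""
    frags = list(fragments)
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: several contigs are returned
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)] + [merged]
    return frags

# Hypothetical short reads that overlap by four bases
reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(reads))
```

Each returned string is one contiguous fragment (contig); with real data many contigs remain because not all regions overlap.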
9
Two genome sequencing approaches are currently in use: whole-genome sequencing and
whole-exome sequencing. While whole-genome sequencing attempts to sequence the entire genome,
whole-exome sequencing is biased: it is directed toward the part of the genome that is transcribed
and retained in mature mRNA (the exons). Also, whole-exome sequencing is quicker, cheaper, has
'higher depth' (more reads per base pair; the sequences of interest are read many times) and is
easier to analyse.
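Why the exome reaches a higher depth for the same sequencing effort can be shown with simple arithmetic: mean depth is the total number of sequenced bases divided by the size of the target. The sizes below are round illustrative figures (a genome of 3 billion bp and an exome of about 1.5% of that), and the calculation assumes every read falls on the target, which real enrichment does not achieve.

```python
def mean_depth(n_reads, read_length, target_size):
    """Average number of times each base of the target is read."""
    return n_reads * read_length / target_size

# Round illustrative figures, not exact values
GENOME_SIZE = 3_000_000_000   # bp
EXOME_SIZE = 45_000_000       # bp, roughly 1.5% of the genome

n_reads = 600_000_000
read_length = 150             # bp per read

print("genome depth:", mean_depth(n_reads, read_length, GENOME_SIZE))
print("exome depth: ", mean_depth(n_reads, read_length, EXOME_SIZE))
```

In practice one turns this around: for a desired depth, far fewer reads are needed for the exome, which is where the lower cost and shorter run time come from.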
10
A key difference between whole-genome and whole-exome sequencing is also an enrichment step in
which the targeted DNA fragments, those that correspond to exons, are selected from the entire
genome. The separation is done by hybridization, and microarrays are one way to do it. Because
Mendelian disorders often disrupt protein-coding regions, the exome is a good source of rare
disease variants.
11
Despite the obvious advantages of whole-exome sequencing, whole-genome sequencing is becoming the
preferred method. One reason is the huge drop in the price of sequencing, even larger than expected
according to Moore's law. In addition, whole-genome sequencing has an obvious advantage
in the detection of tumour-specific rearrangements.
The first generation of sequencing methods was based on Sanger's idea of so-called 'chain
termination technology': instead of sequencing by degradation of the original chain, ddNTPs were
used to terminate chain synthesis in an orderly way. The method was accurate, but slow and
expensive. The second (or next) generation of sequencing methods was also sequencing by synthesis,
but parallel sequencing, the introduction of nanotechnology and the omission of the separation step
significantly improved performance. The method was quick and inexpensive but less suitable for
long fragments. This limitation is overcome in third-generation sequencing, which is based on the
detection of fluorescent dNTPs during incorporation by an immobilized polymerase, using excellent
optics.
The last two generations of sequencing use repeated sequencing of DNA fragments, so-called
deep sequencing, which increases sensitivity and accuracy.
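How repeated reading improves accuracy can be sketched with a simple majority vote: if the same position is read many times, an occasional random error is outvoted. The sketch assumes the reads are already aligned to each other and of equal length, with '-' marking positions a read does not cover; the reads themselves are invented and real variant callers use base qualities and statistical models rather than a plain vote.

```python
from collections import Counter

def consensus(aligned_reads):
    """Most frequent base in each column of pre-aligned, equal-length reads.

    '-' means the read does not cover that position;
    'N' is reported when no read covers it.
    """
    result = []
    for column in zip(*aligned_reads):
        counts = Counter(base for base in column if base != "-")
        if counts:
            result.append(counts.most_common(1)[0][0])
        else:
            result.append("N")
    return "".join(result)

# Hypothetical reads of the same fragment; two contain a single error each
reads = ["ACGTAC", "ACGTAC", "ACCTAC", "ACGTTC", "ACGTAC"]
print(consensus(reads))
```

With only one or two reads per position the vote cannot separate an error from a true variant, which is why depth matters for both sensitivity and accuracy.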
12
The data coming out of whole-genome or whole-exome sequencing need further verification against
data from a large population in order to be sure that a certain genomic variant is associated or
not associated with the appearance of a disease. Genome-wide association studies assess the risk
of gene variants which segregate with families that carry a certain disease.
13
RNA-Seq, or whole transcriptome shotgun sequencing, is a quantitative method that uses next
generation sequencing to sequence the parts of the genome that are in use, the transcriptome. Total
or filtered RNA (just coding or non-coding sequences) is converted by reverse transcriptase into
cDNA, sequenced and mapped onto a reference genome. Because the conversion of RNA into cDNA can
introduce biases, single-molecule Direct RNA Sequencing (DRS™) technology is under development.
Either of these two methods is ideal for systems biology because it is quantitative and able to
track changes that occur due to environmental conditions or over time, such as alternative
splicing, post-transcriptional modifications, gene fusions and mutations, that is, changes in gene
expression in general.
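What 'quantitative' means for sequencing data can be illustrated with one common normalisation, transcripts per million (TPM): the number of reads mapped to a gene is divided by the gene length, and the results are scaled so that they sum to one million. The lecture does not name a normalisation, so TPM is an assumption of this sketch, and the gene names, counts and lengths are invented.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million from mapped read counts and gene lengths."""
    # Reads per kilobase: corrects for longer genes collecting more reads
    rates = {gene: counts[gene] / (lengths_bp[gene] / 1000.0) for gene in counts}
    total = sum(rates.values())
    if total == 0:
        raise ValueError("no mapped reads")
    # Scale so that all values in one sample sum to one million
    return {gene: rate / total * 1_000_000 for gene, rate in rates.items()}

# Hypothetical mapped read counts and gene lengths
counts = {"gene_a": 100, "gene_b": 100, "gene_c": 300}
lengths_bp = {"gene_a": 1000, "gene_b": 2000, "gene_c": 3000}

for gene, value in tpm(counts, lengths_bp).items():
    print(gene, round(value))
```

Here gene_a and gene_b have the same read count, but gene_b is twice as long, so its estimated expression is half that of gene_a.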
14
RNA sequencing has almost entirely replaced microarrays, which were previously the dominant method
for determining the transcriptome. The reasons are its ability to distinguish different isoforms
and allelic expression, its single-base resolution, the low amount of RNA required and its
relatively low cost.
15
ChIP-seq (chromatin immunoprecipitation sequencing) is one more sequencing method; it differs
from all the others in how the DNA fragments are selected. The first step in this method is
co-precipitation of transcription factors or other DNA-binding proteins together with DNA fragments
by using specific antibodies. The method is also sensitive to histone modifications. After
purification, the selected fragments are subjected to deep sequencing. This method is suitable for
finding functional elements in the genome, like regulatory sequences, promoters, enhancers,
silencers, splicing sites and so on.
16
As we said previously, methods suitable for systems biology have to satisfy two criteria, to be
capable of measuring many entities at once and to be quantitative, and these criteria are
satisfied by most DNA and RNA sequencing methods. Western blot was the first method aimed at the
quantification and identification of more than one protein, but because of the visualisation step
the method is considered semi-quantitative. The main problem is the non-linear kinetics behind the
enzyme labelling of the secondary antibody and also behind the chemiluminescence reaction and X-ray
film exposure. Recently, the problem has been solved by LICOR technology, which uses secondary
antibodies labelled with IRDye near-infrared (NIR) fluorescent dyes. These NIR fluorescent signals
are stable for months; no enzymes or substrates are used, the background is minimal and the signal
is proportional to the amount of antibody.
17
LICOR solved only the problem of quantification; Western blot is still a low-throughput method
because it measures a very limited set of proteins, about 20, in a sample that contains a few
thousand or more different proteins.
Forward and reverse phase protein arrays are a step forward in measuring many entities at once.
Both methods are based on highly specific antibodies. In the forward phase protein array many
antibodies are attached to the slide and one sample is probed for the presence of the selected
proteins. In the reverse phase protein array many samples are spotted on the slide and probed with
just one antibody. Because these methods require highly specific antibodies they are very
expensive.
18
Another high-throughput method for protein analysis is mass spectrometry. The method exists in many
variants and involves many steps: separation, digestion, enrichment, repeated separation,
ionization, mass filtering (MS1), fragmentation and mass analysis (MS2), identification and
quantification. At the end we get the molecular masses of the many entities that make up the
sample, and we compare them with a database of known peptides to identify each one. The method is
versatile and is used, with some modifications, not just for proteomics but also for lipidomics,
glycomics and metabolomics.
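The comparison of measured masses with a database can be sketched as a lookup with a tolerance. The sketch assumes the tolerance is expressed in parts per million (ppm) and set to 10 ppm; the database entries and their masses are invented placeholders, not real peptides, and real search engines score whole fragmentation spectra rather than single masses.

```python
def ppm_error(observed, theoretical):
    """Relative mass difference in parts per million."""
    return (observed - theoretical) / theoretical * 1e6

def match_masses(observed_masses, database, tolerance_ppm=10.0):
    """For each observed mass, list database entries within the tolerance."""
    matches = {}
    for mass in observed_masses:
        matches[mass] = [
            name for name, reference in database.items()
            if abs(ppm_error(mass, reference)) <= tolerance_ppm
        ]
    return matches

# Hypothetical database: entry name -> theoretical mass (Da)
database = {"peptide_A": 1000.0, "peptide_B": 1500.0, "peptide_C": 2000.0}

# Hypothetical measured masses (Da)
observed = [1000.005, 1500.1, 1750.0]

for mass, hits in match_masses(observed, database).items():
    print(mass, hits if hits else "no match")
```

An observed mass may match several entries or none, which is one reason why identification depends on the quality of the database and of the search engine.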
19
The final step in mass spectrometry is the comparison of the data obtained for a given sample with
data in a database. Each omics field relies on a different database and on a bioinformatics search
for matches. Different mass spectrometry methods differ in throughput; for example, metabolomics
studies can measure from a few hundred to a few thousand different entities at once.
20
One could say that the transcriptome and the proteome measure the same thing, but they are in fact
complementary. Some very important discoveries came from proteome studies; for example, proteomic
analysis decoded the difference between the MAPK pathway in normal and tumour cells. Proteomics
studies are sensitive to post-translational modifications like phosphorylation and glycosylation,
which are particularly important in signalling pathways and in the activation or deactivation of
enzymes. Transcriptome studies are blind to such modifications.
21
Liquid chromatography is often used in combination with mass spectrometry. When they are used
together, liquid chromatography precedes mass spectrometry and serves as the separation step for
MS. Also, by using LC/MS, lipidomics and metabolomics can be done in sequence on the same samples,
which increases the throughput of the method.
22
In both genomic and proteomic studies we can expect variation in the experimental data from
different sources.
In genomic studies the sources of variation are:
Biological variation between samples (particularly if the sample is tissue)
Number of probes representing a gene (we have to appreciate that the appropriate sequence representing a gene is sometimes complicated to find)
Cross-reactivity of the probe
Different manufacturers of microchips
Application of software for 'removal' of technical artefacts
In proteomic studies the sources of variation are:
Biological variation between samples (and sample preparation)
Number of peptides per protein produced by enzymatic degradation
Overrepresentation of highly abundant proteins
Appropriate data analysis, which relies on the improvement of search engines and databases
23
In summary: systems biology experiments collect big data using high-throughput methods. In
the case of DNA the appropriate methods are microarrays and NGS. The result of such an analysis is
a genome or an exome. If we would like to investigate regulatory elements in the genome, the method
of choice is ChIP-seq and the result is ReMap. Coding and non-coding RNAs can be investigated by
RNA-seq in order to get the transcriptome. The method of choice for proteins is MS, while the
lipidome, metabolome and glycome can be obtained by using LC/MS.