Friday (1/15) computer lab session
Location: 3073 (3rd floor), Department of Computational Biology, BST3, 3501 Fifth Avenue.
Time: 9:30-10:45AM
Play with R (tutorial) at home before the lab session.
Introduction to microarray
Motivation & previous techniques
Concept of biological pathway
Northern blot, RT-PCR and real time RT-PCR
Affymetrix microarray experiment
cDNA microarray experiment
Comparison of the two
Codelink, Illumina & Agilent
MAQC (Microarray Quality Control) Project
Introduction to next generation sequencing (RNA-seq, ChIP-seq etc)
The central dogma of molecular biology: DNA => (transcription) => mRNA (messenger), rRNA (ribosomal), tRNA (transfer); mRNA => (translation, at the ribosome) => Protein.
Microarray is a technology to globally detect mRNA expression levels (simultaneously measuring thousands of genes).
Why detect expression level of protein or mRNA?
Cell cycle
Cancer cells are malignant cells that do not die but reproduce rapidly instead. It is important to repair problematic mutations during cell division.
Example 1: p53 pathway (an important tumor suppressor; responds to DNA damage)
http://breast-cancer-research.com/content/pdf/bcr426.pdf
Why detect expression level of protein or mRNA?
Prediction of a disease: If the mechanism is known, detecting expression levels can help identify cancer patients (e.g. unusual p53 or Kras expression activity).
Exploratory: In general, microarrays can help identify candidate genes that contribute to tumor progression and propose hypotheses about the underlying genetic network.
Northern Blot (an old technique for measuring mRNA expression) http://www.escience.ws/b572/L13/north.html
mRNA is extracted and purified.
mRNA is loaded for electrophoresis. Lane 1: size standards. Lane 2: RNA to be tested.
The gel is charged and the RNA “swims” through the gel according to size (from the - electrode toward the + electrode).
mRNA is transferred from the gel to a membrane.
A labelled probe specific for the RNA fragment is incubated with the blot, so the RNA of interest can be detected. See next page for the details of this step.
Northern Blot closeup (color staining) http://www.escience.ws/b572/L13/northupclose.html
In this simplified cartoon, two mRNAs (A and B) are bound on the membrane. DNA complementary to A is prepared with a label and hybridized to all the mRNA on the membrane. The labeled complementary DNA will bind to A but not B. After washing and detection, the abundance of the target mRNA can be seen.
RT-PCR (reverse transcription-polymerase chain reaction) and real-time RT-PCR
See animation of RT-PCR: http://www.bio.davidson.edu/courses/Immunology/Flash/RT_PCR.html
http://www.ambion.com/techlib/basics/rtpcr/
RNA is reverse transcribed to DNA.
PCR procedures can be used to amplify DNA at an exponential rate.
Gel quantification for the amplified product.
---- a semi-quantitative method. A smaller amount of sample is needed.
The PCR amplification can be monitored by fluorescence in “real time”.
The fluorescence values recorded in each cycle represent the amount of amplified product.
---- a quantitative method. Currently the most advanced and accurate analysis for mRNA abundance. Often used to validate microarray results.
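As a rough illustration of why monitoring the amplification cycle by cycle is quantitative, the sketch below assumes ideal amplification (the product doubles every cycle) and a hypothetical detection threshold. The function name, copy numbers and threshold are illustrative assumptions, not values from the slides. A sample with more starting template reaches the threshold in fewer cycles.

```python
def cycles_to_threshold(start_copies, threshold_copies, efficiency=1.0):
    """Return the first cycle at which the product reaches the threshold.

    Toy model: each cycle multiplies the product by (1 + efficiency);
    efficiency=1.0 means ideal doubling. All numbers are illustrative.
    """
    copies = float(start_copies)
    cycle = 0
    while copies < threshold_copies:
        copies *= 1.0 + efficiency
        cycle += 1
    return cycle


# Two hypothetical samples; the second has 16x more starting template.
low_abundance = cycles_to_threshold(100, 1e9)
high_abundance = cycles_to_threshold(1600, 1e9)
print(low_abundance, high_abundance, low_abundance - high_abundance)
```

Under ideal doubling, a 16-fold difference in starting amount shows up as a difference of log2(16) = 4 cycles, which is the kind of relationship real-time RT-PCR exploits.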
Limitation of the old techniques
Can only detect up to dozens of genes.
Need to know the target sequences. For RT-PCR, at least the primer sequences must be known to start the PCR.
Various microarrays: a new view at the genomic level
Overview of the Affymetrix GeneChip technology (from Affymetrix Inc.)
From experiments to analysis
Details of labeling and hybridization
Sequences shown: TACGTATTGCAAAA, TTTTGCAATACGTA (its reverse complement), TACGTATTGCAAAA (biotin-labeled at C and T)
Only the pyrimidines are biotin-labeled (shown as C and T here; in cRNA the labeled bases are C and U). This is where the measured intensities come from.
The fragmentation makes the biotin-labeled cRNA shorter and improves the efficiency of hybridization.
The sequence of the target mRNA must be known so that the complementary sequence can be prepared on the array.
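The complementarity behind hybridization can be checked on the two sequences shown on this slide. This is a minimal sketch; the helper name `reverse_complement` is an illustrative choice, and DNA letters are used throughout for simplicity.

```python
# Watson-Crick base pairing for DNA letters.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the strand that pairs with `seq`, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


target = "TACGTATTGCAAAA"           # first sequence on the slide
partner = reverse_complement(target)
print(partner)                       # second sequence on the slide
print(reverse_complement(partner))   # back to the original strand
```

Applying the function twice returns the original strand, which is why a probe designed from a known target sequence can capture that target.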
Array Design (from Affymetrix Inc.)
Multiple probes (11~16) for each gene.
Each perfect-match probe is a unique 25-mer oligo.
Each mismatch probe differs from its perfect-match partner at the middle nucleotide.
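A probe pair can be sketched as follows. The slide only says that the mismatch sits at the middle nucleotide; replacing that base with its complement is an assumption made here for illustration, and the 25-mer below is a made-up sequence, not a real probe.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def mismatch_probe(perfect_match):
    """Build a mismatch probe by changing only the middle base.

    Assumption: the middle base is swapped for its complement.
    For a 25-mer the middle is index 12 (the 13th base).
    """
    middle = len(perfect_match) // 2
    return (perfect_match[:middle]
            + COMPLEMENT[perfect_match[middle]]
            + perfect_match[middle + 1:])


pm = "ACGTACGTACGTACGTACGTACGTA"   # hypothetical 25-mer
mm = mismatch_probe(pm)
differences = [i for i, (a, b) in enumerate(zip(pm, mm)) if a != b]
print(pm)
print(mm)
print(differences)
```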
Array Manufacturing (from Affymetrix Inc.)
Needs at most 4 x 25 = 100 masking and coupling steps.
Technology adapted from the semiconductor industry (photolithography and combinatorial chemistry).
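The bound of 100 steps follows from a simple count: each of the 25 positions needs at most one masking and coupling step per nucleotide (A, C, G, T). The sketch below assumes a plain layer-by-layer synthesis scheme; the function names and the tiny probe set are illustrative only.

```python
def max_masking_steps(probe_length, alphabet_size=4):
    """Upper bound: one step per nucleotide type at every position."""
    return alphabet_size * probe_length


def masking_steps(probes):
    """Steps needed for a given probe set under layer-by-layer synthesis.

    At each position, one step is needed per distinct base that
    appears at that position among the probes.
    """
    length = len(probes[0])
    return sum(len({p[i] for p in probes}) for i in range(length))


print(max_masking_steps(25))            # bound for 25-mers
print(masking_steps(["AAAA", "ACGT"]))  # toy example with 4-mers
```

The bound does not depend on how many probes are on the array, which is what makes the approach scale to hundreds of thousands of features.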
Chip Advances
Columns: HG-U95 | HG-U133 Set | HG-U133 Plus 2.0
Array sequence source: Build 95 UniGene database (Oct, 2, 1999??) | Build 133 UniGene database (April, 20, 2001) | Build 133 UniGene database (April, 20, 2001)
Probe uniqueness: 21/25 bases | Two 8-mers including at least one 12-mer | Two 8-mers including at least one 12-mer
# of probes: ~16 | 11 | 11
# of arrays: 5 | 2 | 1
# of transcripts: ~54000 genes (HG-U95Av2: ~12000; HG-U95B-E: ~44000 EST) | ~33,000 genes | ~38500 genes
Feature size: 20 µm | 18 µm | 11 µm
A few years ago, the U95 set had 5 arrays. Normally only U95Av2 is used.
Improved probe selection algorithm to avoid non-specific binding.
Decreased # of probes in each probe set (20 => 11)
Smaller feature size: 20 µm => 11 µm
More genes on each array and lower cost (only one array for HG-U133 Plus)
Background adjustment, normalization, summarization
Give an expression measure for each probe set on each array.
The result will greatly affect subsequent analysis (e.g. clustering and classification). If it is not modeled properly => “Garbage in, garbage out”.
Array probe level analysis: background adjustment, normalization, summarization. Details will be discussed in the next lecture.
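To make the three steps concrete before the next lecture, here is a deliberately simple sketch. The specific choices (subtracting a constant background, scaling arrays to a common median, taking the median of log2 probe intensities) are placeholder assumptions for illustration and are not the methods covered in this course. All intensities and probe set names are made up.

```python
import math
from statistics import median


def background_adjust(intensities, background):
    """Subtract a constant background; floor at 1.0 so log2 is defined."""
    return [max(x - background, 1.0) for x in intensities]


def scale_normalize(arrays):
    """Scale each array so its median equals the mean of all array medians."""
    medians = [median(a) for a in arrays]
    target = sum(medians) / len(medians)
    return [[x * target / m for x in a] for a, m in zip(arrays, medians)]


def summarize(array, probe_sets):
    """One expression measure per probe set: median of log2 intensities."""
    return {name: median(math.log2(array[i]) for i in indices)
            for name, indices in probe_sets.items()}


# Two hypothetical arrays with four probes each; array 2 is globally dimmer.
raw = [[120, 220, 420, 820], [70, 120, 220, 420]]
probe_sets = {"geneA": [0, 1], "geneB": [2, 3]}  # hypothetical probe sets

adjusted = [background_adjust(a, background=20) for a in raw]
normalized = scale_normalize(adjusted)
measures = [summarize(a, probe_sets) for a in normalized]
print(measures)
```

In this toy case the two arrays differ only by a global brightness factor, so after normalization they give identical expression measures.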
Spotted cDNA microarray
From experiments to analysis
48 grids in a 12x4 pattern.
Each grid has 12x16 features (spots).
Total 9216 features (spots).
Each pin prints 3 grids.
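The layout numbers above are consistent with each other, as a quick check shows. The variable names are illustrative; the number of pins is derived from the stated figures rather than given on the slide.

```python
grids = 12 * 4                # 48 grids in a 12x4 pattern
spots_per_grid = 12 * 16      # features per grid
total_spots = grids * spots_per_grid
grids_per_pin = 3
pins = grids // grids_per_pin  # implied number of printing pins

print(grids, spots_per_grid, total_spots, pins)
```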
Probe (array) printing
Probe design and printing
The experiment (from Y. Chen et al. 1997)
An image example (from: http://www.techfak.uni-bielefeld.de/ags/ai/projects/microarray/)
Image analysis is more difficult than for the Affy array. The probes are spotted by a robot instead of synthesized in place, so the exact physical location of each spot is not known.
Comparison of cDNA array and GeneChip
Columns: cDNA | GeneChip
Probe preparation: Probes are cDNA fragments, usually amplified by PCR and spotted by robot. | Probes are short oligos synthesized using a photolithographic approach.
Colors: Two-color (measures relative intensity) | One-color (measures absolute intensity)
Gene representation: One probe per gene | 11-16 probe pairs per gene
Probe length: Long, varying lengths (hundreds to 1K bp) | 25-mers
Density: Maximum of ~15000 probes | 38500 genes * 11 probes = 423500 probes
Affymetrix GeneChip: one-color design. cDNA microarray: two-color design. Why the difference?
Affymetrix GeneChip: photolithography (the amount of oligos on a probe is well controlled).
cDNA microarray: robotic spotting (the amount of cDNA spotted on a probe may vary greatly).
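A toy model shows why a two-color ratio is robust to variable spotting. It assumes, purely for illustration, that each channel's signal is proportional to expression times the amount of probe on the spot. The function names and numbers are hypothetical.

```python
import math


def channel_signal(expression, spotted_amount):
    """Toy model: signal is proportional to expression x probe amount."""
    return expression * spotted_amount


def log_ratio(expr_sample, expr_reference, spotted_amount):
    """log2 ratio of the two channels measured on the same spot."""
    sample = channel_signal(expr_sample, spotted_amount)
    reference = channel_signal(expr_reference, spotted_amount)
    return math.log2(sample / reference)


# Same gene printed on two spots with very different amounts of cDNA.
thin_spot = log_ratio(400, 100, spotted_amount=0.5)
thick_spot = log_ratio(400, 100, spotted_amount=3.0)
print(channel_signal(400, 0.5), channel_signal(400, 3.0))  # absolute signals differ
print(thin_spot, thick_spot)                               # ratios agree
```

The absolute intensities differ six-fold between the two spots, but the spot amount cancels in the ratio, so only a relative measure is reliable on a spotted array.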
Advantages and disadvantages of cDNA array and GeneChip
Columns: cDNA microarray | Affymetrix GeneChip
The data can be noisy and of variable quality. | Specific and sensitive. Results very reproducible.
Cross (non-specific) hybridization can often happen. | Hybridization more specific.
May need an RNA amplification procedure. | Can use a small amount of RNA.
More difficulty in image analysis. | Image analysis and intensity extraction are easier.
Need to search the database for gene annotation. | More widely used. Better quality of gene annotation.
Cheap (both initial cost and per-slide cost). | Expensive (~$400 per array + labeling and hybridization).
Can be custom made for special species. | Only several popular species are available.
Do not need to know the exact DNA sequence. | Need the DNA sequence for probe selection.
Other platforms of microarray
GE Codelink (out of market now)
Codelink’s gel matrix
Fig. End-point attachment orients the DNA while the polymeric coating holds it away from the surface of the slide, making the DNA readily available for hybridization.
Comparisons
Columns: cDNA | GeneChip | Codelink | Agilent
Probe preparation: Probes are cDNA fragments, usually amplified by PCR and spotted by robot. | Probes are short oligos synthesized using a photolithographic approach. | 3-D aqueous gel matrix | Probes are printed by inkjet technology from HP.
Colors: Two-color (measures relative intensity) | One-color (measures absolute intensity) | One-color | One- or two-color
Gene representation: One probe per gene | 11-16 probe pairs per gene | One probe per gene | One probe per gene
Probe length: Long, varying lengths (hundreds to 1K bp) | 25-mers | 30-mers | 60-mers
Density: Maximum of ~15000 probes | 38500 genes * 11 probes = 423500 | ~57000 | ~22000 probes
Manufacturer: Stanford and many labs | Affymetrix | GE | Agilent
Mechanisms in microarray
Important mechanisms that make microarray work:
Reverse transcription: mRNA => cDNA. This is usually also the step at which the dyes are attached.
(Protein cannot be reverse translated to mRNA or to another form, so it is difficult to label with dyes.)
Double-strand binding of complementary DNA sequences.
(Protein does not enjoy such a good property; there are 20 amino acids without complementary binding.)
Microarray Quality Control (MAQC) Project: a series of papers published in Nature Biotechnology (Sep 2006)
Previous paper in NAR 2003
Evaluation of gene expression measurements from commercial microarray platforms. Tan et al. Nucleic Acids Research. 2003. 31:5676-5684.
Poor consistency made it a concern for precise science and routine clinical use.
Three commercial platforms were compared.
Inconsistent result found across platforms
7 microarray platforms; each platform implemented at 3 test sites; 4 pools of RNA, each hybridized in 5 replicates. (3*4*5=60 arrays for each platform)
The 4 pools of RNA are: A. 100%UHRR; B. 100%HBRR; C. 75%UHRR + 25%HBRR; D. 25%UHRR + 75%HBRR.
UHRR: Universal Human Reference RNA from Stratagene
HBRR: Human Brain Reference RNA from Ambion
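The design numbers and the mixture pools can be written out as a short sketch. The linear mixing model (a gene's signal in a mixed pool is the weighted average of its signals in A and B) is a simplifying assumption made here for illustration, and the example signal values are made up.

```python
# Arrays per platform: 3 test sites x 4 RNA pools x 5 replicates.
arrays_per_platform = 3 * 4 * 5

# Fraction of UHRR in each pool; the rest is HBRR.
UHRR_FRACTION = {"A": 1.00, "B": 0.00, "C": 0.75, "D": 0.25}


def expected_signals(signal_in_a, signal_in_b):
    """Expected signal of one gene in each pool, assuming linear mixing."""
    return {pool: f * signal_in_a + (1.0 - f) * signal_in_b
            for pool, f in UHRR_FRACTION.items()}


# Hypothetical gene that is 5x higher in UHRR (pool A) than in HBRR (pool B).
expected = expected_signals(1000.0, 200.0)
print(arrays_per_platform)
print(expected)
```

Because the true ordering of the pools is known by construction (here A > C > D > B), the mixtures give each platform a built-in check of whether it recovers the expected trend.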
3 RT-PCR based alternative gene expression platforms are also tested: TaqMan, StaRT-PCR and QuantiGene Assays.
NCI has only 2 test sites. AGL has only 2 samples. Some problematic arrays were removed.
AGL is not included in this paper. A total of 386 arrays were analyzed.
Difficulties in comparing multiple platforms
Each platform has different probe design
Sensitivity and specificity of the probes. (Some cross-platform variability may be due to this annotation problem.)
Databases (e.g. NCBI RefSeq) change often, making it difficult to match.
Probes may bind to multiple alternatively spliced transcripts, which may have different functions and expression patterns.
Gene matching across different platforms is not easy. Essentially each platform detects different targets. Kuo (2006): probe matching within one exon for Gas1.
Match genes across platforms
All probes mapped to RefSeq and AceView database.
Each platform assayed 15,429-16,990 Entrez genes.
23,971 of 24,157 RefSeq NM accessions were assayed on at least one platform. Among them, 15,615 accessions (which correspond to 12,091 Entrez genes) were assayed on all platforms.
When multiple probes match to one RefSeq, only the probe closest to the 3’ end is used.
Finally each platform has 12,091 probes matching to a common set of 12,091 RefSeq from 12,091 different genes.
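The rule of keeping only the probe closest to the 3’ end can be sketched as below. The probe IDs, accession labels and distances are hypothetical placeholders, and the data layout is an assumption made for illustration.

```python
def pick_3prime_probe(probes):
    """Return the (probe_id, distance) pair with the smallest
    distance to the 3' end of the transcript."""
    return min(probes, key=lambda probe: probe[1])


# Hypothetical mapping: RefSeq accession -> [(probe_id, distance to 3' end in bp)].
probes_by_refseq = {
    "NM_example_1": [("probe_a", 850), ("probe_b", 120), ("probe_c", 400)],
    "NM_example_2": [("probe_d", 60)],
}

selected = {refseq: pick_3prime_probe(probes)[0]
            for refseq, probes in probes_by_refseq.items()}
print(selected)
```

After this step every accession has exactly one probe per platform, which is what makes a one-to-one comparison across platforms possible.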
Number of detected genes called by manufacturers’ software
CV of replicates. Blue: CV of 5 technical replicates. Red: CV of all 15 replicates (5 technical replicates X 3 test sites).
Blue dot: percentage of genes concordantly called detected within each test site. Blue bar: percentage of genes concordantly called detected in all three test sites.
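The coefficient of variation (CV) used in these plots is the standard deviation divided by the mean. The sketch below computes it for one gene within each site and across all 15 replicates. The intensities are made up, and whether MAQC computed the CV on raw or transformed values is not stated on the slide.

```python
from statistics import mean, stdev


def cv(values):
    """Coefficient of variation: sample standard deviation / mean."""
    return stdev(values) / mean(values)


# Hypothetical intensities of one gene: 5 technical replicates at 3 sites.
sites = {
    "site1": [100, 102, 98, 101, 99],
    "site2": [120, 122, 118, 121, 119],
    "site3": [80, 82, 78, 81, 79],
}

within_site_cv = {name: cv(values) for name, values in sites.items()}
all_replicates = [x for values in sites.values() for x in values]
across_site_cv = cv(all_replicates)

print(within_site_cv)
print(across_site_cv)
```

In this toy data the replicates agree well within a site but the sites differ in overall level, so the CV over all 15 replicates is much larger than any within-site CV.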
Microarray provides an opportunity to measure thousands of genes simultaneously and make the global monitoring of cellular activities possible.
The method produces noisier data, and the choice of an adequate design and analysis is key.
RT-PCR is used to validate a small number of genes.
Data obtained from different platforms and centers are consistent. Ready for routine clinical use.
The method measures mRNA instead of proteins. The actual protein abundance and post-translational modifications cannot be detected.
The method usually does not measure spatial or temporal dynamics of the cellular activity.
The method is suitable for global monitoring and should be used to generate further hypotheses or be combined with other carefully designed experiments.
Introduction to next generation sequencing
What is next generation sequencing?
Short reads (35~70 bp)
Compared with traditional sequencing
No reference sequence available (ab initio)
Longer reads and additional linkage information required to assemble the entire sequence
Next Generation Sequencing
Reference sequence available (Sequenced by traditional sequencing)
No need of assembly, just map the short reads back to the reference sequence.
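Mapping reads to a known reference can be sketched with plain exact string matching. This is only an illustration: the reference and reads below are made up, and real read mappers use indexed data structures and tolerate mismatches, which this sketch does not.

```python
def map_reads(reference, reads):
    """Return, for each read, every 0-based position where it
    matches the reference exactly."""
    positions = {}
    for read in reads:
        hits = []
        start = reference.find(read)
        while start != -1:
            hits.append(start)
            start = reference.find(read, start + 1)
        positions[read] = hits
    return positions


reference = "ACGTTAGCCGATACGTTAGG"           # hypothetical reference sequence
reads = ["TAGCC", "ACGTTAG", "GGGGG"]        # hypothetical short reads

mapping = map_reads(reference, reads)
print(mapping)
```

A read that maps to several positions (the second read here) is ambiguous, and a read with no hit (the third) cannot be placed; both situations also arise with real data.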
ChIP-Seq (Chromatin Immunoprecipitation)
A substitute for ChIP-chip
To find the binding sequence of proteins (TFBS)
A substitute for Microarray
To measure the amount of RNA expressed
Compared with microarray
Closed technology: Prior knowledge required
Affected by pseudo-genes (homologs of real genes)
Cheap and mature
Open technology: No prior knowledge required
Not affected by pseudo-genes because exact sequence is measured
Other information could be yielded (SNP, Alternative splicing)