The document discusses quality control, filtering, and normalization procedures for Illumina 450k methylation array data. It describes initial quality control checks to identify failed samples and technical artifacts, such as color biases. A variety of normalization approaches are presented, including within-array normalization to correct for color bias and background noise, between-array normalization to remove technical variation across arrays, and data-driven approaches to evaluate different preprocessing methods. The goal of preprocessing is to improve concordance with independent validation data while retaining meaningful biological variation.
4. 450K Array
● Oligos (~800,000 copies per bead) are attached to each bead
● Type I: two bead types per CpG site (unmethylated and methylated)
● Type II: one bead type per CpG site
● Workflow: DNA preparation, hybridisation, staining; detection of red and green fluorescent signals by iScan
● Multiple beads per bead type on each chip, combined into bead pools (all bead types)
● Beads self-assemble into the pits on the array
● 450k chip: 12 samples per chip
5. ● Analyzes >480,000 CpG loci
● Covers 99% of all RefSeq genes, with an average of 17 probes per gene
● Distributed over various functional elements such as:
○ CpG islands, shores, and shelves
○ 3´- and 5´-UTRs, gene bodies
○ DNase hypersensitive sites
○ miRNA promoters
Dedeurwaerder et al. Epigenomics (2011)
450K Array
6. 450K Array: QC Probes
● Staining: measures the efficiency and sensitivity of the staining step (independent of the hybridisation/extension step)
● Extension: tests the efficiency of extension of A, T, C and G nucleotides from a hairpin probe (sample-independent); the perfect-match hairpin controls should give high signal, the mismatch probes low signal
● Hybridization: tests the overall performance of the Infinium assay using synthetic targets (not DNA) at 3 concentrations
● Target removal: tests the efficiency of the stripping step after extension
● Bisulphite conversion: tests the efficiency of bisulphite conversion by querying a C/T polymorphism
● Specificity: checks for non-specific detection of methylation signal over unmethylated background; these controls are designed against non-polymorphic T sites (G/T mismatch)
● Non-polymorphic: queries a non-polymorphic base (A, T, C or G) to test the overall performance of the assay from amplification to detection
● Negative: randomly permuted bisulphite-converted sequences containing no CpGs; they should not hybridise to DNA, and their mean determines the system background
7. 450k Array: Type I vs Type II Probes
Type I:
● Same chemistry as the 27k array
● 28% of probes on the array
● Designed for regions with more CpG dinucleotides: 57% of Type I probes lie in CpG islands
● Signals suggested to be more stable and reproducible than those of Type II probes
Type II:
● Chemistry not seen on the 27k array
● 72% of probes on the array
● 21%, 26% and 11% of Type II probes lie in CpG islands, shores and shelves, respectively
● Decreased quantitative dynamic range compared to Type I probes
For either probe design, the intensities are used to estimate a Beta-value or M-value.
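For reference, both summary measures are computed from the methylated (Meth) and unmethylated (Unmeth) intensities; these are the standard definitions (offsets follow common practice, e.g. Du et al. BMC Bioinformatics 2010), not something specific to this deck:

Beta = Meth / (Meth + Unmeth + alpha), with a small offset alpha (Illumina's software uses alpha = 100)
M = log2((Meth + 1) / (Unmeth + 1)), related to Beta by M ≈ log2(Beta / (1 − Beta))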
8. Type I probes use fluorescence from two different probes, unmethylated (converted) and methylated (unconverted), to assess the level of methylation of a target CpG.
● Binding at either probe is followed by single-base extension, which results in the addition of a fluorescently labeled nucleotide
Dedeurwaerder et al. Epigenomics (2011)
450K Array: Type I Probes
9. Methylation state is detected by single-base extension and detection of a fluorescently labelled nucleotide at the position of the 'C' of the target CpG.
● Type II probes include a 'degenerate' R base at any underlying CpG sites in the probe body
Dedeurwaerder et al. Epigenomics (2011)
450K Array: Type II Probes
11. ● There are a number of R packages that incorporate QC, filtering and normalisation into their pipelines or offer specific functions
● Most allow you to define at least some thresholds yourself
● Option to pick and choose: minfi, ChAMP, RnBeads, lumi, Touleimat & Tost, wateRmelon
Outline
12. AIM: identify unusual samples and technical artifacts
The array contains probes for 65 single nucleotide polymorphisms (SNPs).
● We can use this information to identify any unintentional sample duplications (see the sketch below)
● If we have multiple samples per individual, those samples should cluster together
Raw Data: Initial QC
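A minimal sketch of this check with minfi; it assumes the raw IDATs have been read into an RGChannelSet named rgSet (a hypothetical object name):

library(minfi)
snpBetas <- getSnpBeta(rgSet)   # beta values for the 65 SNP probes
## samples from the same individual should show near-identical SNP profiles
sampleCor <- cor(snpBetas)
heatmap(sampleCor, symm = TRUE) # duplicated individuals appear as blocks of high correlation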
13. AIM: identify unusual samples and technical artifacts
The array uses red/green fluorescence intensities to estimate methylation level, and the two colour channels have different background intensities.
● Type I probes use the same colour to evaluate the methylated and unmethylated beads, so colour bias should have less of an impact
● Type II probes use green to measure the methylated state and red for the unmethylated state, so colour bias can contribute to a decreased dynamic range
Raw Data: Initial QC
14. AIM: identify unusual samples and technical artifacts
As the fluorescence intensities are read across the chip, there appears to be a tiering effect.
● This technical artifact could impact the results
○ e.g. cases unintentionally clustered on the array
Raw Data: Initial QC
15. AIM: identify unusual samples and technical artifacts
Plot the distributions of the samples.
● Unusual distributions may reflect:
○ Real biological effects (global methylation changes)
○ Poor methylation data
(Figure: one density curve per sample, with a different colour/line combination for each; red indicates primary tumor, blue adjacent normal tissue.)
Raw Data: Initial QC
16. AIM: identify unusual samples and technical artifacts
Plot the distributions of the samples.
● Boxplots or violin plots can be used to the same effect (see the sketch below)
Raw Data: Initial QC
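Both plot types are available in minfi; a sketch, assuming an RGChannelSet named rgSet and a hypothetical phenotype vector pheno$Status:

library(minfi)
densityPlot(rgSet, sampGroups = pheno$Status)     # one density curve per sample
densityBeanPlot(rgSet, sampGroups = pheno$Status) # bean/violin-style view of the same data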
17. AIM: identify unusual samples and technical artifacts
Multidimensional scaling (MDS) can be used to look for unusual clustering of samples (see the sketch below).
(Figure: MDS plots with colour corresponding to tumor status and to TCGA batch.)
Raw Data: Initial QC
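minfi offers a convenience wrapper; a sketch with the same assumed objects as above:

library(minfi)
## MDS on the 1000 most variable positions, coloured by tumor status
mdsPlot(rgSet, numPositions = 1000, sampGroups = pheno$Status)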
18. AIM: identify unusual samples and technical artifacts
By plotting the distributions of Type I and Type II probes separately, we can observe the difference in distribution (example: four samples).
● Reflects the difference in chemistry and the enrichment for different elements (e.g. CpG islands)
Raw Data: Initial QC
19. ● Each data point has an associated detection p-value
○ Represents the probability that the observed signal could have arisen from background noise alone (low p-values indicate a reliably detected signal)
● The scanner can encounter difficulties reading the signal
○ Low staining intensities
○ Spatial artefacts
Common approaches (see the sketch below):
● Drop probes that failed in n% of samples
○ Common thresholds: 20%, 10% or 5% of samples at p > 0.05 or p > 0.01
● Drop samples that failed for n% of probes
○ Common thresholds: 50% or 20% of probes at p > 0.05 or p > 0.01
Filtering: Detection P-value
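A sketch of both filters using minfi's detectionP(), with the example thresholds from above (rgSet is an assumed RGChannelSet):

library(minfi)
detP <- detectionP(rgSet)               # probes x samples matrix of detection p-values
failed <- detP > 0.01
keepProbes <- rowMeans(failed) < 0.05   # drop probes failing in >=5% of samples
keepSamples <- colMeans(failed) < 0.50  # drop samples failing for >=50% of probes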
20. Drop probes with known SNPs residing in the probe sequence.
● The most common SNPs in dbSNP are C>T transitions
○ A C>T transition will be read as an unmethylated cytosine
● Observe grouping of methylation values by genotype
Filtering: Common Practices Related to Technical Issues
21. Drop probes for which the target CpG site itself contains a SNP.
● ~4.3% of the probes are reported to contain a known polymorphism specifically at the targeted C or G
○ 43% of these SNPs have a heterozygosity of >0.1
○ Price et al. Epigenetics Chromatin (2013)
● SNP filtering depends on the study population / reference population (e.g. CEU)
Filtering: Common Practices Related to Technical Issues
22. Drop probes that anneal to multiple genomic locations.
● Bisulfite conversion reduces the complexity of the genome: all unmethylated Cs are converted to T
● ~10-20% of the Infinium HumanMethylation450 probes have been identified as non-specific, depending on the criteria
● Repetitive elements may give a real signal, but one of uncertain meaning that is difficult to validate
● Probes cross-reactive to the X chromosome can pick up X inactivation, leading to spurious associations if the outcome/exposure is associated with sex
Naeem et al. BMC Genomics (2014)
Filtering: Common Practices Related to Technical Issues
23. Common practices related to analysis issues:
● drop probes on the X and Y chromosomes
● drop probes with the lowest variation
● drop probes with extreme methylation levels (e.g. median = 0% or 100%)
● only consider probes in regions of interest (e.g. CpG island, shore, other)
Filtering: Common Practices Related to Analysis Issues
24. Colour bias adjustment and background correction
● Colour bias can be adjusted in the lumi package using either smooth quantile or shift-and-scaling normalisation
● Most methods employ simple background subtraction
○ No significant improvement in data quality
● A newer method in the methylumi package outperforms previous methods
○ Uses the "out-of-band" signal from Type I probes, rather than the background control probes, to estimate background
■ Out-of-band: the colour channel opposite to a probe's designed base extension
■ There are only a few background control probes (n = 614), but many Type I probes
Triche et al. Nucleic Acids Res. (2013)
Normalization: Within-Array
25. Colour bias adjustment and background correction
The lumi approach requires starting with a MethyLumiM object; the quantile colour-bias adjustment is performed with:

data.bgcorrect <- adjColorBias.quantile(lumidata)

The methylumi approach from Triche et al. Nucleic Acids Res. (2013) requires starting with a MethylumiSet object. It uses the 'noob' method (noob = normal-exponential using out-of-band probes), a convolution that assumes

Signal Intensity + Background = Observed foreground intensity

data.bgcorrect <- methylumi.bgcorr(methylumidata)  # noob-style background correction

Normalization: Within-Array
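The same out-of-band ('noob') correction is also implemented in minfi, which may be more convenient when starting from IDAT files; a sketch, assuming an RGChannelSet named rgSet:

library(minfi)
mset.noob <- preprocessNoob(rgSet)  # noob background correction with dye-bias adjustment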
26. Probe type correction
● The identified shift between Type I and Type II β-values may bias the analysis if the methylation signals from the two assay types are analyzed together
○ Type I probes have greater stability and increased power
● We can't simply perform full quantile normalisation:
○ The population to 'correct' (Type II) is the larger group, which may bias the distribution of the Type I probes
○ Each probe type covers different CpG and gene-sequence regions
Normalization: Within-Array
27. Probe type correction options
Subset quantile normalisation. Touleimat & Tost, Epigenomics (2012)
● For each probe category, use the Type I signals as anchors to estimate a reference distribution of quantiles
● Use this reference to estimate a target distribution of quantiles for the Type II probes
○ Two different annotations for subsetting:
■ 'relation to CpG'
■ 'relation to gene sequence'
Normalization: Within-Array
28. Probe type correction options
Subset-quantile Within Array Normalisation (SWAN). Maksimovic et al. Genome Biology (2012)
● Assumes that the overall intensity distribution should be the same when the underlying CpG content of the probes is the same
○ In other words, assumes the CpG content of the probes reflects the biology by acting as a surrogate for the CpG density of the region
Normalization: Within-Array
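SWAN is available in minfi; a sketch, again assuming an RGChannelSet named rgSet:

library(minfi)
set.seed(42)  # SWAN samples a random probe subset, so set a seed for reproducibility
mset.swan <- preprocessSWAN(rgSet)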
29. Probe type correction options
Beta-mixture quantile (BMIQ) normalisation. Teschendorff et al. Bioinformatics (2013)
● A major benefit over the subset normalisation methods is that it is assumption free
○ State membership of individual probes is determined by maximum probability
● Approach:
○ Fits a three-state (unmethylated, hemimethylated, fully methylated) beta mixture model to the Type I and Type II probes separately
○ For each state, transforms the probabilities of belonging to that state into quantiles using the inverse of the cumulative beta distribution, with beta parameters estimated from the Type I probes
● The model-based approach helps avoid gaps emerging in the normalized distribution
Normalization: Within-Array
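One entry point is the wateRmelon implementation, which dispatches directly on methylation objects; a sketch, with methylumidata as the assumed input object:

library(wateRmelon)
mset.bmiq <- BMIQ(methylumidata)  # fits the beta mixture and rescales the Type II probes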
32. ● The aim is to remove other technical artifacts, e.g. positional tiering of intensities
○ mirrors the quantile normalisation approaches used for gene expression
● Normalisation of the intensities (not the betas)
● Assumes the same global distributions between the samples
○ this may not be true
(Figure: sample distributions before and after between-array normalization.)
Normalization: Between-Array
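In minfi, a stratified quantile normalisation of the intensities is available; a sketch (rgSet assumed as before):

library(minfi)
grset.quantile <- preprocessQuantile(rgSet)  # between-array quantile normalisation of the Meth/Unmeth intensities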
33. Normalization: Between-Array
Functional normalization of the 450k methylation array. Fortin et al. Genome Biology (2014)
● Created to remove unwanted between-array technical variation even in the presence of global methylation changes
○ so it can be applied to cancer data or tissue comparisons
● Uses control probes (n = 848, none of which measure biological signal) as surrogates for unwanted variation
● Performs PCA on the control probes and removes the variation associated with the first two PCs by default
● Extends quantile normalization by adjusting for known covariates measuring unwanted variation
● The normalization is applied to the Meth and Unmeth intensities separately, and to Type I and Type II signals separately (see the sketch below)
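Functional normalization is implemented in minfi; a sketch (rgSet assumed):

library(minfi)
grset.funnorm <- preprocessFunnorm(rgSet, nPCs = 2)  # remove the first 2 control-probe PCs (the default)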
34. Normalization: Between-Array
Functional normalization of the 450k methylation array. Fortin et al. Genome Biology (2014)
● Acts like an unsupervised batch correction method
● Suggested to outperform supervised batch correction methods (e.g. ComBat, SVA and RUV)
35. Normalization: Data-driven Approaches
A data-driven approach to preprocessing Illumina 450K methylation array data. Pidsley et al. BMC Bioinformatics (2013)
● Uses three independent metrics based on known methylation patterns to test the performance of different normalization and background correction schemes
● Assesses patterns associated with:
○ Genomic imprinting: 'DMRSE' (differentially methylated region standard error)
○ X-chromosome inactivation (XCI): 'Seabird' (named after the auk and also the mythical bird roc)
○ The SNP genotyping assays present on the array: 'GCOSE' (genotype combined standard error)
36. Normalization: Data-driven Approaches
A data-driven approach to preprocessing Illumina 450K methylation array data. Pidsley et al. BMC Bioinformatics (2013)

Method | Background adjustment | Between-array normalization | Dye bias correction
naten  | no  | type I and II together   | no
nanet  | no  | no                       | type I and II together
nanes  | no  | no                       | type I and II separately
danes  | yes | no                       | type I and II separately
danet  | yes | no                       | type I and II together
danen  | yes | no                       | no
daten1 | yes | type I and II together   | no
daten2 | yes | type I and II together   | no
nasen  | no  | type I and II separately | no
dasen  | yes | type I and II separately | no
37. Normalization: Data-driven Approaches
Tested 15 pre-processing methods across 11 methylation datasets using the three performance metrics.
● For each dataset, take the mean of the three ranks across methods
● Then take the mean of the ranks across the datasets
● "dasen" appears to do the best across probe types (see the sketch below)

Method | Type I rank | Type II rank | Average
raw    | 6.5  | 11  | 8.75
betaqn | 14   | 13  | 13.5
naten  | 12   | 9   | 10.5
nanet  | 11   | 3   | 7
nanes  | 9.5  | 7.5 | 8.5
danes  | 2.5  | 7.5 | 5
danet  | 1    | 6   | 3.5
danen  | 5    | 12  | 8.5
daten1 | 4    | 4   | 4
daten2 | 8    | 5   | 6.5
nasen  | 9.5  | 1.5 | 5.5
dasen  | 2.5  | 1.5 | 2
fuks   | 6.5  | 15  | 10.75
tost   | 13   | 14  | 13.5
swan   | 15   | 10  | 12.5
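dasen is implemented in the authors' wateRmelon package; a sketch, with methylumidata as the assumed input object:

library(wateRmelon)
mset.dasen <- dasen(methylumidata)  # background adjustment, then between-array normalisation of Type I and Type II separately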
38. ● There are MANY ways to perform initial quality control and pre-processing
○ Consider the samples used
■ e.g. between-array normalisation may not be appropriate for cancer samples
GOAL: identify failed samples and reduce the impact of technical artifacts without removing meaningful biological variation
Marabita et al. Epigenetics (2013)
Summary
39. Summary
A few of the within-array normalization procedures improved the concordance between the 450k data and the pyrosequencing data.
● Marked improvement using SQN, noob and BMIQ
(Figure: blue, orange and red indicate Infinium Type I/II bias correction methods, colour bias adjustment methods and background correction methods, respectively.)
Dedeurwaerder et al. Briefings in Bioinformatics (2013)
40. Summary
Dedeurwaerder et al. Briefings in Bioinformatics (2013)
● HCT116 data: more global differences, so it performed worse with between-array normalization
● Roessler's data: no improvement with between-array normalization
41. ● Additional consideration: filter a priori?
○ If we remove loci with little inter-sample variability, we may miss loci with small but highly significant effect sizes
○ A probe may contain a SNP whose minor allele frequency is too low to impact associations with methylation
○ But removing these sites reduces the number of comparisons we need to account for when adjusting for multiple testing
GOAL: identify failed samples and reduce the impact of technical artifacts without removing meaningful biological variation
Summary
43. Identifying Batch Effects
You may identify batch effects in your data that have not been removed by normalisation.
● Batch effects are subgroupings of the data driven by technical factors rather than by the biological or other variables in the study, e.g.:
○ chips that were run on separate days
○ bisulphite conversions that were performed in different batches
44. Approaches to Remove Batch Effects
● Batch can be adjusted for in the downstream analysis (e.g. as a covariate in regression)
○ This has been done in some published articles, but may not effectively deal with the batch issue
Options to address batch effects:
○ ComBat
○ SVA
○ ISVA
○ RUV2
45. ComBat
Johnson et al. Biostatistics (2007)
● Linear model for batch effects; uses an Empirical Bayes method to estimate them
○ Instead of a full Bayesian approach, Empirical Bayes methods estimate the hyper-parameters from the data
○ This helps in small sample sizes by borrowing information across genes
(Figure: data before and after ComBat correction.)
46. ComBat
Johnson et al. Biostatistics (2007)
● Works best when:
○ the sample size is small
○ the batch effects are known
○ the batch effects are linear
● Disadvantages:
○ computationally intensive
○ only corrects for batch effects from known sources
○ the assumptions of linear effects and normality may be violated
In other words, not great for large studies with complicated batch effects.
The ComBat function is in the sva package.
47. Input
● matrix containing methylation data (M-values)

mdata

● vector indicating the batch variable to adjust for

batch <- samplepheno$TCGABATCH

● model matrix containing the full model

mod <- model.matrix(~as.factor(TUMOR), data=samplepheno)

● null model (in this case there are no other covariates, so only the intercept)

mod0 <- model.matrix(~1, data=samplepheno)

Run ComBat

library(sva)
combat_mdata <- ComBat(dat=mdata, batch=batch, mod=mod, par.prior=TRUE)

ComBat
48. Output
● Returns a corrected matrix with the same dimensions as the original dataset, with batch effects removed
● Run a significance analysis on the adjusted data:

pValuesComBat <- f.pvalue(combat_mdata, mod, mod0)
qValuesComBat <- p.adjust(pValuesComBat, method="BH")

OR
● run the analysis model using the adjusted data (cpg.assoc from the CpGassoc package takes the data matrix and the independent variable as separate arguments, not a formula):

result <- cpg.assoc(combat_mdata, as.factor(samplepheno$TUMOR))

ComBat
49. ComBat: Simulation

## 40 samples in two groups; batch is strongly associated with group
group <- rep(c(-1,1), each=20)
coinflip <- rbinom(40, size=1, prob=0.8)
batch <- group*coinflip + -group*(1-coinflip)
## group has no direct effect on the 10,000 simulated probes; batch does
gcoeffs <- rep(0,10000)
bcoeffs <- rnorm(10000, sd=2)
coeffs <- cbind(bcoeffs, gcoeffs)
mod <- model.matrix(~ -1 + batch + group)
modelprojected <- coeffs %*% t(mod)
dat0 <- t(apply(modelprojected, 1, function(x) x + rnorm(ncol(modelprojected), sd=1)))
par(mfrow=c(2,1))
plot(group, main=expression(bold("Group")), pch=16)
plot(batch, main=expression(bold("Batch")), pch=16)

Batch is strongly associated with group, but group does not have a direct impact on the outcome.
50. ComBat: Simulation

library(limma)  # lmFit, eBayes
library(sva)    # ComBat
## Set null and alternative models (ignoring batch)
mod1 <- model.matrix(~group)
mod0 <- cbind(mod1[,1])
par(mfrow=c(2,1))
## differential analysis ignoring batch
fit <- lmFit(dat0, mod1)
fit <- eBayes(fit)
hist(fit$p.value[,"group"], xlab="p-value",
     main=expression(bold("Unadjusted")), col="orange")
## adjust with ComBat, then repeat the analysis
combatresults <- ComBat(dat0, batch=batch, mod=mod1, par.prior=TRUE)
fit <- lmFit(combatresults, mod1)
fit <- eBayes(fit)
hist(fit$p.value[,"group"], xlab="p-value",
     main=expression(bold("ComBat")), col="purple")

Still some residual confounding, but much better than the unadjusted analysis.
51. ● Leek and Storey, PLoS Genetics (2007)
● Used to identify and estimate surrogate variables for unknown, unmodeled or latent sources of noise
○ Appropriate when there are many known or unknown confounders
○ May not be appropriate if the biological groups of interest are heterogeneous
■ e.g. when comparing cancer cases and controls where there are different cancer subgroups, as we do not want to lose this variation
Surrogate Variable Analysis (SVA)
52. Step 1: Obtain the residual matrix (remove variation associated with the variables of interest), calculate the singular value decomposition (SVD) of the residual matrix, and test whether the singular vectors represent more variation than expected by chance.
Step 2: Identify the subset of genes driving each orthogonal signature of variation.
● Test the association between each probe and each singular vector of the SVD
Step 3: For each of these subsets, build a surrogate variable (SV) based on the full signature of that subset in the original data.
● This allows the SVs to be correlated with the primary variables
Step 4: Include all significant SVs as covariates in subsequent regression analyses.
Surrogate Variable Analysis (SVA)
53. Leek and Storey, PLoS Genetics (2007)
(Figure: example of expression heterogeneity; genes × arrays heatmaps of the primary variable and of an unmodeled factor.)
Surrogate Variable Analysis (SVA)
54. Input
● matrix containing methylation data

betas(methylumidata)

● model matrix containing the full model

mod <- model.matrix(~as.factor(tumor), data=pData(methylumidata))

● null model (in this case there are no other covariates, so only the intercept)

mod0 <- model.matrix(~1, data=pData(methylumidata))

Run sva

library(sva)
sva_output <- sva(dat=na.omit(betas(methylumidata)), mod=mod, mod0=mod0)

Main output
● sva_output$sv is a matrix of surrogate variables that can be adjusted for in the model
Surrogate Variable Analysis (SVA)
55. ● Teschendorff et al. Bioinformatics (2011)
● Developed due to potential issues with SVA
○ Surrogate variables may capture heterogeneous phenotypes and/or model misspecification
■ i.e. the residual variation may contain biologically relevant variation
● If potential confounders are known (either exactly or subject to error/uncertainty), ISVA selects only those independent components that correlate with the confounders
○ Otherwise a similar approach to SVA: removes the variation in the data matrix not associated with the phenotype of interest, and performs Independent Component Analysis (ICA) on this residual variation matrix
○ BUT only keeps ISVs that are associated with putative confounders
Independent Surrogate Variable Analysis (ISVA)
56. Input
● matrix containing methylation data

methyldata <- na.omit(betas(methylumidata))

● vector for the phenotype of interest (only takes numeric data)

binarytumor <- rep(0, ncol(methylumidata))
binarytumor[pData(methylumidata)$tumor == "yes"] <- 1

● matrix of potential confounding factors (may be numeric or categorical)

factors.m <- pData(methylumidata)[, c("age_at_initial_pathologic_diagnosis",
  "TCGAbatch", "tissue_source_site", "anatomic_neoplasm_subdivision",
  "ajcc_pathologic_tumor_stage")]

Run ISVA

library(isva)
## factor.log takes one entry per column of factors.m:
## FALSE for numeric variables (age), TRUE for categorical factors
isva.o <- DoISVA(methyldata, pheno.v=binarytumor, cf.m=factors.m,
  factor.log=c(FALSE,TRUE,TRUE,TRUE,TRUE), pvthCF=0.1, th=0.001)

Independent Surrogate Variable Analysis (ISVA)
57. Looking at potential confounders and indicators of batch (p-values for the association of each candidate ISV with each variable):

ISV | phenotype of interest | age at diagnosis | TCGA batch | tissue source site | anatomic neoplasm subdivision | ajcc pathologic tumor stage
1 | 0.00054 | 0.70767 | 0.22121 | 0.56843 | 0.57395 | 0.83765
2 | 0.00288 | 0.45119 | 0.35796 | 0.44187 | 0.06801 | 0.5264
3 | 0.13759 | 0.7693 | 0.49232 | 0.474 | 0.3279 | 0.79098
4 | 0.53001 | 0.54648 | 0.05478 | 0.11074 | 0.55215 | 0.50216
5 | 0.01475 | 0.93185 | 0.18196 | 0.2447 | 0.84541 | 0.38136
6 | 0.74174 | 0.0332 | 0.62394 | 0.39353 | 0.9453 | 0.78771
7 | 0.0492 | 0.58565 | 0.91446 | 0.81124 | 0.71186 | 0.86681
8 | 0.04651 | 0.95817 | 0.91115 | 0.79168 | 0.02937 | 0.464

● 8 candidate ISVs, 3 associated with at least one variable (p<0.1)
● Only the 3 significant, selected ISVs would be included
Independent Surrogate Variable Analysis (ISVA)
58. ● Gagnon-Bartsch and Speed, Biostatistics (2012)
● Like SVA, an analysis that estimates and adjusts for unknown surrogate variables
● Tackles the same problem as ISVA: discerning the unwanted variation from the biological variation that is of interest to the researcher
● Restricts the variation decomposition to negative control genes (see the sketch below)
● Requires negative control genes: genes whose expression levels are known a priori to be truly unassociated with the biological factor of interest
Remove Unwanted Variation-2 (RUV2)
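A minimal sketch using the ruv package (the deck does not name an implementation, so this is one option; Y, exposure and ctl.idx are hypothetical objects):

library(ruv)
## Y: samples x probes matrix of M-values
## exposure: n x 1 matrix coding the factor of interest
## ctl.idx: logical vector flagging negative-control probes assumed unassociated with exposure
fit <- RUV2(Y = Y, X = exposure, ctl = ctl.idx, k = 2)  # k = number of unwanted factors
head(fit$p)  # per-probe p-values for the exposure, adjusted for unwanted variation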
59. Summary

Approach | Pros | Cons
ComBat | Appropriate when groups are heterogeneous and the batch effect is known | Batch effect may be a complicated mixture of factors
SVA | No need to know the unmeasured confounders; may capture the impact of cell mixture | Surrogate variables may capture heterogeneous phenotypes and/or model misspecification
ISVA | Avoids capturing meaningful biological variation | Needs surrogates for the potential confounders
RUV2 | Avoids capturing meaningful biological variation | Needs a subgroup of probes that are not influenced by the exposure of interest
60. ● Distributions of methylation data are not always normal
○ use of transformations
○ option to remove probes with more than one mode
● Batch effects can be large and may be due to known or unknown factors
○ use careful study design to minimise the impact of batch effects
○ use post-analysis methods to reduce batch effects
Summary
62. ● This section focusses on cellular heterogeneity in blood, as most cohort studies are currently analysing data from blood samples
● Also relevant for data analysis in other tissues where cellular heterogeneity is present
○ currently less well-defined methods
● Why does cellular heterogeneity matter?
● What can we do about it?
Overview
63. ● An issue for many population-based studies
● DNA is extracted from blood containing many cell types
○ A large source of variation in methylation data from blood
Cellular Heterogeneity in Blood
64. (Figure: heatmap of cell-sorted 450k data.)
Jaffe & Irizarry, Genome Biology (2014)
Cellular Heterogeneity in Blood
65. Why is it an issue?
● Bias if the outcome of interest correlates with cell composition
○ Confounding by immunological profile (e.g. cell distribution influencing both methylation and infertility)
● Uninteresting variation: mediation by immunological profile
○ Reflects a previously known mechanism; the real goal is to find differences in methylation beyond the cell composition associations
○ Usually seen with environmental exposures (e.g. PM2.5 → cell distribution → methylation)
● Temporality is a difficulty for cross-sectional data
○ e.g. does inflammation precede or follow cancer?
Cellular Heterogeneity in Blood
66. Gold standard: flow cytometry
● Adjust for cell proportions or restrict the analysis to one subtype
○ At discovery or at validation/replication
HOWEVER, flow cytometry is:
○ expensive
○ time consuming
○ reliant on fresh samples (often impossible in cohort studies)
Correcting for Cellular Heterogeneity in Blood
67. Houseman et al. BMC Bioinformatics (2012)
● General goal: use purified cells ("gold standard") to build a model that predicts the distribution of leukocytes for the analysis of population data
○ Can also be used to predict the distribution of leukocytes in a single sample given its DNA methylation profile
● Resembles the regression calibration approach from the measurement error literature
○ Assumption of transportability: e.g. is the mechanism giving rise to the measurement error the same in cord blood as in adult blood?
● The sorted WBCs in S0 from the Houseman paper were run on the 27K array
○ The m sites with the strongest association between methylation and cell type (by F statistic) are chosen to estimate cell proportions
Correcting for Cellular Heterogeneity in Blood
68. Correcting for Cellular Heterogeneity in Blood
(Diagram, reconstructed:) The measured DNA methylation matrix Y is modelled as the sum of a direct epigenetic response to the phenotype, B X^T (X = covariates), and an immune-response term, M Ω^T, where Ω holds the cell type proportions for each individual.
The cell proportions in turn depend on the covariates: Ω = X Γ + Ξ, where Γ captures the cell composition effects.
69. Correcting for Cellular Heterogeneity in Blood
● If a reference set is available, we can estimate the cell composition effects M
● We can then estimate the individual cell proportions Ω from the methylation profile Y and the cell composition effects M
● We can then adjust for the estimated cellular composition in the subsequent analysis

Y = B X^T + M Ω^T + E
(direct effect + cell composition effect + error)
70. Compared to gold standards, this method has relatively high precision.
Accomando et al. Genome Biology (2014)
(Figure: correlation between cell proportions estimated by DNA methylation and proportions quantified by established methods among whole blood samples from disease-free human donors.)
Correcting for Cellular Heterogeneity in Blood
71. Houseman et al. BMC Bioinformatics (2012)
● Coefficients estimated using 27k data do not seem to work well for 450k data
○ The Houseman approach can be rebuilt using a cell-sorted 450k dataset: Reinius LE et al. PLoS ONE (2012), with cell populations isolated by magnetic-activated cell sorting and purified using specific antibodies
● Can be implemented using estimateCellCounts() from the minfi package (see the sketch below)
Correcting for Cellular Heterogeneity in Blood
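A minimal sketch; it assumes an RGChannelSet named rgSet, and estimateCellCounts() uses the FlowSorted.Blood.450k reference package behind the scenes:

library(minfi)
library(FlowSorted.Blood.450k)  # Reinius et al. sorted-cell reference data
cellCounts <- estimateCellCounts(rgSet)  # per-sample proportions of CD8T, CD4T, NK, Bcell, Mono, Gran
## the estimated proportions can then be included as covariates in downstream models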
72. (Figure: predicted cell proportions for the Reinius (2012) samples.)
Correcting for Cellular Heterogeneity in Blood
73. Jaffe and Irizarry, Genome Biology (2014)
● Proportions of cell types estimated using DNA methylation
● Shifts in composition associated with aging are visible
○ Not (usually) the variation that we are interested in describing
● Age is a confounder in many epidemiology studies
○ so its impact on cellular heterogeneity must be considered
Cellular Heterogeneity and Aging
74. Other tissues also contain a mixture of cells
● To correct for this, one must either microdissect all samples or create a reference microdissected dataset with which to rebuild the Houseman approach
This has motivated the creation of reference-free methods:
● Houseman et al. Bioinformatics (2014)
○ Uses a modified ISVA to identify latent components of the observed methylation variation, assumed to capture differences in cell distributions
● Zou et al. Nature Methods (2014)
○ Finds the simplest combination of principal components with a linear mixed model that controls test inflation
Cellular Heterogeneity in Other Tissues
75. Houseman et al. Bioinformatics (2014)
● Similar to ISVA and SVA, except it makes an additional biological mixture assumption
○ Dependence of the latent structure of the error on the unknown, cell-specific methylation matrix

library(RefFreeEWAS)
## example: 1000 loci, 8 latent dimensions
test <- RefFreeEwasModel(betas(methylumidata.bgcorr)[1:1000, ], mod, 8)
testBoot <- BootRefFreeEwasModel(test, 500)  # 500 bootstrap datasets

## bootstrap SEs and p-values for the Bstar coefficients
BstarSE <- apply(testBoot[, 2, "B*", ], 1, sd)
BstarT <- test$Bstar[, 2] / BstarSE
BstarP <- 2 * pnorm(-abs(BstarT))

## bootstrap SEs and p-values for the Beta (cell-mixture-adjusted) coefficients
BetaSE <- apply(testBoot[, 2, "B", ], 1, sd)
BetaT <- test$Beta[, 2] / BetaSE
BetaP <- 2 * pnorm(-abs(BetaT))

Cellular Heterogeneity in Other Tissues
76. Houseman et al. Bioinformatics (2014)
● The reference-free approach applied to TCGA samples reduces the range of effect sizes and attenuates the significance of 1000 sample loci
○ What variation is this capturing?
Cellular Heterogeneity in Other Tissues
77. Zou et al. Nature Methods (2014)
● Factored spectrally transformed linear mixed model, 'EWASher' (FaST-LMM-EWASher)
● A reference-free approach (does not estimate cell type composition)
● Computes the methylome similarity between each pair of samples to obtain a covariance matrix, which is used in the linear mixed model as an implicit proxy for cell-type composition, in conjunction with principal components
● No issue of method portability, as there is no reference set; but if the number of true associations is large, there is a reduction in power
Cellular Heterogeneity in Other Tissues
78. Zou et al. Nature Methods (2014)
An important note from co-author Martin Aryee on PubMed Commons:
"EWASher (Zou J, 2014) is intended to be used in EWAS settings where the primary interest is in identifying localized differentially methylated regions (i.e. DMRs that affect only a small fraction of methylation sites).* The results of EWASher should be interpreted with caution in settings where large-scale methylation changes are expected and/or of interest. The method assumes that large-scale changes are caused by cell type composition effects and will effectively remove these changes from consideration. This is useful in many EWAS settings, but the assumption may not hold when studying cancer or differences between tissues. In the cancer dataset used in our paper, for example, we specifically identify site-specific changes that are above and beyond global hypomethylation changes."
*my bolding
Cellular Heterogeneity in Other Tissues
79. ● 354 rheumatoid arthritis cases and 312 controls across 103,638 loci
○ (a) QQ plot from the unadjusted model
○ (b) QQ plot with cell-type composition covariates (Houseman) included in the model
○ (c) QQ plot using EWASher
Cellular Heterogeneity in Other Tissues
80. ● Cellular heterogeneity is a particular issue for cohort studies, where cell counts are unknown
● Heterogeneity can bias results
● Cell-count-related hits may not be of interest
● 'Best practice' is not yet established
● A consideration for reference-free approaches: they assume the major determinant of variation is cell composition
○ this may not be true, or may not be true for all tissues
Summary