This presentation covers one of my main undergraduate projects, which focused on identifying the type of bacteria attacking a tomato plant and the molecular candidates the plant uses to defend itself. The aim is not only to demonstrate the kinds of data analysis performed in molecular biology, but also to facilitate knowledge translation between a highly specialized field of life science research and the general public and/or stakeholders who find value in it.
Techniques used: database construction, data cleansing, correlation analysis, significance analysis of microarrays (SAM), hierarchical clustering, and a bunch of nerdy molecular biology/biochemistry laboratory techniques (shotgun cloning & sequencing, PCR gel electrophoresis, MOPS RNA gel electrophoresis, 2D PAGE, etc.)
Tools and software used: MS Excel, MS Access, Cluster (ClusterTreeview), Sequencher, NCBI ORF Finder & GeneMark, NCBI BLAST, Mascot, UniProt
3. Shotgun cloning • PCR gel electrophoresis
*Arbitrary representation
from P. syringae
Chain Termination Method
4. Sequencher: piece the sequenced fragments back together; coverage is highest where the two opposite
strands overlap
NCBI ORF Finder & GeneMark:
identify the protein-coding
genes from repositories based
on the hits with the highest
scores
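To make the idea of finding protein-coding regions concrete, here is a minimal sketch of an open reading frame (ORF) scan over all six reading frames. It is a simplified stand-in, not how NCBI ORF Finder or GeneMark work internally; the function names, the `min_codons` parameter, and the example sequence are assumptions made up for illustration.

```python
# Simplified ORF scan: look for ATG ... stop codon in each of the six frames.
# All names and the toy sequence below are hypothetical, for illustration only.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (uppercase ACGT only)."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Return (strand, frame, start, end, orf) tuples.

    start/end are positions on the strand that was scanned; the ORF includes
    its stop codon, and min_codons counts the stop codon as well.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon opens a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None  # close the candidate and keep scanning
    return orfs


for orf in find_orfs("CCATGAAATTTTAGGG"):
    print(orf)
```

In practice the candidate ORFs would then be compared against sequence repositories, and the top-scoring hits point to the likely organism.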
Suspect: Pseudomonas syringae!
*Arbitrary representation
6. Control Tomato plant types
*RNA: transcripts of DNA for protein
synthesis
*Illustration of principle only; an Affymetrix
GeneChip was used instead: finer probes
and better control over signal/noise distinction
Tomato plant types:
i) Negative control (no plant)
ii) Normal (healthy)
iii) Mutant (no immune response)
iv) Hypersensitive (stronger immune
response)
7. Preprocessing: cleansed 12 files (~25,000 rows per file) to preserve
probe set name, signal (float), and detection (nominal) across 4
treatments
Merging 3 sets of data:
i) 3 replicates of each treatment
ii) all 4 treatment samples
iii) annotation dictionary of probe set names linked by unique
identifiers
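The merge described above was done in MS Excel and MS Access; the sketch below only illustrates the same logic in Python. The probe set names, signal values, annotation text, and the `merge_treatments` function are all hypothetical, and only two treatments are shown to keep it short.

```python
from statistics import mean


def merge_treatments(replicates, annotation):
    """Average replicate signals per treatment and attach annotations.

    replicates: {treatment: [{probe_set: signal}, ...]}  (one dict per file)
    annotation: {probe_set: description}
    """
    files = [rep for reps in replicates.values() for rep in reps]
    # Keep only probe sets present in every file so the rows line up.
    shared = set(files[0]).intersection(*files[1:])
    table = {}
    for probe in sorted(shared):
        row = {"annotation": annotation.get(probe, "unannotated")}
        for treatment, reps in replicates.items():
            row[treatment] = mean(rep[probe] for rep in reps)
        table[probe] = row
    return table


# Made-up values for illustration only.
replicates = {
    "negcontrol": [{"probe_A": 1.0, "probe_B": 2.0},
                   {"probe_A": 1.0, "probe_B": 2.0},
                   {"probe_A": 1.0, "probe_B": 2.0}],
    "normal": [{"probe_A": 10.0, "probe_B": 2.0},
               {"probe_A": 12.0, "probe_B": 2.0},
               {"probe_A": 14.0, "probe_B": 2.0, "probe_C": 5.0}],
}
annotation = {"probe_A": "hypothetical receptor gene"}

for probe, row in merge_treatments(replicates, annotation).items():
    print(probe, row)
```

Probe sets missing from any file (here `probe_C`) are dropped, which is one possible cleansing rule; keeping them with a missing-value marker would be an equally reasonable choice.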
Scatterplot: signals from triplicates averaged, ratio of treatments over
negative control & log transformed, 3 treatments plotted against each
other
Pearson’s correlation coefficient: normal vs. hypersensitive most
correlated (0.86), normal vs. mutant least correlated (0.57)
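The ratio, log transform, and correlation steps can be sketched in a few lines of Python. The slides do not state the log base, so base 2 is an assumption here, and the signal values are invented for illustration; they do not reproduce the 0.86 or 0.57 reported above.

```python
import math


def log2_ratios(treatment, control):
    """Log2 of treatment signal over negative-control signal, gene by gene."""
    return [math.log2(t / c) for t, c in zip(treatment, control)]


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Made-up triplicate-averaged signals for four genes (illustration only).
negcontrol = [2.0, 2.0, 4.0, 1.0]
normal = [8.0, 2.0, 4.0, 3.0]
hypersensitive = [16.0, 2.0, 5.0, 4.0]

normal_lr = log2_ratios(normal, negcontrol)
hyper_lr = log2_ratios(hypersensitive, negcontrol)
print(round(pearson(normal_lr, hyper_lr), 3))
```

Taking the log of the ratio makes a doubling and a halving of expression equally far from zero, which is why the scatterplots are drawn on log-transformed values.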
8. Significance analysis of microarrays
(SAM): uses the False Discovery Rate to select
genes with significant expression
differences between 2 treatment conditions
• Hierarchical Clustering: genes
selected based on SAM were hierarchically
clustered to identify the group of genes
with similar expression changes (red:
more, green: less)
*Arbitrary representation
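As a rough illustration of hierarchical clustering, the sketch below repeatedly merges the two closest groups of genes until the requested number of clusters remains. It assumes Euclidean distance and average linkage, which may differ from the settings used in Cluster for this project; the gene names and log-ratio profiles are invented.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(c1, c2, profiles):
    """Mean pairwise distance between the genes of two clusters."""
    dists = [euclidean(profiles[g1], profiles[g2]) for g1 in c1 for g2 in c2]
    return sum(dists) / len(dists)


def cluster(profiles, n_clusters):
    """Agglomerative clustering: merge the closest pair until n_clusters remain."""
    clusters = [(gene,) for gene in sorted(profiles)]
    while len(clusters) > n_clusters:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(pair[0], pair[1], profiles))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(tuple(sorted(a + b)))
    return sorted(clusters)


# Made-up log-ratio profiles (two treatments per gene), for illustration only.
profiles = {
    "gene_up_1": [2.0, 2.1],
    "gene_up_2": [1.9, 2.2],
    "gene_down_1": [-1.5, -1.4],
    "gene_down_2": [-1.6, -1.5],
}
print(cluster(profiles, 2))
```

Recording the order of the merges, rather than stopping at a fixed number of clusters, is what produces the tree (dendrogram) drawn next to the red/green heat map.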
9. Increasing the amount of immunological
receptors (TIR proteins) to trigger immune
response
[Figure panel labels: Negative control, Normal, Mutant, Hypersensitive]
• Switching from energy generation mode to
energy expenditure mode in the face of
stress (more UCP-1, less CcdA)
SAM identifies statistically significant genes by carrying out gene-specific t-tests and computing a statistic d_j for each gene j, which measures the strength of the relationship between gene expression and a response variable.[1][2][3] The analysis uses non-parametric statistics, since the data may not follow a normal distribution. The response variable describes and groups the data based on experimental conditions. Repeated permutations of the data (shuffling which samples belong to which condition) are used to determine whether the expression of any gene is significantly related to the response. Permutation-based analysis accounts for correlations between genes and avoids parametric assumptions about the distribution of individual genes. This is an advantage over other techniques (e.g., ANOVA and Bonferroni), which assume equal variance and/or independence of genes.[4]
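The sketch below illustrates the two ideas in that description: a per-gene statistic d_j (difference in group means divided by a pooled standard error plus a small constant s0) and a permutation-based estimate of the false discovery rate. It is a simplified illustration and not the full SAM procedure: s0 is fixed by hand rather than estimated from the data, a single cutoff on |d_j| replaces SAM's delta-based rule, and the gene names and expression values are invented.

```python
import math
import random


def d_statistic(group1, group2, s0=0.1):
    """SAM-style statistic: mean difference over (pooled standard error + s0)."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = sum(group1) / n1, sum(group2) / n2
    ss = sum((x - m1) ** 2 for x in group1) + sum((x - m2) ** 2 for x in group2)
    s = math.sqrt((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2))
    return (m1 - m2) / (s + s0)


def permutation_fdr(expr, n1, threshold, n_perm=200, seed=0):
    """Call genes with |d| >= threshold and estimate the FDR by permutation.

    expr: {gene: [values]}, where the first n1 values belong to condition 1.
    """
    rng = random.Random(seed)
    observed = {g: d_statistic(v[:n1], v[n1:]) for g, v in expr.items()}
    called = [g for g, d in observed.items() if abs(d) >= threshold]
    n_cols = len(next(iter(expr.values())))
    false_calls = 0
    for _ in range(n_perm):
        order = list(range(n_cols))
        rng.shuffle(order)  # one relabeling shared by all genes keeps gene-gene correlation
        for values in expr.values():
            permuted = [values[i] for i in order]
            if abs(d_statistic(permuted[:n1], permuted[n1:])) >= threshold:
                false_calls += 1
    # Average number of calls under shuffled labels, relative to real calls.
    fdr = (false_calls / n_perm) / len(called) if called else 0.0
    return called, min(fdr, 1.0)


# Made-up expression values: 3 replicates per condition (illustration only).
expr = {
    "gene_changed": [5.0, 5.1, 4.9, 1.0, 1.1, 0.9],
    "gene_flat": [2.0, 2.1, 1.9, 2.1, 2.0, 1.9],
}
called, fdr = permutation_fdr(expr, n1=3, threshold=5.0)
print(called, round(fdr, 3))
```

With only three replicates per condition there are few distinct relabelings, so the estimated FDR is coarse; this is one reason the statistic is pooled over thousands of genes in a real microarray analysis.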
> More UCP-1: turn off ATP (energy currency) production
> Less CcdA: slow down mitochondrial (powerhouse) activity