Catalyzing Plant Science Research with RNA-seq

Manjappa
Ph. D. Scholar
Dept. of Genetics & Plant Breeding
UAS, GKVK, Bengaluru, India

Central dogma of molecular biology
Transcriptome
(mRNA, rRNA, tRNA, and
other non-coding RNA)

Why study the transcriptome?
It reflects the genes that are being actively expressed at any
given time (expression profiling)
The expression level of mRNAs in a given cell population varies
It shows how an organism adapts to developmental cues and
environmental fluctuations

Aim of transcriptomics
Quantify the changing expression levels of each transcript
during development and under different conditions
Catalogue all species of transcript (mRNAs, non-coding
RNAs & small RNAs)
Determine the transcriptional structure of genes, in terms
of their start sites, 5′ and 3′ ends, splicing patterns and
other post-transcriptional modifications

Limitations of microarray techniques
Reliance upon existing knowledge of the genome sequence
High background levels owing to cross-hybridization
Limited dynamic range of detection owing to background &
saturation of signals
Comparing expression levels across different experiments is often
difficult & can require complicated normalization methods
Sanger sequencing of cDNA or EST libraries:
- Relatively low throughput, expensive & generally not quantitative
Tag-based methods (SAGE, CAGE & MPSS):
- High throughput & precise, ‘digital’ gene expression levels
- Most are based on expensive Sanger sequencing technology, & a
significant portion of the short tags cannot be uniquely mapped
to the reference genome
- Only a portion of the transcript is analyzed and isoforms are
generally indistinguishable from each other

RNA-sequencing
Next generation sequencing (NGS)
Sample preparation
Data analysis:
- Mapping reads
- Visualization (Gbrowser)
- De novo assembly
- Quantification
(Wang et al., RNA-Seq: a revolutionary tool for transcriptomics, Nat. Rev. Genetics 10, 57-63, 2009)

RNA-seq vs. microarray
• RNA-seq can be used to characterize novel transcripts and splicing
variants as well as to profile the expression levels of known
transcripts (but hybridization-based techniques are limited to detect
transcripts corresponding to known genomic sequences)
• Detects a large dynamic range of expression levels (9,000-fold)
compared with microarrays (one hundred to a few hundred fold)
• RNA-seq has higher resolution than whole genome tiling array
analysis
• In principle, RNA-seq can achieve single-base resolution, whereas
the resolution of a tiling array depends on the density of probes
• High levels of reproducibility, for both technical and biological
replicates
• RNA-seq can apply the same experimental protocol to various
purposes, whereas specialized arrays need to be designed in these
cases
• Detecting SNPs (needs SNP array otherwise)
• Mapping exon junctions (needs junction array otherwise)
• Detecting gene fusions (needs gene fusion array otherwise)

RNA-seq and microarray agree fairly well only for
genes with medium levels of expression
Saccharomyces cerevisiae cells grown in nutrient-rich media. Correlation is very low
for genes with either low or high expression levels.

Advantages of RNA-Seq compared with other
transcriptomics methods

RNA-seq helps to look at
Alternatively spliced gene transcripts
Post-transcriptional modifications
Gene fusions
Mutations/SNPs
Changes in gene expression
• Used to determine exon/intron boundaries
• Verifies or amends previously annotated 5′ and 3′ gene
boundaries
• Also includes miRNA, tRNA, and rRNA profiling

Challenges for RNA-Seq
Library construction
RNA fragmentation (RNA hydrolysis or nebulization):
coverage is depleted at the 5′ and 3′ ends of transcripts
cDNA fragmentation (DNase I treatment or sonication):
coverage is biased towards the 3′ ends of transcripts
Bioinformatic challenges
Development of efficient methods to
store, retrieve and process
large amounts of data, which
must reduce errors in image
analysis and base-calling and
remove low-quality reads

1. IMPROVING GENOME ANNOTATION WITH TRANSCRIPTOMIC DATA
• First draft of the Arabidopsis thaliana genome sequence (2000); its
annotation continues to be improved
• Large amounts of Sanger sequencing-generated EST data provided the
initial basis for gene identification and expression profiling
- EST sequencing is expensive, time consuming, inherently biased
against low-abundance transcripts & typically enriched in transcript
termini
• RNA-seq circumvents these limitations and provides accurate
resolution of splice junctions and alternative splicing events
• Arabidopsis transcriptome survey using Illumina sequencing shows
- At least 42% of intron-containing genes are alternatively
spliced (Filichkin et al., 2010)
- 61% when only multi-exonic genes are sampled
- ~48% of rice genes (Lu et al., 2010)

Contd…
• Mining RNA-seq data in search of transcription start site (TSS) variation
is improving gene structure annotation; alternative TSSs have been
detected in ∼10,000 loci in Arabidopsis and rice
(Tanaka et al., 2009)
• An ideal genome annotation would identify
- Genes that show invariant transcript sequences
- Those that exhibit alternative splicing
- Links between these events and specific spatial, temporal, developmental,
and/or environmental cues
• Abiotic stress in Arabidopsis can increase or decrease the proportions of
apparently unproductive isoforms for some key regulatory genes,
which supports the view that alternative splicing is an important
mechanism in the regulation of gene function
(Filichkin et al., 2010)

Contd…
• Polymorphism between different A. thaliana accessions averages
about one SNP per ∼200 bp
• Complete re-sequencing of the transcriptomes and
annotation of different accessions helps to interpret the
functional consequences of polymorphism
• Utilizing genomic and transcriptomic data for in silico gene
prediction results in a more reliably annotated genome,
with information on SNPs, indels, splice variants and
expression variation

Generating genomic and enabling proteomic
resources for “non-model” species
• Published plant genome sequences represent a very small fraction of plant
taxonomic diversity
• Study of “non-model” species is challenging
• De novo sequencing of the transcriptome is used to generate genetic resources
1. Eucalyptus (Mizrachi et al., 2010)
2. Garlic (Sun et al., 2012)
3. Pea (Franssen et al., 2011)
4. Chestnut (Barakat et al., 2009)
5. Chickpea (Garg et al., 2011)
6. Olive (Alagna et al., 2009)
7. Safflower (Lulin et al., 2012)
8. Japanese knotweed (Hao et al., 2011)
Gene annotation relies on identifying homologs, & ideally orthologs, in
species with an annotated genome (if no appropriate EST databases are
available)
If none is available, the A. thaliana genome sequence is used (gold standard)
Further confirmation: interrogating additional plant databases
Annotation with a pre-existing EST database, e.g. melon (Dai et al., 2011)

Contd…
• De novo RNA-seq to identify genetic polymorphisms
(molecular breeding): multiple cultivars or closely
related species with variation in traits of interest are
sequenced and genetic variation is identified
- Allows generation of molecular markers to facilitate
progeny selection and molecular genetics research
- Ex: 12,000 SSRs in a single RNA-seq analysis of sesame
(earlier only 80 SSRs), on average 1 genic SSR per ∼8 kb
(Zhang et al., 2012)
- 5,234 SNPs in the transcriptomes of five winter rye inbred
lines, used in a high-throughput SNP genotyping array
(Haseneyer et al., 2011)
• Comparative sequence analysis of radish RNA-seq data and
the Brassica rapa genome sequence led to the discovery of
14,641 SSRs (Wang et al., 2012)
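SSR counts like those above come from scanning assembled transcripts for short tandem repeats. A minimal Python sketch of such a scan, assuming perfect repeats with 2-6 bp motifs; the `MIN_COPIES` thresholds and all function names are hypothetical, and the tools and settings of the cited studies are not given in these slides:

```python
import re

# Hypothetical thresholds: minimum number of tandem copies for each motif length (bp).
MIN_COPIES = {2: 6, 3: 5, 4: 4, 5: 4, 6: 4}

def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. not 'ATAT')."""
    n = len(motif)
    return not any(n % d == 0 and motif == motif[:d] * (n // d) for d in range(1, n))

def find_ssrs(seq):
    """Return sorted (start, motif, copies) tuples for perfect SSRs in a sequence."""
    seq = seq.upper()
    hits = []
    for motif_len, min_copies in MIN_COPIES.items():
        # Group 1 captures one motif; the backreference demands the remaining copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_copies - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_primitive(motif):
                hits.append((match.start(), motif, len(match.group(0)) // motif_len))
    return sorted(hits)

def kb_per_ssr(total_bases, ssr_count):
    """Average sequence length (kb) per SSR, as in '1 genic SSR per ~8 kb'."""
    return total_bases / ssr_count / 1000.0

# Toy transcript with one (AT)8 and one (GAA)5 repeat.
transcript = "GGC" + "AT" * 8 + "CCGTC" + "GAA" * 5 + "TT"
print(find_ssrs(transcript))
```

Non-primitive motifs are skipped so that a dinucleotide run is not reported a second time as a tetranucleotide repeat.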

RNA-seq applications to advance the field of
proteomics
• Effective proteome profiling is generally considered to
depend heavily on the availability of a high-quality DNA
reference database
• High-throughput mass spectrometry-based protein
identification relies on the availability of an extensive DNA
sequence database in order to match experimentally
determined peptide masses with the theoretical proteome
generated by computationally translating transcripts
• RNA-seq-based transcriptome profiling can provide an
effective data set for proteomic analysis of non-model
organisms

“RNA-Seq facilitates the
matching of peptide mass
spectra with cognate gene
sequence”
• To test this: quantitative
analysis of the proteomes of
pollen from domesticated
tomato (Solanum
lycopersicum) and two wild
relatives
• RNA-Seq (454
pyrosequencing); >1200
proteins were identified
No major qualitative or quantitative differences were observed in the characterized proteomes
when using either a highly curated community database of tomato sequences or the RNA-Seq database

Characterizing temporal, spatial, regulatory,
and evolutionary transcriptome landscapes
Temporal transcriptome
• RNA-seq is increasingly being adopted to examine
transcriptional dynamics
• An analysis of the transcriptome of grape berries during three
stages of development identified >6,500 genes that were expressed
in a stage-specific manner (Zenoni et al., 2010)
• Radish: >21,000 genes were differentially expressed at two
developmental stages of roots, including genes strongly
linking root development with starch and sucrose
metabolism and with phenylpropanoid biosynthesis
(Wang et al., 2012)

Objective: To understand the molecular mechanisms underlying tuberous root
formation and development.
• Radish (R. sativus) cultivar ‘Weixianqing’
• Samples: hypocotyl (1 cm, 7 DAS) & true root (1 cm, 20 DAS; RLSS, the stage of cortex splitting);
10 seedlings of each were pooled together
• Illumina paired-end sequencing technology, GAII platform (BGI; Shenzhen, China)
• Gene annotation: comparative genome analysis between radish and Brassica rapa.
Unigenes were aligned with sequences in the NCBI non-redundant protein (Nr) database,
Swiss-Prot protein database, Kyoto Encyclopedia of Genes and Genomes (KEGG)
pathway database & Cluster of Orthologous Groups (COG) database using BLASTx
• Annotation using the Blast2GO program

Functional annotation of all non-redundant unigenes
• Sequence similarity search was conducted against the NCBI
Nr (85.51%), Nt (90.18%) and Swiss-Prot protein (54%) databases
using the BLASTx algorithm
• 21,109 unigenes were assigned GO terms
Gene Ontology classification of assembled unigenes
(9,271; 43.92%)

Transcript differences between RESS and RLSS
13,453
8,389
To understand the functions of the
DEGs, all DEGs were mapped to
terms in the KEGG database;
29 pathways were
significantly enriched, covering the
metabolism of carbohydrates, energy, lipids,
amino acids, other amino acids, and
terpenoids and polyketides, and the
biosynthesis of other secondary metabolites
Starch and sucrose metabolism
(303) and phenylpropanoid
biosynthesis (177) were the two
predominant groups
20 (starch and sucrose metabolism) and 25 (phenylpropanoid
biosynthesis) unigenes were significantly up-regulated and play critical roles in
regulating radish tuberous root formation. This also confirms the finding that radish
root is rich in carbohydrates and phenolic compounds.
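Whether DEGs are "significantly enriched" in a KEGG pathway is commonly judged with a hypergeometric (one-sided) test. The slides do not say which test the radish study used, so the following is only a sketch under that assumption; the function name and the toy numbers are illustrative, not values from the study:

```python
from math import comb

def enrichment_p_value(total_genes, pathway_genes, deg_total, deg_in_pathway):
    """P(X >= deg_in_pathway) when deg_total genes are drawn without replacement
    from total_genes annotated genes, pathway_genes of which belong to the pathway."""
    upper = min(pathway_genes, deg_total)
    tail = sum(
        comb(pathway_genes, k) * comb(total_genes - pathway_genes, deg_total - k)
        for k in range(deg_in_pathway, upper + 1)
    )
    return tail / comb(total_genes, deg_total)

# Toy example: 20 annotated genes, 5 in the pathway, 6 DEGs, 4 of them in the pathway.
p = enrichment_p_value(total_genes=20, pathway_genes=5, deg_total=6, deg_in_pathway=4)
print(f"p = {p:.4f}")
```

With many pathways tested at once, the resulting p-values would normally be corrected for multiple testing before calling a pathway enriched.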

• Previous gene expression studies used EST sequencing, spotted
microarrays & Affymetrix GeneChip technology (based on prior sequence knowledge)
• These provide only a fragmented picture of transcript accumulation patterns
• RNA-seq was applied to 7 tissues (leaf, flower, pod, two stages of pod-shell, root,
nodule) & 7 seed development stages of a BC5F5 G. max plant
• Transcript reads were compared with the recent genome sequence (assembly
Glyma1.01)
• Potential model for future RNA-Seq atlases

Mapping of short-read sequences:
• Illumina Genome Analyzer II: produced 5.8-8.9 million 36-bp reads
for the 7 non-seed tissues & 2.7-9.6 million 36-bp reads for the seed tissues
• The alignment program GSNAP was used to map the reads to two
reference genomes: G. max and Bradyrhizobium japonicum
• Digital gene expression analysis: 46,430 genes identified as “high
confidence” (supported by full-length cDNAs, ESTs, homology, &
ab initio methods)
• Of these, 41,975 (90.4%) genes were transcriptionally active
Expression and gene structure
Coding regions of transcriptionally inactive genes were smaller and had a lower GC content

Tissue-specific analysis of the soybean transcriptome
Hierarchical clustering of transcriptional
profiles in 14 tissues
Clusters: early seed development stages, late seed development stages,
aerial tissues, underground tissues
Relative expression levels based on Z-score
analysis (Z-scores of 3.4-3.6 indicate greater tissue specificity)
Z = (X - μ)/sd
where X is the expression of a gene in one tissue, and μ and sd are the
mean and standard deviation of its expression across all tissues
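The Z-score formula above can be sketched in Python as follows. This is an illustration only: the function name and the toy expression values are assumptions, and the population standard deviation is used because the slides do not say which form the study applied.

```python
from statistics import mean, pstdev

def tissue_z_scores(expression):
    """Map tissue -> Z = (X - mu) / sd for one gene's expression across tissues."""
    values = list(expression.values())
    mu = mean(values)
    sd = pstdev(values)  # population SD; a sample SD would give smaller Z-scores
    if sd == 0:
        # A gene with identical expression everywhere has no tissue specificity.
        return {tissue: 0.0 for tissue in expression}
    return {tissue: (x - mu) / sd for tissue, x in expression.items()}

# Toy gene expressed in only one of 14 tissues (values are illustrative).
gene = {f"tissue_{i}": 0.0 for i in range(1, 14)}
gene["nodule"] = 50.0
z = tissue_z_scores(gene)
print(round(z["nodule"], 3))
```

For a gene detected in a single tissue out of 14, this Z-score equals the square root of 13 (about 3.6), which is why values in the 3.4-3.6 range mark the most tissue-specific genes.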

Heatmap of the legume-specific genes
Heatmap of the top 500 most highly expressed genes
Some legume-specific genes
have tissue-specific transcription
Glyma06g08290
Glyma04g08220 (Oleosin)
Glyma02g01590 (lectin precursor 1)

General trends in expression profiles for all genes: tissue-by-tissue comparison (Fisher's exact test)
Higher transcriptional level
Importance: understand gene functions and the molecular processes that occur between two stages.
GOslim analysis: between seed 25 DAF & seed 28 DAF, and between seed 35 & 42 DAF, nutrient reservoir
activity & urease activity were stably expressed; these are important activities in seed development.

Gene structure and tissue-specific gene expression
• Genes expressed in underground tissues have a larger first exon; those in
aerial tissues have a higher number of exons
• Significant differences in total transcribed length among tissues,
due to varying intron length
• No significant relationship between GC content and tissue
specificity

Boxplot and dendrogram of preferentially expressed genes in seed development
RPKM-normalized, log2-transformed
gene expression profiles
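RPKM (reads per kilobase of transcript per million mapped reads) normalizes a raw read count by gene length and sequencing depth. A minimal Python sketch follows; the function names are illustrative, and the pseudocount of 1 added before the log2 transform is an assumption (the slides do not state how zero counts were handled):

```python
from math import log2

def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

def log2_rpkm(read_count, gene_length_bp, total_mapped_reads, pseudocount=1.0):
    """log2-transformed RPKM; the pseudocount keeps zero counts finite (assumption)."""
    return log2(rpkm(read_count, gene_length_bp, total_mapped_reads) + pseudocount)

# 500 reads on a 2 kb gene in a library of 5 million mapped reads.
print(rpkm(500, 2000, 5_000_000))       # 50.0
print(log2_rpkm(3, 1000, 1_000_000))    # log2(3 + 1) = 2.0
```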

Summary
• RNA Seq-Atlas provides
• A record of high-resolution gene expression in a set of
14 diverse tissues
• Hierarchical clustering of transcriptional profiles for 14
tissues
• Relationship between gene structure and gene
expression
• Tissue-specific gene expression of both the most highly-
expressed genes and the genes specific to legumes in
seed development and nodule tissues
• A means of evaluating existing gene model annotations
for the Glycine max genome

Spatial transcriptome
• Most RNA-seq analyses target whole organs, or sets of
organs, which inherently prevents the identification of cell
or tissue type-specific transcripts, and thus of spatially coordinated
structural and regulatory gene networks
• RNA-seq analysis of discrete tissues or cell types provides spatial
information and increases the depth of sequence coverage
Ex: >1000 genes are specifically or preferentially
expressed in Arabidopsis male meiocytes
• Acquiring tissue or cell-specific samples with any degree of
precision and minimal contamination is often technically
difficult

Methods of isolation of single cells

Contd…
• Matas et al. (2011): laser capture microdissection (LCM) + RNA-seq (454 pyrosequencing) of the
transcriptomes of 5 principal tissues of the developing
tomato fruit pericarp
- ~21,000 unigenes identified; more than half (57%) showed
ubiquitous expression, while others showed cell
type-specific expression
• Takacs et al. (2012): LCM + RNA-seq (Illumina-based NGS)
study of the ontogeny of the maize shoot apical meristem
- 59% of genes were expressed ubiquitously
• A number of mammalian tissues have also shown a high
proportion of ubiquitously expressed transcripts
“These studies may indicate that this is a common feature of
eukaryotes” (Ramsköld et al., 2009)

To study plant responses and adaptations to abiotic
and biotic stresses
Aim: elucidate genes and gene networks that contribute to
sorghum’s tolerance to water-limiting environments, with a long-
term aim of developing strategies to improve plant productivity
under drought
- Discovered >50 previously unknown drought-responsive genes

Transcript analysis in response to ABA and osmotic stress
Method: treatment on the 8th day with 20 μM ABA or 57.1 μM (20%) PEG-8000
DE genes    Up        Down
ABA         ~2,300    ~2,600
PEG         ~1,650    ~700
ABA acts in the plant stress response and has a central role
in other pathways (dormancy in leaf & seed)
Genes highlighted: LEA protein, WSI18 protein, dehydrin,
sugar substrate transporter, peroxidase 6

Networks of hormone pathways in ABA-treated plants
Panels: shoots and roots
A, brassinosteroid biosynthesis
B, cytokinins degradation
C, cytokinins glucoside biosynthesis
D, ent-kaurene biosynthesis
E, ethylene biosynthesis from methionine
F, gibberellin biosynthesis
G, gibberellin inactivation
H, IAA conjugate biosynthesis
I, jasmonic acid
Legend: box = hormone-related; circle = non-hormone-related; colours mark down- and up-regulated DE genes
Lines: dark blue solid = ≥10; blue long-dashed = 6-9; light blue short-dashed = ≤5
- Only the brassinosteroid and JA biosynthesis pathways, and the cytokinin glucoside and IAA
conjugate biosynthesis pathways, are directly connected via DE genes
- Indirect ‘cross-talk’ between the various hormones in response to osmotic stress and ABA

Determining the genes of unknown
function that respond to drought or
ABA treatment across species
Decision tree used to determine which
genes and their orthologs were regulated by
drought/ABA across different species
Overlap of drought-responsive sorghum
genes of unknown function that had drought-
responsive orthologs of unknown function in
other species

• RNA-seq was used to reveal massive changes in the
metabolism and cellular physiology of the green
alga Chlamydomonas reinhardtii when the cells
become deprived of sulfur
• Studies of plant responses to pathogens
Ex: sorghum and Bipolaris sorghicola
(Mizuno et al., 2012)
• Complexities of the metabolic pathways associated
with plant defense mechanisms

Study plant evolution and polyploidy
• A comparison of the leaf transcriptome of an allopolyploid relative of
soybean with those of the two species that contributed its homoeologous genomes
allowed the determination of the contribution of the different genomes
to the transcriptome (Ilut et al., 2012)
• Maize endosperm transcriptome analysis discovered 179 imprinted
genes and 38 imprinted long ncRNAs (Zhang et al., 2011)
• Transcriptomes of 9 distinct tissues of three species of the Poaceae
family (Brachypodium, sorghum & rice) were compared to determine whether
orthologous genes from these three species exhibit the same expression
patterns (Davidson et al., 2012)
- Only a fraction of orthologous genes exhibit conserved expression
patterns
- Orthologs in syntenic genomic blocks are more likely to share
correlated expression patterns than non-syntenic
orthologs
- These findings are important for crop improvement (seq transfer)

Classification of Brachypodium, rice, and
sorghum genes into orthologous groups (OrthoMCL)
Red: single copy (2 x 2 & 3 x 3)
Black: multicopy (2 x N & 3 x N)
Hierarchical clustering of 27 tissues (9 tissues x 3
species) based on correlations of log2 FPKM
expression values of 3-taxa single-copy (3 x 3) genes
- Clustering of corresponding tissues
- Extensive expression divergence within 3 x 3 genes

Co-expression analyses identify conservation
of expression among orthologs and paralogs
Genes within each k-means co-expression
cluster were categorized based on OrthoMCL
category assignments or as lineage-specific
single-copy (1 x 1) genes.
Proportions of genes with at least one corresponding paralog or ortholog in the same cluster
A portion of Poaceae orthologs and paralogs share the same
expression patterns across reproductive tissues
(similar expression pattern in Poaceae)
Some genes exhibited different expression phenotypes

Which biological processes were over-represented in each
ortholog/paralog category?
Ortholog/paralog category: Gene Ontology (GO) annotation
• 2 x N genes: stress-related functions (‘response to biotic stimulus’, ‘defense response’,
‘apoptosis’), lipid transport, secretion (‘exocytosis’), and general
oxidation–reduction reactions
• 3 x N genes (higher substitution rates): core metabolic functions (‘translation’,
‘ATP biosynthesis’, ‘nucleosome assembly’ and ‘biosynthetic process’), plus
oxidation–reduction, response to wounding, and sexual reproduction
• 3 x 3 genes: essential functions: ‘regulation of transcription’ (>1000 genes),
‘protein folding’ (253 genes), ‘intracellular protein transport’
(123 genes), and ‘glycolysis’ (91 genes)
• 2 x 2 genes: ‘protein amino acid phosphorylation’, ‘regulation of transcription’ &
‘response to oxidative stress’

Relationship between synteny and expression patterns of orthologs
Syntenic gene pairs within collinear blocks of at
least five genes were identified for all pairwise
combinations of the three Poaceae species
Distributions of Pearson’s correlation coefficients (PCC)
Synteny plays a significant role in the
evolution of gene expression, especially
in the case of duplicate and multicopy
genes
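The Pearson's correlation coefficient (PCC) between the expression profiles of two orthologs can be sketched in Python as below. The function name and the nine toy expression values per gene (one per tissue) are illustrative assumptions, not data from the study:

```python
from math import sqrt

def pearson_cc(x, y):
    """Pearson's correlation coefficient between two equal-length expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2 values)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("PCC is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)

# Toy log2 expression values of one ortholog pair across 9 tissues.
rice_gene = [1.2, 3.4, 5.1, 0.8, 2.2, 6.0, 4.3, 1.9, 3.3]
sorghum_gene = [1.0, 3.1, 5.5, 1.1, 2.0, 5.7, 4.8, 2.2, 3.0]
print(round(pearson_cc(rice_gene, sorghum_gene), 3))
```

Comparing the distribution of such coefficients for syntenic and non-syntenic ortholog pairs is the kind of analysis the slide summarizes.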

Identifying and characterizing novel non-coding RNAs
• In silico analysis provides a rapid way to identify putative small RNA (sRNA)
genes
• RNA-seq technology represents an excellent means for sRNA
discovery and validation
• Characterization of miRNA regulatory functions is facilitated
by determining tissue-specific expression patterns
• RNA-seq was used to identify sRNAs from five Arabidopsis root
tissues
- Some sRNAs were expressed in all 5 tissues while others were
tissue and developmental zone specific
• The frequency of alternative splicing at miRNA binding sites is
significantly higher than that at other regions, suggesting that
alternative splicing is a significant regulatory mechanism
• sRNAs have recently been characterized in the context of
their association with epigenome modifications, including cytosine
methylation of genomic DNA

From co-expression networks to integrative data
analysis
• Sequencing whole transcriptomes provides a high degree of detail,
but deriving useful biological information from a long list of
expressed genes is typically not trivial
• One approach is to construct networks of co-expressed genes and to use gene ontology
(GO) information to help highlight important gene candidates as
critical components of functional networks
• Gene ontology enrichment analysis of RNA-seq data often illustrates
the complexity of interacting pathways
Robust functional networks: integrate the transcriptome (RNA-seq) with
proteomics and metabolomics
- No correlation, ex: soybean protein
- Correlation, ex: oil palm mesocarp fatty acid

Bulked Segregant RNA-Seq (BSR-Seq)
Liu et al. (2012)
BSA requires polymorphic markers
Linkage disequilibrium between markers and the
causal gene is determined by quantifying the allelic
frequencies between two samples
Figure: SNP 2 is closely linked to the mutation to be mapped
Advantages:
(i) Having a reference genome is not
a prerequisite
(ii) Markers can be generated from
the experimental data
(iii) Differential expression profiles
(iv) Information on the effects of the mutant on
global patterns of gene
expression
(v) Provides the map position of a gene

>64,000 SNPs
- The two alleles of a given SNP site should be
detected in approximately equal numbers of
RNA-Seq reads when considering both pools of
RNA-Seq data
- Only one allele of a SNP that is completely
linked to the causal gene should be present
among the RNA-Seq reads from the mutant
pool
- In practice, as a consequence of allele-specific
expression and of sampling bias in genes expressed
at low levels, only a single allele of many SNPs is
detected in the mutant pool
- An empirical Bayesian approach was used to estimate
the linkage probability, i.e. the probability of a SNP
exhibiting complete linkage disequilibrium
with the causal gene
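The allele-count reasoning above can be illustrated with a simple filter in Python. This is a deliberately simplified sketch, not the empirical Bayesian model of the cited study: the dictionary keys, the function name and the `min_depth` threshold are all hypothetical, and the depth cut-off is only a crude guard against the sampling bias noted for lowly expressed genes.

```python
def candidate_linked_snps(snps, min_depth=10):
    """Return ids of SNPs for which the mutant pool shows a single allele.

    Each SNP is a dict with read counts for the two alleles in each pool:
    mut_ref, mut_alt (mutant pool) and wt_ref, wt_alt (wild-type pool).
    """
    hits = []
    for snp in snps:
        mutant_depth = snp["mut_ref"] + snp["mut_alt"]
        # Both alleles must be seen somewhere, otherwise the site is not polymorphic.
        both_alleles_seen = (snp["mut_ref"] + snp["wt_ref"]) > 0 and (
            snp["mut_alt"] + snp["wt_alt"]
        ) > 0
        single_allele_in_mutant = min(snp["mut_ref"], snp["mut_alt"]) == 0
        if mutant_depth >= min_depth and both_alleles_seen and single_allele_in_mutant:
            hits.append(snp["id"])
    return hits

# Toy read counts (illustrative only).
snps = [
    {"id": "snp1", "mut_ref": 30, "mut_alt": 0, "wt_ref": 12, "wt_alt": 14},   # linked
    {"id": "snp2", "mut_ref": 15, "mut_alt": 16, "wt_ref": 14, "wt_alt": 15},  # unlinked
    {"id": "snp3", "mut_ref": 3, "mut_alt": 0, "wt_ref": 2, "wt_alt": 3},      # too shallow
]
print(candidate_linked_snps(snps))
```

A hard filter like this still passes false positives caused by allele-specific expression, which is the motivation for estimating a linkage probability per SNP instead.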

gl3-ref allele in a non-B73/B73

Reference genome
The top 10 windows with the highest median
linkage probability were located at physical
position ~183.5–185.2 Mb
Fine mapping of the gene
1. Mutant gene expression will often be
down-regulated compared to the WT pool
2. Collections of SNPs tightly linked to the
mutant gene
3. SNPs linked to the mutated gene can be used
for gene cloning via chromosome walking

• It is not necessary to use tissue in which the mutant gene
is expressed for BSR-Seq
• However, if tissue with expression is collected,
additional expression data are also obtained
Resolution of mapping depends on
1. Number of individuals included in the bulks
2. Sequencing depth
3. Density of polymorphisms in the mapping population

• International multi-disciplinary consortium generating transcriptome data for 1,000 plant species
• It is a public-private partnership (PPP) project: 75% of the funding comes from the Govt. of Alberta
and 25% from Musea Ventures; BGI-Shenzhen provides sequencing at reduced cost and the
iPlant Collaborative provides computational informatics
• Objectives:
1. To resolve many of the lingering uncertainties in species relationships,
especially in the early lineages of streptophyte green algae and land plants
2. To identify gene changes associated with the major innovations in
Viridiplantae evolution, such as multicellularity, transitions from marine to
freshwater or terrestrial environments, maternal retention of zygotes and
embryos, complex life history involving haploid and diploid phases, vascular
systems, seeds and flowers
• Species selection: representatives of all major lineages across the Viridiplantae
(green plants), representing ~1 billion years of evolution, including flowering
plants, conifers, ferns, mosses and streptophyte green algae

Resources available
1. Access to raw and processed data:
Content; transcriptome assemblies, putative coding
sequences, orthogroups and gene and species trees with
related sequence alignments.
2. High performance computing and cloud-based services:
iPlant discovery environment (DE) web interface (tutorials and
teaching materials available)

Interactions & pathways
Phenylpropanoid synthesis pathway for Colchicum autumnale. Labelled rectangles are
proteins. Small circles are metabolites. Black lines show the KEGG pathway. Red lines show the
BioGRID interactions emanating from protein (K12355), which was interactively selected. A right-
click on the protein will display the inferred function and a link to the sequence(s)

Conclusion
• RNA-sequencing is now well established as a versatile
platform with applications in an ever-growing number of
fields of plant biology research
• Ongoing developments in sequencing technologies, such as
increased read lengths and greater numbers of reads per run,
together with advanced computational tools to facilitate sequence
assembly, analysis, and integration with orthogonal data
sets, will further accelerate the breadth and frequency of its
adoption by plant scientists
