1. Predicting lncRNA Transcripts Out of Comprehensive
Rat Renal Cell type-specific Transcriptome Libraries
Gui Chen
11/20/2015
2. WHY LONG NON-CODING RNA?
➤ Many long non-coding transcripts
(lncRNAs) function in a variety of
responses which include differentiation,
cell cycle, and maintenance of stem-cell-like
phenotypes, and are cell-type specific
in their expression. Yet, very little is
known about their regulation or roles in
disease states.
➤ A newly established rat renal gene
expression database and the recently
assembled rn6 genome sequence have
paved the way for us to conduct such a study.
3. WHAT EXACTLY IS THE DATA SOURCE?
➤ 110 (renal tubule segments) +
5 (glomeruli) renal cell-type-specific gene
expression profiles, produced by the work
described in the paper shown at left.
➤ 7 polyadenylated mRNA-seq (PA-seq) &
cortical collecting duct (4 control rats
and 4 water-loaded rats)
➤ 125 libraries in total
5. WHAT IS THE FORMAT OF THE DATA?
➤ The original transcript data are stored in
GTF format, a flat tab-delimited
file format that can be loaded directly
into Excel.
➤ Next is a real example of what GTF
records look like.
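For readers without the slide image, here is a minimal Python sketch of how one GTF line splits into its nine tab-delimited columns. The record below is made up for illustration (the source name and IDs are hypothetical, not taken from the rat libraries), and splitting the attribute column on ";" is a simplification that assumes no value contains a semicolon.

```python
from typing import Dict, NamedTuple


class GtfRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    score: str   # "." when absent
    strand: str  # "+", "-" or "."
    frame: str   # "0", "1", "2" or "."
    attributes: Dict[str, str]


def parse_gtf_line(line: str) -> GtfRecord:
    """Split one GTF line into its nine tab-delimited columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, got {len(cols)}")
    attributes: Dict[str, str] = {}
    # Column 9 holds 'key "value";' pairs.
    for item in cols[8].split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')
    return GtfRecord(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                     cols[5], cols[6], cols[7], attributes)


# Hypothetical record, for illustration only.
example = ('chr1\tassembler\texon\t1000\t1500\t.\t+\t.\t'
           'gene_id "gene_1"; transcript_id "tx_1"; exon_number "1";')
record = parse_gtf_line(example)
print(record.feature, record.start, record.end, record.attributes["transcript_id"])
```

Because the columns are plain tab-separated text, the same file opens in a spreadsheet with one column per field, as the slide notes.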
7. How can we pick out the transcripts that are potentially long
non-coding RNA transcripts from thousands of transcripts?
1. What are the characteristics of lncRNA from preliminary data and
experience?
➤ Less conserved than protein-coding genes. (PhyloCSF)
➤ A much shorter ORF (open reading frame) than that of protein-coding genes (they
do not necessarily have an ORF; if they do, it is short and arises by chance, or
perhaps they were originally genes?)
➤ When forcibly translated into protein, there is no counterpart in
the nr database (non-redundant protein database). (Blastx)
➤ They are consistently and significantly expressed in at least one
type of cell.
2. Extract records satisfying all the characteristics above.
A pipeline is established based on this idea.
8. Theoretically the pipeline works like this…
➤ The biggest circle represents the whole search space.
➤ The small rectangles inside the big circle represent subsets of records in the whole search space, each of which satisfies a certain lncRNA
characteristic.
➤ The intersection of all the small rectangles represents the predicted set of lncRNA transcripts.
[Figure labels: all the transcripts; less conserved ones; no counterpart in nr database; short ORF; true positive expression; predicted lncRNAs (the intersection)]
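The intersection idea can be sketched with Python sets. This is only an illustration of the logic, not the pipeline itself: the transcript IDs and the filter names are hypothetical, and each filter is assumed to have already produced the set of IDs that pass it.

```python
from typing import Dict, Set


def predict_lncrnas(all_transcripts: Set[str], filters: Dict[str, Set[str]]) -> Set[str]:
    """Return the transcripts that pass every filter (set intersection)."""
    predicted = set(all_transcripts)
    for passing in filters.values():
        predicted &= passing
    return predicted


# Hypothetical transcript IDs, for illustration only.
all_transcripts = {"tx1", "tx2", "tx3", "tx4", "tx5"}
filters = {
    "less_conserved": {"tx1", "tx2", "tx3"},
    "no_nr_counterpart": {"tx1", "tx3", "tx4"},
    "short_orf": {"tx1", "tx3", "tx5"},
    "expressed": {"tx1", "tx2", "tx3", "tx4"},
}
predicted = predict_lncrnas(all_transcripts, filters)
print(sorted(predicted))  # ['tx1', 'tx3']
```

Since intersection is order-independent, the filters can be applied in any order; in practice the cheapest filter would usually run first to shrink the input to the slower ones.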
9. What do we get at each step? (taking multi-exon transcripts as an example)
➤ Find transcripts with a short ORF (length < 150)
Because each record in the FASTA file occupies two rows, a file of n rows actually contains n/2 records.
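A rough Python sketch of this step follows. Several points are assumptions, since the slide does not specify them: an ORF is taken to run from an ATG to the first in-frame stop codon on the forward strand only, its length is counted in nucleotides including the stop codon, and the sequences shown are made up. The tool actually used in the pipeline may define ORFs differently.

```python
from typing import Dict, Iterable

STOP_CODONS = {"TAA", "TAG", "TGA"}


def read_two_line_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Read FASTA text where each record is one header row plus one sequence row."""
    rows = [ln.strip() for ln in lines if ln.strip()]
    if len(rows) % 2 != 0:
        raise ValueError("expected an even number of rows (header + sequence)")
    records: Dict[str, str] = {}
    for header, seq in zip(rows[0::2], rows[1::2]):
        if not header.startswith(">"):
            raise ValueError(f"not a FASTA header: {header!r}")
        records[header[1:].split()[0]] = seq.upper()
    return records


def longest_orf_length(seq: str) -> int:
    """Longest ATG..stop ORF on the forward strand, in nucleotides (0 if none)."""
    best = 0
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon until the first in-frame stop.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                best = max(best, pos + 3 - start)
                break
    return best


# Made-up sequences, for illustration only.
fasta_rows = [
    ">tx1",
    "CCATGAAATAGCC",                  # 9-nt ORF
    ">tx2",
    "ATG" + "GCT" * 60 + "TAA",       # 186-nt ORF
]
records = read_two_line_fasta(fasta_rows)
short_orf = sorted(tid for tid, seq in records.items()
                   if longest_orf_length(seq) < 150)
print(len(fasta_rows), "rows ->", len(records), "records")
print(short_orf)  # ['tx1']
```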
10. What do we get at each step? (taking multi-exon transcripts as an example)
➤ Find transcripts with no counterpart in the nr database (E-value threshold > 10E-4)
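One way this filter could look in Python, assuming the Blastx results were saved in the standard 12-column tabular format (`-outfmt 6`), where the first column is the query ID and the eleventh is the E-value. The output format, the rows, and the cutoff value passed in the example are all assumptions for illustration; the cutoff is left as a parameter rather than fixed.

```python
from typing import Iterable, Set


def queries_with_hits(blast_rows: Iterable[str], max_evalue: float) -> Set[str]:
    """Query IDs that have at least one hit with E-value <= max_evalue."""
    hits: Set[str] = set()
    for row in blast_rows:
        if not row.strip() or row.startswith("#"):
            continue
        cols = row.rstrip("\n").split("\t")
        qseqid, evalue = cols[0], float(cols[10])
        if evalue <= max_evalue:
            hits.add(qseqid)
    return hits


def without_nr_counterpart(all_ids: Iterable[str], blast_rows: Iterable[str],
                           max_evalue: float) -> Set[str]:
    """Transcripts with no hit at or below the cutoff, including those with no hit at all."""
    return set(all_ids) - queries_with_hits(blast_rows, max_evalue)


# Made-up tabular rows: qseqid sseqid pident length mismatch gapopen
# qstart qend sstart send evalue bitscore
rows = [
    "tx1\tprotA\t35.0\t40\t26\t0\t1\t120\t5\t44\t2.1\t28.5",      # weak hit only
    "tx2\tprotB\t98.0\t200\t4\t0\t1\t600\t1\t200\t1e-120\t350",   # strong hit
]
kept = without_nr_counterpart(["tx1", "tx2", "tx3"], rows, max_evalue=1e-4)
print(sorted(kept))  # ['tx1', 'tx3']
```

Note that a transcript absent from the BLAST output (tx3 here) counts as having no counterpart, so the full ID list has to be supplied separately.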
11. What do we get at each step? (taking multi-exon transcripts as an example)
➤ Find transcripts that are consistently and significantly expressed across all replicates in at least
one type of cell (FPKM > 0.1)
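The expression rule can be written down as a short Python sketch: keep a transcript if there is at least one cell type in which every replicate exceeds the FPKM cutoff. The data layout (a mapping from cell type to replicate FPKM values), the cell-type names, and the numbers are hypothetical.

```python
from typing import Dict, List


def is_expressed(fpkm_by_cell_type: Dict[str, List[float]],
                 min_fpkm: float = 0.1) -> bool:
    """True if all replicates exceed min_fpkm in at least one cell type."""
    return any(
        len(replicates) > 0 and all(value > min_fpkm for value in replicates)
        for replicates in fpkm_by_cell_type.values()
    )


# Hypothetical FPKM values, for illustration only.
fpkm = {
    "tx1": {"segment_A": [0.5, 0.8, 0.3], "segment_B": [0.0, 0.0, 0.2]},
    "tx2": {"segment_A": [0.5, 0.0, 0.9], "segment_B": [0.05, 0.2, 0.3]},
}
expressed = sorted(tid for tid, table in fpkm.items() if is_expressed(table))
print(expressed)  # ['tx1']
```

Requiring every replicate to pass, rather than the mean, is what makes the filter "consistent": tx2 has a high average in segment_A but fails because one replicate is zero.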
12. Classification of lncRNAs
➤ Sense and antisense lncRNAs
➤ Sense lncRNAs can be classified into
intergenic, cons, incs, and ponds lncRNAs