Predicting lncRNA Transcripts Out of Comprehensive
Rat Renal Cell type-specific Transcriptome Libraries
Gui Chen
11/20/2015
WHY LONG NON-CODING RNA?
➤ Many long non-coding RNA transcripts
(lncRNAs) function in a variety of
cellular processes, including differentiation,
the cell cycle, and maintenance of stem-cell-like
phenotypes, and their expression is
cell-type specific. Yet very little is
known about their regulation or their roles in
disease states.
➤ A newly established rat renal gene
expression database and the recently
assembled rn6 genome sequence
pave the way for such a study.
WHAT EXACTLY IS THE DATA SOURCE?
➤ 110 (renal tubule segments) +
5 (glomeruli) renal cell-type-specific gene
expression profiles, produced by the work
described in the paper shown at left.
➤ 7 polyadenylated mRNA-seq (PA-seq) &

cortical collecting duct (4 control rats
and 4 water-loaded rats)
➤ 125 libraries in total

WHAT IS THE FORMAT OF THE DATA?
➤ The original transcript data are stored in
GTF format, a flat tab-delimited
text format that can be loaded directly
into Excel.
➤ Next is a real example of what GTF
records look like.

GTF FILE EXAMPLE
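Because GTF is tab-delimited with nine fixed columns, records can also be read programmatically. The following is a minimal sketch, not the pipeline's actual code: the function names are hypothetical, and the sample record is made up for illustration rather than taken from the rat renal libraries.

```python
import csv
import io

# The nine standard GTF columns, in order.
GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attributes"]


def parse_attributes(field):
    """Turn 'key "value"; key "value";' into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attrs[key] = value.strip('"')
    return attrs


def read_gtf(handle):
    """Yield one dict per GTF record, skipping blank and comment lines."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("#"):
            continue
        record = dict(zip(GTF_COLUMNS, row))
        record["start"] = int(record["start"])
        record["end"] = int(record["end"])
        record["attributes"] = parse_attributes(record["attributes"])
        yield record


# Hypothetical record, invented for illustration only.
example = (
    'chr1\texample_source\texon\t1000\t1500\t.\t+\t.\t'
    'gene_id "G1"; transcript_id "T1"; exon_number "1";\n'
)
records = list(read_gtf(io.StringIO(example)))
print(records[0]["attributes"]["transcript_id"], records[0]["start"])
```

Keeping the attributes column as a dictionary makes it easy to group exon records by `transcript_id` in later steps.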
How can we pick out the transcripts that are potentially long
non-coding RNAs from thousands of transcripts?
1. What are the characteristics of lncRNAs, based on preliminary data and
experience?
➤ They are less conserved than protein-coding genes. (PhyloCSF)
➤ Their ORF (open reading frame) is much shorter than that of protein-coding genes. (They
do not necessarily have one; if they do, is it short and present by chance, or
were they originally genes?)
➤ When forcibly translated into protein, they have no counterpart in the
nr database (non-redundant protein database). (Blastx)
➤ They are consistently and significantly expressed in at least one
cell type.

2. Extract the records that satisfy all of the characteristics above.
A pipeline was built on this idea.
In theory, the pipeline works like this…
➤ The biggest circle represents the whole search space.
➤ Each small rectangle inside the big circle represents the subset of records in the search space that satisfies a particular lncRNA
characteristic.
➤ The intersection of all the small rectangles represents the predicted set of lncRNA transcripts.
[Figure labels: all the transcripts; less conserved ones; no counterpart in nr database; short ORF; true positive expression; predicted lncRNAs]
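The intersection idea can be sketched as plain set operations. This is an illustrative sketch only: `predict_lncrnas` is a hypothetical helper, and the transcript IDs are invented.

```python
def predict_lncrnas(all_ids, *passing_sets):
    """Return the IDs that pass every filter.

    all_ids      -- the whole search space (all transcript IDs)
    passing_sets -- one set of IDs per lncRNA characteristic
    """
    candidates = set(all_ids)
    for passing in passing_sets:
        candidates &= set(passing)  # keep only IDs that pass this filter too
    return candidates


# Hypothetical transcript IDs, for illustration only.
all_transcripts = {"T1", "T2", "T3", "T4", "T5"}
less_conserved = {"T1", "T2", "T3"}
no_nr_counterpart = {"T1", "T3", "T4"}
short_orf = {"T1", "T3", "T5"}
truly_expressed = {"T1", "T2", "T3", "T4"}

predicted = predict_lncrnas(all_transcripts, less_conserved,
                            no_nr_counterpart, short_orf, truly_expressed)
print(sorted(predicted))
```

Because set intersection is commutative, the filters can be run in any order; running the cheapest one first only reduces how many transcripts the expensive steps have to process.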
What do we get at each step? (taking multi-exon transcripts as an example)
➤ Find transcripts with a short ORF (length < 150)
Because each record in the FASTA file occupies two rows, a file with n rows actually holds n/2 records.
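A minimal sketch of the short-ORF step follows, under several assumptions that the slides do not state: the cutoff of 150 is treated as nucleotides, an ORF is taken to run from an ATG to the first in-frame stop codon (stop included), and only the three forward frames are scanned. The function names and sequences are hypothetical.

```python
import io

STOP_CODONS = {"TAA", "TAG", "TGA"}


def read_two_line_fasta(handle):
    """Yield (id, sequence) from a FASTA file with one header row
    and one sequence row per record."""
    lines = [line.strip() for line in handle if line.strip()]
    for header, seq in zip(lines[0::2], lines[1::2]):
        yield header.lstrip(">").split()[0], seq.upper()


def longest_orf_length(seq):
    """Length in nucleotides of the longest ATG...stop ORF found in
    the three forward reading frames (stop codon included)."""
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def short_orf_transcripts(handle, max_len=150):
    """IDs whose longest ORF is shorter than max_len (assumed nucleotides)."""
    return {tid for tid, seq in read_two_line_fasta(handle)
            if longest_orf_length(seq) < max_len}


# Hypothetical sequences: T1 has a 9-nt ORF, T2 a 186-nt ORF.
fasta_text = ">T1\nATGAAATAGCC\n>T2\nATG" + "AAA" * 60 + "TAA\n"
short_orf_ids = short_orf_transcripts(io.StringIO(fasta_text))
print(short_orf_ids)
```

Scanning the reverse complement as well would require three more frames; whether the original pipeline did so is not stated in the slides.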
What do we get at each step? (taking multi-exon transcripts as an example)
➤ Find transcripts with no counterpart in the nr database (E-value threshold > 10E-4)
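One way to apply this filter is to parse BLAST tabular output (`-outfmt 6`), whose default columns place the query ID first and the E-value eleventh. The sketch below assumes that format; the helper names, the default cutoff of 1e-4 (one reading of the threshold on the slide), and the sample hits are all hypothetical.

```python
import csv
import io


def ids_with_protein_hit(blast_tsv, max_evalue):
    """Query IDs with at least one hit whose E-value is <= max_evalue.

    Assumes default BLAST tabular columns: query ID in column 1,
    E-value in column 11.
    """
    hits = set()
    for row in csv.reader(blast_tsv, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        query_id, evalue = row[0], float(row[10])
        if evalue <= max_evalue:
            hits.add(query_id)
    return hits


def ids_without_protein_hit(all_ids, blast_tsv, max_evalue=1e-4):
    """Transcripts with no significant protein counterpart."""
    return set(all_ids) - ids_with_protein_hit(blast_tsv, max_evalue)


# Hypothetical hits: T2 has a strong hit, T3 only a weak one, T1 none.
blast_text = (
    "T2\tprot1\t90.0\t100\t10\t0\t1\t300\t1\t100\t1e-50\t200\n"
    "T3\tprot2\t30.0\t40\t28\t0\t1\t120\t5\t44\t0.5\t25.0\n"
)
no_counterpart = ids_without_protein_hit({"T1", "T2", "T3"},
                                         io.StringIO(blast_text))
print(sorted(no_counterpart))
```

Transcripts that are absent from the BLAST output entirely count as having no counterpart, which is why the function needs the full ID list and not just the hit table.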
What do we get at each step? (taking multi-exon transcripts as an example)
➤ Find transcripts that are consistently and significantly expressed in all replicates of at least
one cell type (FPKM > 0.1)
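The expression criterion can be sketched as follows, assuming the FPKM values have already been collected into a nested dictionary keyed by transcript and cell type. The data layout, the function name, and all values are hypothetical.

```python
def expressed_transcripts(fpkm, min_fpkm=0.1):
    """IDs whose FPKM exceeds min_fpkm in every replicate of at least
    one cell type.

    fpkm -- {transcript_id: {cell_type: [FPKM per replicate]}}
    """
    keep = set()
    for tid, by_cell_type in fpkm.items():
        for replicates in by_cell_type.values():
            # An empty replicate list is not evidence of expression.
            if replicates and all(v > min_fpkm for v in replicates):
                keep.add(tid)
                break
    return keep


# Hypothetical FPKM values, for illustration only.
fpkm_table = {
    "T1": {"CCD": [0.5, 0.8, 0.3], "glomerulus": [0.0, 0.0]},
    "T2": {"CCD": [0.5, 0.05, 0.9], "glomerulus": [0.2, 0.0]},
    "T3": {"CCD": [0.0, 0.0, 0.0], "glomerulus": [1.2, 2.4]},
}
expressed = expressed_transcripts(fpkm_table)
print(sorted(expressed))
```

Requiring every replicate to pass, instead of the mean, keeps out transcripts such as T2 whose signal comes from only some of the replicates.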

Classification of lncRNAs
➤ Sense and antisense lncRNAs
➤ Sense lncRNAs can be further classified into
intergenic, cons, incs, and ponds lncRNAs
RESULT
THANK YOU
& Happy Thanksgiving!
