Bioinformatic jc 08_14_2013_formal
Li Lei (KSU)

Poly-A read mapping in Arabidopsis.

https://sites.google.com/site/toomajianlab/Home/people

    Presentation Transcript

    • Genome-wide variation of alternative polyadenylation in sense and antisense transcription in Arabidopsis accessions
        Li Lei, Plant Pathology, KSU, lilei@ksu.edu
        August 14, 2013
    • Outline
        • Background
            - Pre-mRNA processing & polyadenylation
            - Alternative polyadenylation (APA)
            - APA in plants & unknown questions
        • Objective
        • Method
            - Approach
            - PALMapper: map RNA-seq reads to reference
            - How I retrieved the poly(A) reads
        • Result
            - Evidence for APA
            - Poly(A) site location & related gene annotation
        • Conclusion
        • Outlook
        • Acknowledgements
    • Background: Eukaryotic pre-mRNA processing & polyadenylation
        • Poly(A) site = PAS
        • For some genes, the PASs of their mRNAs are in only one place
        • For other genes, the PASs of their mRNAs are in different places
        • Alternative polyadenylation (APA): different mRNAs transcribed from the same gene have different PASs
        • Figure (Freitag, et al. 2012): Aspergillus nidulans pgkA (PGK) with two poly(A) sites, pA1 and pA2, supported by ESTs. C terminus and PTS1 score: pA1 -LPGVAALSEKSK* (–53.5); pA2 -LPGVAALSEKSPRL* (+3.1). Microscopy panel labels: RFP–PTS1, PgkA, GFP–Sps19, DIC, Merge.
    • Background: Alternative polyadenylation (APA)
        • Figure 1 (Current Opinion in Cell Biology 2013, 25:222–232): Major categories of APA. This model refers to a hypothetical gene with three exons and two PASs. (a) When both PASs are located in the 3′ UTR, identical proteins are produced. Because the 3′ UTR often contains elements regulating transcript stability, degradation, or localization, the quantity of protein produced may be altered depending upon PAS choice. (b) When one PAS is located in the coding region, a truncated protein is produced when the proximal PAS is chosen. Ex = exon, PAS = polyadenylation site; thick lines = UTR regions, thin lines = intronic regions.
        • Figure 2 (TiBS): Regulation of cis elements in 3′ untranslated regions (UTRs) by alternative cleavage and polyadenylation (APA). Panel labels: miRNA, RBP, translation, degradation, localization, CDS, cUTR, aUTR.
        • Figure 3 (Nature Reviews | Genetics): Biological processes that have been linked with broad APA modulation. A schematic showing the biological processes and diseases that alternative polyadenylation (APA) has been linked with. In addition, the tendency towards distal or proximal poly(A) site usage is shown. Panel labels: neuron activity, proliferation, cancer, oculopharyngeal muscular dystrophy.
        • Protein isoforms (adapted from Tress et al. 2007)
        • Sources shown on the slide: Mueller, et al. 2012; Tian, et al. 2013; Elkon, et al. 2013
    • Background: APA in plants and unknown questions
        • Polyadenylation has been investigated genome-wide in a single Arabidopsis accession, but we still do not know:
            1. How much variation is there in polyadenylation site usage across Arabidopsis accessions? What is the genetic basis for such variation? Cis regulation? Trans regulation?
            2. Is Arabidopsis an outlier for any of the trends of polyadenylation site usage compared with related species? How has APA evolved across related species?
        • Paper shown on the slide: "Genome-wide landscape of polyadenylation in Arabidopsis provides evidence for extensive alternative polyadenylation". Xiaohui Wu, Man Liu, Bruce Downie, Chun Liang, Guoli Ji, Qingshun Q. Li, and Arthur G. Hunt. Edited by David C. Baulcombe; approved June 8, 2011.
            Abstract: Alternative polyadenylation (APA) has been shown to play an important role in gene expression regulation in animals and plants. However, the extent of sense and antisense APA at the genome level is not known. We developed a deep-sequencing protocol that queries the junctions of 3′UTR and poly(A) tails and confidently maps the poly(A) tags to the annotated genome. The results of this mapping show that 70% of Arabidopsis genes use more than one poly(A) site, excluding microheterogeneity. Analysis of the poly(A) tags reveal extensive APA in introns and coding sequences, results of which can significantly alter transcript sequences and their encoding proteins. Although the interplay of intron splicing and polyadenylation potentially defines poly(A) site uses in introns, the polyadenylation signals leading to the use of CDS protein-coding region poly(A) sites are distinct from the rest of the genome. Interestingly, a large number of poly(A) sites correspond to putative antisense transcripts that overlap with the promoter of the associated sense transcript, a mode previously demonstrated to regulate sense gene expression. Our results suggest that APA plays a far greater role in gene expression in plants than previously expected.
            Details legible in the page excerpt: tags defined more than 280,000 individual poly(A) sites; poly(A) sites in the same gene located within 24 nt of each other were clustered into poly(A) site clusters (PACs); this gave more than 71,000 PACs with an average of 54 poly(A) tags (PATs) per PAC, of which 57,473 were in the sense orientation.
        • Paper shown on the slide: "Direct sequencing of Arabidopsis thaliana RNA reveals patterns of cleavage and polyadenylation". Alexander Sherstnev, Céline Duc, Christian Cole, Vasiliki Zacharaki, Csaba Hornyik, Fatih Ozsolak, Patrice M Milos, Geoffrey J Barton & Gordon G Simpson. Nature Structural & Molecular Biology, volume 19, number 8, August 2012, p. 845; published online 22 July 2012; doi:10.1038/nsmb.2345.
            Abstract: It has recently been shown that RNA 3′-end formation plays a more widespread role in controlling gene expression than previously thought. To examine the impact of regulated 3′-end formation genome-wide, we applied direct RNA sequencing to A. thaliana. Here we show the authentic transcriptome in unprecedented detail and describe the effects of 3′-end formation on genome organization. We reveal extreme heterogeneity in RNA 3′ ends, discover previously unrecognized noncoding RNAs and propose widespread reannotation of the genome. We explain the origin of most poly(A)+ antisense RNAs and identify cis elements that control 3′-end formation in different registers. These findings are essential to understanding what the genome actually encodes, how it is organized and how regulated 3′-end formation affects these processes.
            Details legible in the page excerpt: total RNA from A. thaliana seedlings was subjected to direct RNA sequencing (DRS); reads were aligned uniquely to TAIR10; the vast majority of reads (89.60%) aligned to protein-coding genes.
        • Figure (Xing, et al. 2012): FIGURE 2 | Schematic representation of alternative polyadenyla... Labels: AtCPSF30, AtCPSF30*-YT521B, FLC, OXT6, PAS1, PAS2, Gene, Transcript1, Transcript2.
    • Objective
        • Investigate genome-wide variation of alternative polyadenylation in sense and antisense transcription across a set of Arabidopsis thaliana accessions
        • Is variation in APA as prevalent across genotypes as across tissue types?
        • Is there a genetic basis for variation in APA that involves trans regulation as well as cis regulation?
        • Does a gene's proximity to neighboring genes constrain polyadenylation site choice and limit variation?
    • Method: Approach
        • Material: 19 accessions (genome sequenced); tissues: seedling, root, floral bud
        1. RNA extraction & library construction with barcode
        2. 82 bp strand-specific RNA-seq
        3. Map reads to each corresponding genome with PALMapper
        4. Transform read positions from each transcriptome into a common coordinate system based on a multiple-genome alignment
        5. Retrieve poly(A)-containing reads, cluster across all accessions and identify poly(A) sites (PASs)
        6. Generate read counts for each PAS for each accession
        7. Compare PASs genome-wide across accessions
    • Method: PALMapper: map RNA-seq reads to reference (adapted from Kahles, et al. 2013 talk, Memorial Sloan-Kettering Cancer Center)
        • PALMapper (Jean, et al. 2010) is a combination of:
            - the spliced alignment method QPALMA (De Bona, et al. 2008)
            - the short read alignment tool GenomeMapper (Schneeberger, et al. 2009)
        • Version 0.5 released: http://ftp.raetschlab.org/software/palmapper/palmapper-0.5.tar.gz
        • Another mapper? Advantages:
            - Alignments with variants, e.g. mismatches, indels
            - Accurate spliced alignments using computational splice site predictions
            - More accurate than TopHat (e.g. C. elegans 47% & 81%, respectively)
            - Fast alignments (about 10 million reads/hour)
            - Soft-trimming of the poly(A) tail of each read
    • Soft-trimming
        • The trimmed sequence remains in the BAM file
        • It is marked in the CIGAR string with the "S" operation
        • It is ignored by many tools, such as IGV
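The CIGAR "S" marking described on this slide can be read back with a few lines of Python. This is a minimal sketch, not the pipeline used in the talk: the function name `soft_clipped_tail` is hypothetical, and it only assumes a standard SAM CIGAR string and the read sequence as stored in the SAM record.

```python
import re

# One CIGAR operation is a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def soft_clipped_tail(cigar: str, seq: str) -> str:
    """Return the bases soft-clipped at the right end of the stored read, or ''."""
    ops = CIGAR_OP.findall(cigar)
    # Hard clips (H) are not present in SEQ, so skip them at the end.
    while ops and ops[-1][1] == "H":
        ops.pop()
    if ops and ops[-1][1] == "S":
        n = int(ops[-1][0])
        if n > 0:
            return seq[-n:]
    return ""

if __name__ == "__main__":
    read = "C" * 70 + "A" * 12
    print(soft_clipped_tail("70M12S", read))
```

Because the clipped bases stay in the record, the tail can be inspected after mapping without going back to the original FASTQ files.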
    • Method: How did I retrieve the poly(A) reads?
        • Input: the mapped SAM file with soft-trimmed poly(A)
        • Diagram: RNA-seq reads aligned 5′ to 3′ against the genome (+ strand), with the soft-trimmed poly(A) tail (AAAAAAAA) at the 3′ end of the read
        • Perl programming to pick up poly(A) reads, using these criteria:
            - Consecutive As at the 3′ end of the read >= 8 bp
            - Quality score of each A >= 40
            - Huge splicing: splicing length >= 1500 bp
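The selection criteria on this slide can be sketched as a small filter. The talk used a Perl program that is not shown, so everything here is an assumption made for illustration: the function names, the Phred+33 quality encoding, reading "quality score of each A >= 40" as a run of trailing As that each meet the cutoff, and treating reads with a splice gap of 1500 bp or more as rejected. The sketch also only handles reads whose tail appears as As at the right end of the stored sequence.

```python
import re

MIN_A_RUN = 8        # consecutive As required at the 3' end (from the slide)
MIN_A_QUAL = 40      # minimum quality of each counted A (from the slide)
HUGE_SPLICE = 1500   # splice length flagged as "huge" (from the slide)
PHRED_OFFSET = 33    # assumption: Sanger / Illumina 1.8+ encoding

def high_quality_a_run(seq: str, qual: str) -> int:
    """Count trailing bases that are 'A' and have quality >= MIN_A_QUAL."""
    run = 0
    for base, q in zip(reversed(seq), reversed(qual)):
        if base != "A" or ord(q) - PHRED_OFFSET < MIN_A_QUAL:
            break
        run += 1
    return run

def longest_splice(cigar: str) -> int:
    """Length of the longest N (skipped region) operation in a CIGAR string."""
    gaps = [int(n) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar) if op == "N"]
    return max(gaps, default=0)

def is_polya_read(seq: str, qual: str, cigar: str) -> bool:
    """Assumed reading of the slide: long high-quality A tail and no huge splice."""
    if high_quality_a_run(seq, qual) < MIN_A_RUN:
        return False
    return longest_splice(cigar) < HUGE_SPLICE

if __name__ == "__main__":
    seq = "CGT" * 10 + "A" * 10
    print(is_polya_read(seq, "I" * 40, "30M10S"))
```

Whether huge-splice reads were excluded or handled separately is not stated on the slide; flipping the last comparison in `is_polya_read` gives the other reading.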
    • Result: Defining poly(A) clusters (PAS)
        • Identify poly(A) reads across accessions: 2,203,313
        • Cluster poly(A) reads: 75,532 PASs
            - In the same orientation
            - Within 10 bp of each other across all accessions
            - Total cluster interval spanning <= 24 bp
        • Map PASs to genic regions (±120 bp around the annotated range):
            - 93.4% of PASs map to genic regions
            - 6.6% of PASs lie further away from genic regions
        • Consider the sense & antisense PASs:
            - Orientation of poly(A) reads relative to the gene orientation
            - 6581 genes with >= 20 sense poly(A) reads across accessions
            - 1473 genes with >= 10 antisense poly(A) reads across accessions
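The three clustering rules above (same orientation, neighbours within 10 bp, total span at most 24 bp) can be sketched as a greedy single pass over sorted positions. This is one plausible implementation rather than the one used for the talk; the function name `cluster_sites`, the tuple layout, and measuring the span as end minus start are all assumptions.

```python
from collections import defaultdict

MAX_GAP = 10    # neighbouring poly(A) positions at most 10 bp apart (from the slide)
MAX_SPAN = 24   # whole cluster spans at most 24 bp (from the slide)

def cluster_sites(sites):
    """Group poly(A) positions into clusters.

    sites: iterable of (chrom, strand, pos), pooled across all accessions.
    Returns a list of (chrom, strand, start, end, n_reads).
    """
    by_key = defaultdict(list)
    for chrom, strand, pos in sites:
        # Only positions on the same chromosome and strand may share a cluster.
        by_key[(chrom, strand)].append(pos)

    clusters = []
    for (chrom, strand), positions in sorted(by_key.items()):
        positions.sort()
        start = prev = positions[0]
        count = 1
        for pos in positions[1:]:
            if pos - prev <= MAX_GAP and pos - start <= MAX_SPAN:
                count += 1
            else:
                clusters.append((chrom, strand, start, prev, count))
                start, count = pos, 1
            prev = pos
        clusters.append((chrom, strand, start, prev, count))
    return clusters

if __name__ == "__main__":
    demo = [("Chr1", "+", 100), ("Chr1", "+", 105), ("Chr1", "+", 112),
            ("Chr1", "+", 130), ("Chr1", "-", 104)]
    for cluster in cluster_sites(demo):
        print(cluster)
```

A greedy pass closes a cluster as soon as either limit is exceeded, so a long run of evenly spaced sites is split into several clusters instead of one wide one.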
    • Result: Reads mapping to the major and non-major poly(A) clusters within a gene
        • Major PAS: for each gene, the PAS with the most reads across all accessions
        • p = proportion of total reads in the gene mapping to the major PAS
        • q = 1 - p = proportion of total reads in the gene mapping to non-major PASs
    • Result: The distribution of the proportion of reads mapping to non-major sense & antisense poly(A) clusters per gene
        • Genes with a proportion of non-major cluster reads equal to or greater than 0.4 (indicated with gray dashed lines) were considered to contain alternative poly(A) sites and were chosen for further polymorphism analysis
        • Plot panels: 6581 genes with sense PASs; 1471 genes with antisense PASs
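The definitions of p and q and the 0.4 cutoff can be made concrete with a short sketch. The function names and the dictionary layout (PAS identifier mapped to the read count summed over accessions for one gene) are hypothetical; only the definitions and the threshold come from the slides.

```python
ALT_THRESHOLD = 0.4  # cutoff on q used to call alternative poly(A) sites (from the slide)

def non_major_fraction(counts):
    """Return q, the fraction of a gene's poly(A) reads outside its major PAS.

    counts: dict mapping PAS id -> reads summed across all accessions.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("gene has no poly(A) reads")
    major = max(counts.values())      # reads at the major PAS
    return (total - major) / total    # q = 1 - p

def has_alternative_pas(counts, threshold=ALT_THRESHOLD):
    """True when q is equal to or greater than the threshold."""
    return non_major_fraction(counts) >= threshold

if __name__ == "__main__":
    gene = {"pas1": 60, "pas2": 30, "pas3": 10}
    print(non_major_fraction(gene), has_alternative_pas(gene))
```

In this toy gene the major PAS holds 60 of 100 reads, so p = 0.6 and q = 0.4, which sits exactly on the cutoff and is therefore kept.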
    • Result: Pairwise difference in the proportion of reads mapping to non-major poly(A) clusters across accessions
        • For the ith and jth accessions, A_i and A_j, we can calculate the absolute difference in the proportion of reads mapping to non-major poly(A) clusters, here called D_{ij}:
            D_{ij} = |q_{A_i} - q_{A_j}|
        • Average pairwise difference, where n = 19:
            \bar{D} = \frac{1}{n} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} D_{ij}
        • Maximum pairwise difference:
            D_{\max} = \max\{D_{ij}\}
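A worked sketch of the pairwise statistics for one gene follows. The function names and the toy accession labels are hypothetical. The average is taken here as the plain mean over the n(n-1)/2 accession pairs, which is an assumption of this sketch about the normalising constant and should be checked against the original slide.

```python
from itertools import combinations

def pairwise_differences(q_by_accession):
    """Return {(a_i, a_j): D_ij} with D_ij = |q_i - q_j| for every accession pair."""
    names = sorted(q_by_accession)
    return {(a, b): abs(q_by_accession[a] - q_by_accession[b])
            for a, b in combinations(names, 2)}

def summarize(q_by_accession):
    """Return (mean pairwise difference, maximum pairwise difference D_max)."""
    diffs = pairwise_differences(q_by_accession)
    if not diffs:
        raise ValueError("need at least two accessions")
    # Assumption: average over the number of pairs, n(n-1)/2.
    mean_d = sum(diffs.values()) / len(diffs)
    return mean_d, max(diffs.values())

if __name__ == "__main__":
    q = {"acc1": 0.1, "acc2": 0.5, "acc3": 0.3}  # toy q values for one gene
    print(summarize(q))
```

With the 19 accessions of the talk there are 171 pairs per gene, so the full set of D_ij values is cheap to compute for every gene.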
    • Result: Pairwise difference in the proportion of reads mapping to non-major poly(A) clusters across accessions
        • Plot: 3074 genes with sense PAS; axes: average pairwise difference and maximum pairwise difference Dmax
    • Result: Pairwise difference in the proportion of reads mapping to non-major poly(A) clusters across accessions
        • Plot: 544 genes with antisense PAS; axes: average pairwise difference and maximum pairwise difference Dmax
    • Result: Gene position and antisense PAS
        • Nearby gene: a gene whose distance from its adjacent gene is <= 2 kb
        • Table columns:
            (1) Fraction of genes in each group
            (2) Fraction of genes with sense poly(A) reads >= 20
            (3) Fraction of genes with proportion of non-major sense PASs > 0.4
            (4) Fraction of genes with antisense poly(A) reads >= 10
            (5) Fraction of genes with proportion of non-major antisense PASs > 0.4

            Group   (1)      (2)      (3)      (4)      (5)
            A       57.87%   62.92%   62.94%   96.91%   97.79%
            B       20.48%   21.30%   20.59%    1.65%    0.74%
            C       21.64%   15.77%   16.46%    1.43%    1.47%
    • Conclusion
        • For genes with more sense & antisense poly(A) reads, half use non-major PASs at least 40% of the time
        • Pairwise comparison across all accessions helped to identify the best candidate genes for polymorphism in the usage or position of major PASs
    • Outlook
        • Combine all tissues & all accessions, calculate & its variance
        • Associate with gene categories, poly(A) site location of genes, etc.
        • Examine the trans/cis poly(A) QTL with the MAGIC lines' data
        • Check the relationship between the antisense poly(A) site & the orientation of nearby genes, and the relationship this may have with expression level
        • Check the data from related species, Capsella rubella & A. lyrata, to look at APA usage & its evolution between species
        • Ask whether A. thaliana is an outlier for any of the trends observed, and whether APA is derived in A. thaliana
    • Acknowledgements
        • Kansas State University: Dr. Chris Toomajian
        • University of Utah: Dr. Richard Clark, Dr. Joshua Steffen, Edward J. Osborne, Robert Greenhalgh
        • Wellcome Trust Centre for Human Genetics, University of Oxford: Dr. Richard Mott
        • Memorial Sloan-Kettering Cancer Center: Dr. Gunnar Raetsch, Philipp Drewe, Andre Kahles
    • Background: Alternative polyadenylation (APA)
        • Figure (Current Opinion in Cell Biology 2013, 25:222–232): Major categories of APA. This model refers to a hypothetical gene with three exons and two PASs. (a) When both PASs are located in the 3′ UTR, identical proteins are produced. Because the 3′ UTR often contains elements regulating transcript stability, degradation, or localization, the quantity of protein produced may be altered depending upon PAS choice. (b) When one PAS is located in the coding region, a truncated protein is produced when the proximal PAS is chosen. Ex = exon, PAS = polyadenylation site; thick lines = UTR regions, thin lines = intronic regions.
        • Protein isoforms (adapted from Tress et al. 2007)
    • Outlook
        • Combine all tissues and all accessions, take each tissue as a subset, calculate and its variance
        • For each tissue, associate with gene categories according to GO analysis & gene families
        • Compare the distribution of from different tissues, and PAS usage patterns among tissues or accessions
        • Check Ka/Ks for genes with high/low in all tissues
        • Check the poly(A) site location for genes with high , e.g. 3′ UTR, CDS, 5′ UTR or intron
        • Compare the location across accessions
        • Look at the relationship of location with gene expression level
        • Examine the cis poly(A) QTL with the MAGIC lines' RNA-seq data
        • Check the relationship between the antisense poly(A) site and the orientation of nearby genes for each tissue subset, and the relationship this may have with expression level
        • Check the data from Capsella rubella and A. lyrata to look at APA usage and its evolution between species
        • Ask whether A. thaliana is an outlier for any of the trends observed, and whether APA is derived in A. thaliana
    • Tian, et al. 2013 (page excerpt shown on the slide)
        • Legible text: "... differentiated cells are reprogrammed to ES cell-like induced pluripotent stem (iPS) cells [41]. A notable exception, however, has been observed with spermatogonial germ cells, whose reprogramming to ES cells involves 3′ UTR lengthening [41]. Notably, this is in line with the fact that germ cells are more proliferative than ES cells. Similar trends of 3′ UTR length regulation have been reported for comparisons of ES cells versus neural stem/progenitor (NSP) cells or neurons [42]. Although these studies have all pointed to a connection between 3′ UTR length and cell proliferation, cardiac hypertrophy, in which myocytes grow in size rather than in number, has also been found to involve 3′ UTR shortening [43]. Thus, a general rule may be that APA regulation is correlated with cell growth."
        • Figure 2 (TiBS): Regulation of cis elements in 3′ untranslated regions (UTRs) by alternative cleavage and polyadenylation (APA). Two mRNA isoforms are shown. Panel labels: miRNA, RBP, translation, degradation, localization, CDS, cUTR, aUTR.
        • Figure 1 (Current Opinion in Cell Biology 2013, 25:222–232): Major categories of APA. This model refers to a hypothetical gene with three exons and two PASs. (a) When both PASs are located in the 3′ UTR, identical proteins are produced. Because the 3′ UTR often contains elements regulating transcript stability, degradation, or localization, the quantity of protein produced may be altered depending upon PAS choice. (b) When one PAS is located in the coding region, a truncated protein is produced when the proximal PAS is chosen. Ex = exon, PAS = polyadenylation site; thick lines = UTR regions, thin lines = intronic regions.
    • Background: Alternative polyadenylation (APA)
        • Legible text from the page excerpt: "Upf1 binds to the 3′ UTR in a length-dependent manner, thus eliciting degradation of longer transcripts more rapidly [48]. The 3′ UTR contains elements that affect not only transcript degradation but also stability."
        • Figure (Current Opinion in Cell Biology 2013, 25:222–232): Major categories of APA, for a hypothetical gene with three exons and two PASs. Ex = exon, PAS = polyadenylation site; thick lines = UTR regions, thin lines = intronic regions.
        • Figure 2 (TiBS): Regulation of cis elements in 3′ untranslated regions (UTRs) by alternative cleavage and polyadenylation (APA). Two mRNA isoforms are shown. The 3′ UTR region upstream of the proximal cleavage and polyadenylation site (pA) is called the constitutive UTR (cUTR), and the downstream region is called the alternative UTR (aUTR). RNA-binding protein (RBP) and miRNA targeting to the aUTR are shown. Impacts on mRNA localization, translation, and degradation are indicated. CDS, coding sequence.
        • Figure (Elkon, et al. 2013): Global APA. Biological processes: neuron activity, proliferation. Connections to disease: cancer, oculopharyngeal muscular dystrophy.
        • Protein isoforms (adapted from Tress et al. 2007)
        • Sources shown on the slide: Mueller, et al. 2012; Tian, et al. 2013; Elkon, et al. 2013
    • Figure 3 | Biological processes that have been linked with broad APA modulation. A schematic showing the biological processes and diseases that alternative polyadenylation (APA) has been linked with. In addition, the tendency towards distal or proximal poly(A) site usage is shown. [Diagram labels: Global APA; Biological processes (Neuron activity, Proliferation); Connections to disease (Cancer, Oculopharyngeal muscular dystrophy); Favour distal poly(A) site usage; Favour proximal poly(A) site usage.] Nature Reviews | Genetics. Elkon, et al. 2013
    • hles (SKI, New York) PALMapper HiTSeq, July 20, 2013 Advantages: •  Alignments with variants, e.g. mismatches and indels •  Accurate spliced alignments using computational splice site predictions •  More accurate than TopHat (e.g. C. elegans: 47% & 81%, respectively) •  Fast alignments (about 10 million reads/hour) •  Soft trimming of the poly(A) tail of each read
    • How did I retrieve the poly(A) reads? Method Input: the mapped SAM file with soft-trimmed poly(A). Two classes of reads were retrieved: •  Reads with a soft-trimmed end and consecutive As at the end: consecutive As >= 8 and quality score of each soft-trimmed bp >= 40 •  Reads with a long splicing length and consecutive As at the end: splicing length >= 1500 bp, consecutive As >= 8 and quality score of each soft-trimmed bp >= 40; for reads with two splices (Splicing1 < Splicing2), Splicing2 length >= 1500 bp [Diagrams: RNA-seq reads on the + strand aligned 5′ to 3′ against the genome, with the AAAAAAAA tail either soft-trimmed or aligned across a long splice.] The criteria were implemented with Perl scripts.
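The Perl scripts themselves are not shown on the slide. The criteria above could be sketched as follows, here in Python rather than Perl. Everything in this block is an assumption made for illustration: the function names are hypothetical, the read is taken to be on the + strand with the soft-trimmed bases at its 3′ end (a trailing `S` operation in the SAM CIGAR string), qualities are assumed to be Phred+33 encoded, and for the spliced class the last splice (`N` operation) is the one tested against the 1500 bp cutoff.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar):
    """Split a SAM CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]

def trailing_soft_clip(cigar):
    """Number of soft-trimmed bases at the 3' end of the read (0 if none)."""
    ops = parse_cigar(cigar)
    return ops[-1][0] if ops and ops[-1][1] == "S" else 0

def last_splice_length(cigar):
    """Length of the last splice (N operation) in the alignment (0 if unspliced)."""
    splices = [n for n, op in parse_cigar(cigar) if op == "N"]
    return splices[-1] if splices else 0

def clipped_quality_ok(qual, clip, min_qual=40, phred_offset=33):
    """True if every soft-trimmed base has quality >= min_qual (assumes Phred+33)."""
    return all(ord(c) - phred_offset >= min_qual for c in qual[len(qual) - clip:])

def is_softclipped_polya_read(seq, qual, cigar, min_a=8, min_qual=40):
    """Class 1: soft-trimmed end that finishes with >= min_a consecutive As."""
    clip = trailing_soft_clip(cigar)
    if clip < min_a:
        return False
    clipped_seq = seq[len(seq) - clip:]
    return clipped_seq.endswith("A" * min_a) and clipped_quality_ok(qual, clip, min_qual)

def is_spliced_polya_read(seq, qual, cigar, min_a=8, min_qual=40, min_splice=1500):
    """Class 2: last splice >= min_splice bp and the read ends with >= min_a As."""
    if last_splice_length(cigar) < min_splice:
        return False
    clip = trailing_soft_clip(cigar)
    return seq.endswith("A" * min_a) and clipped_quality_ok(qual, clip, min_qual)
```

For example, a read with CIGAR `10M8S` whose last eight bases are `AAAAAAAA`, each with quality character `I` (Phred 40), would pass the first test, while the same read with a single clipped base of quality 39 would not.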
    • Defining poly(A) clusters (PAS) •  2,203,313 poly(A) reads were identified across accessions •  The poly(A) site of each poly(A) read was calculated with a Perl script •  75,532 PAS were defined by clustering poly(A) reads that have the same orientation and lie within 10 bp of each other across all accessions, with the total cluster interval spanning no more than 24 bp •  93.4% of clusters map to genic regions; the remaining 6.6% lie further away from genic regions •  6,581 genes have at least 20 sense poly(A) reads across accessions •  1,473 genes have at least 10 antisense poly(A) reads across accessions •  The major sense PAS of each gene is defined, across all accessions, as the sense PAS with the most reads •  p = proportion of total reads in a gene mapping to the major PAS •  q = 1 - p = proportion of total reads in a gene mapping to non-major PASs Result
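The clustering rule and the definitions of p and q above can be illustrated with a minimal Python sketch. It is not the Perl script used for the talk. It assumes a greedy left-to-right pass over sorted sites, takes "within 10 bp" to mean the distance to the previous site in the cluster, and takes the 24 bp limit to apply to the distance between the first and last site of a cluster. The function names and the tuple layout are hypothetical.

```python
from collections import defaultdict

def cluster_sites(sites, max_gap=10, max_span=24):
    """Group poly(A) sites into clusters.

    sites: iterable of (chrom, strand, pos), one entry per poly(A) read.
    Returns a list of (chrom, strand, start, end, read_count) tuples.
    """
    by_key = defaultdict(list)
    for chrom, strand, pos in sites:
        by_key[(chrom, strand)].append(pos)  # cluster each orientation separately

    clusters = []
    for (chrom, strand), positions in sorted(by_key.items()):
        positions.sort()
        start = prev = positions[0]
        count = 1
        for pos in positions[1:]:
            # Extend the cluster only if both the gap and the span limits hold.
            if pos - prev <= max_gap and pos - start <= max_span:
                prev = pos
                count += 1
            else:
                clusters.append((chrom, strand, start, prev, count))
                start = prev = pos
                count = 1
        clusters.append((chrom, strand, start, prev, count))
    return clusters

def non_major_fraction(read_counts):
    """q = 1 - p, where p is the share of a gene's reads in its major PAS."""
    total = sum(read_counts)
    if total == 0:
        raise ValueError("gene has no poly(A) reads")
    return (total - max(read_counts)) / total
```

With per-PAS read counts of 60, 30 and 10 for one gene, the major PAS holds p = 0.6 of the reads and `non_major_fraction([60, 30, 10])` gives q = 0.4.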
    • The distribution of the proportion of reads mapping to non-major sense and antisense poly(A) clusters per gene. Genes with a proportion of non-major cluster reads equal to or greater than 0.4 (indicated with gray dashed lines) were considered to contain alternative poly(A) sites and were chosen for further polymorphism analysis. Result
    • Pairwise difference in the proportion of reads mapping to non-major poly(A) clusters across accessions. 3074 genes with sense PAS. Result
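The slide does not say how the pairwise differences were computed. One plausible reading, sketched below in Python under that assumption, is the absolute difference in the non-major read proportion of a gene between every pair of accessions. The function name and the accession labels are placeholders, not names from the talk.

```python
from itertools import combinations

def pairwise_q_differences(q_by_accession):
    """Absolute difference in non-major read proportion for each accession pair.

    q_by_accession: dict mapping an accession label to the proportion for one gene.
    Returns a dict keyed by (accession_a, accession_b) in sorted label order.
    """
    return {
        (a, b): abs(q_by_accession[a] - q_by_accession[b])
        for a, b in combinations(sorted(q_by_accession), 2)
    }
```

For a gene with proportions of 0.5, 0.25 and 0.0 in three accessions, the three pairwise differences are 0.25, 0.5 and 0.25.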
    • Gene position and antisense PAS Result Nearby gene: a gene whose distance from its adjacent gene is <= 2 kb