This document discusses differential gene expression analysis of RNA-Seq data from oral squamous cell carcinoma samples using Bioconductor packages in R. It describes retrieving RNA-Seq data from GEO, mapping reads to a reference genome, generating a count file, and using edgeR, DESeq, and baySeq packages to identify differentially expressed genes. The results show top differentially expressed genes from each package and 215 genes common to all three. The conclusion states differential expression analysis can be successfully applied to tumor/normal RNA-Seq samples and identifies some genes that may be related to cancer.
undergrad thesis
1. Differential Gene Expression In RNA-Seq Data For Oral Squamous Cell
Carcinoma Using Bioconductor
19-Apr-15 1
By:
Kasturi P Chandwadkar
BBI 8th sem
BI-12
3. INTRODUCTION
• Oral squamous cell carcinoma (OSCC) accounts for 90% of oral
cancers, and the risk increases with age.
• Techniques for assessing and quantifying RNA by high-throughput
sequencing are collectively known as “RNA-Seq”.
• RNA-Seq has been applied to profile the complex transcriptomes
of mammalian samples, including human embryonic
kidney and B-cells, mouse embryonic stem cells, blastomeres,
and different mouse tissues.
4. ADVANTAGES OF RNA-SEQ
• One advantage of RNA-Seq over other profiling
technologies such as microarrays is the ability to query all
transcripts without prior knowledge of the location and
structure of genes.
• RNA-Seq is not limited to detecting transcripts that
correspond to existing genomic sequence.
• RNA-Seq has very low background signal because DNA
sequences can be unambiguously mapped to unique regions of
the genome.
5. R AND BIOCONDUCTOR PACKAGES
• R (http://cran.at.r-project.org) is a comprehensive statistical
environment and programming language for professional data
analysis and graphical display.
• Bioconductor (http://www.bioconductor.org/) provides many
additional R packages for statistical data analysis in different
life science areas, such as tools for microarray, sequence and
genome analysis.
• Packages used for differential gene expression:
• Biostrings
• biomaRt
• baySeq
• DESeq
• edgeR
6. Methodology
• RETRIEVAL OF NGS DATA
• The RNA-Seq data (FASTQ files) for oral squamous cell carcinoma were retrieved
from Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo) under
accession number GSE20116.
• MAPPING OF GENOMIC READS
• The short reads are mapped/aligned to the reference genome using Bowtie.
• GENERATING COUNT FILE
• A count file is a matrix in which each count is the number of reads that
mapped to a genomic region of the reference genome and each Id is the
annotation of that genomic region.
• GETTING DIFFERENTIALLY EXPRESSED GENES
• edgeR
• DESeq
• baySeq
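The count-file step above can be illustrated with a minimal Python sketch. It is not the procedure used in the slides: the region table `REGIONS`, the toy SAM records, and the rule of assigning a read by its leftmost mapped position are all assumptions made for illustration. Only the SAM column order and the 0x4 "unmapped" flag bit come from the SAM format itself.

```python
from collections import Counter

# Hypothetical annotation: region Id -> (reference name, start, end), 1-based inclusive.
REGIONS = {
    "geneA": ("chr1", 100, 199),
    "geneB": ("chr1", 300, 399),
}

def count_reads(sam_lines, regions):
    """Count mapped reads whose leftmost position falls inside each region."""
    counts = Counter({region_id: 0 for region_id in regions})
    for line in sam_lines:
        if line.startswith("@"):      # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, rname, pos = int(fields[1]), fields[2], int(fields[3])
        if flag & 4:                  # 0x4 flag bit set: read is unmapped
            continue
        for region_id, (ref, start, end) in regions.items():
            if rname == ref and start <= pos <= end:
                counts[region_id] += 1
    return dict(counts)

# Toy SAM records (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL).
SAM_EXAMPLE = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t120\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tchr1\t150\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t0\tchr1\t310\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]

counts = count_reads(SAM_EXAMPLE, REGIONS)
for region_id, n in counts.items():
    print(region_id, n)   # one "Id count" row per region, as in a count file
```

Running this once per sample and joining the per-sample results by Id would give the Id-by-sample matrix that the differential expression packages take as input.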
7. RNA-Seq analysis pipeline for detecting DGE
SHORT READS
→ ALIGN READS TO REFERENCE GENOME
→ PREPARE COUNT FILE FROM SAM FILE
→ GET DIFFERENTIAL GENE EXPRESSION (edgeR, DESeq, baySeq)
→ LIST OF DEG FROM EACH PACKAGE
→ VENN DIAGRAM OF DEG FROM THREE PACKAGES
12. Venn Diagram Of DGE With P-value Less Than 0.01
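The overlaps behind such a Venn diagram can be computed with plain set operations. The Python sketch below assumes each package's output has already been reduced to one significance value per gene. The gene names and values are made up for illustration and are not results from this study; only the 0.01 cutoff comes from the slide title.

```python
P_CUTOFF = 0.01  # threshold taken from the slide title

# Hypothetical per-package results: gene -> significance value (illustrative only).
results = {
    "edgeR":  {"g1": 0.001, "g2": 0.004, "g3": 0.200, "g4": 0.009},
    "DESeq":  {"g1": 0.002, "g2": 0.030, "g3": 0.005, "g4": 0.008},
    "baySeq": {"g1": 0.003, "g2": 0.002, "g3": 0.500, "g4": 0.400},
}

def significant(values, cutoff=P_CUTOFF):
    """Return the set of genes whose value is below the cutoff."""
    return {gene for gene, p in values.items() if p < cutoff}

# One DEG set per package.
deg = {package: significant(values) for package, values in results.items()}

# Regions of the Venn diagram as set intersections.
common_all = deg["edgeR"] & deg["DESeq"] & deg["baySeq"]
edger_deseq = deg["edgeR"] & deg["DESeq"]
deseq_bayseq = deg["DESeq"] & deg["baySeq"]
edger_bayseq = deg["edgeR"] & deg["baySeq"]

print("all three:", sorted(common_all))
print("edgeR and DESeq:", sorted(edger_deseq))
print("DESeq and baySeq:", sorted(deseq_bayseq))
print("edgeR and baySeq:", sorted(edger_bayseq))
```

Sets keep the comparison independent of the order in which each package reports its genes.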
13. Conclusion
• We have demonstrated that our DGE method can be
successfully applied to RNA-Seq samples in tumor and
matched normal tissues.
• Using three different statistical methods for inferring
differential gene expression in oral squamous cell carcinoma
(OSCC), we identified 215 genes common to all three packages.
• 1054 genes are common between edgeR and DESeq, 217 are
common between DESeq and baySeq, and 278 are common
between edgeR and baySeq.
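The pairwise figures above include the 215 genes shared by all three packages. Assuming that reading of the numbers, the short Python sketch below subtracts the three-way overlap to get the genes shared by exactly two packages. The variable names are illustrative; the counts come from the slide.

```python
# Overlap sizes reported in the conclusion.
triple = 215
pairwise = {
    ("edgeR", "DESeq"): 1054,
    ("DESeq", "baySeq"): 217,
    ("edgeR", "baySeq"): 278,
}

# Genes shared by exactly two packages: pairwise overlap minus the triple overlap.
only_pair = {pair: n - triple for pair, n in pairwise.items()}

# Genes called by at least two of the three packages.
at_least_two = triple + sum(only_pair.values())

for (a, b), n in only_pair.items():
    print(f"{a} and {b} only: {n}")
print("at least two packages:", at_least_two)
# edgeR and DESeq only: 839
# DESeq and baySeq only: 2
# edgeR and baySeq only: 63
# at least two packages: 1119
```

On this reading, nearly every gene shared by DESeq and baySeq is also called by edgeR, while edgeR and DESeq share many genes that baySeq does not call.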
14. Below is a table of some of the differentially expressed genes in the
cancer sample that may be related to cancer.
Gene id    Description
KRT36      Keratin, type I cuticular
ADIPOQ     Adiponectin, C1Q and collagen domain containing
PLA2G2A    Phospholipase A2, group IIA (platelets, synovial fluid)
CEACAM7    Carcinoembryonic antigen-related cell adhesion molecule
SPINK7     Serine peptidase inhibitor, Kazal type 7 (putative); esophagus cancer related gene 22
ALDH1A2    Aldehyde dehydrogenase 1 family, member
ENDOU      Endonuclease, polyU-specific
ANGPTL1    Angiopoietins
GDF10      Growth differentiation factor 10
TUSC5      Tumor suppressor candidate 5