The document describes a study that integrated The Cancer Genome Atlas (TCGA) pan-cancer data with experimentally defined microRNA-target interactions from Argonaute Crosslinking Immunoprecipitation (AGO-CLIP) data to identify a coregulated "superfamily" of oncogenic microRNAs. Analysis revealed a pan-cancer microRNA superfamily consisting of the miR-17, miR-19, miR-130, miR-93, miR-18, miR-455 and miR-210 seed families that cotarget critical tumour suppressors through a central GUGC core motif. This microRNA superfamily was found to cotarget the phosphoinositide 3-kinase, TGFβ and p53 pathways in many
This document describes a bioinformatics approach for prioritizing disease-causing human mitochondrial DNA mutations. It uses a "disease score" that averages the probabilities from six pathogenicity prediction methods to determine if a mutation is deleterious or benign. The approach was trained on 53 known disease-associated mutations and tested on 1872 observed mutations, achieving a disease score cutoff of 0.4311. Mutations meeting criteria of being rare, conserved, predicted pathogenic and occurring at low variability sites were prioritized. When tested on 21 tumor samples, 8/268 prioritized nonsynonymous mutations were tumor-specific, confirming the approach's ability to identify potentially pathogenic mutations.
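The scoring step described above can be sketched as follows. This is a minimal illustration, assuming each of the six predictors returns a probability in [0, 1]. The example probabilities are hypothetical, and the direction of the comparison against the 0.4311 cutoff (scores at or above it are called deleterious) is an assumption rather than something the summary states.

```python
from statistics import mean

DISEASE_SCORE_CUTOFF = 0.4311  # cutoff reported in the summary above


def disease_score(probabilities):
    """Average the pathogenicity probabilities from the individual predictors."""
    if not probabilities:
        raise ValueError("at least one predictor probability is required")
    if any(not 0.0 <= p <= 1.0 for p in probabilities):
        raise ValueError("probabilities must lie in [0, 1]")
    return mean(probabilities)


def classify(probabilities, cutoff=DISEASE_SCORE_CUTOFF):
    # Assumption: a score at or above the cutoff is called deleterious.
    return "deleterious" if disease_score(probabilities) >= cutoff else "benign"


# Hypothetical outputs of six predictors for a single mutation.
example = [0.91, 0.72, 0.15, 0.64, 0.80, 0.38]
print(round(disease_score(example), 4), classify(example))
```

Averaging keeps any single predictor from dominating the call, which is presumably why a combined score was preferred over one method.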
HDAC4 and HDAC7 Promote Breast and Ovarian Cancer Cell Migration by Regulatin... CrimsonpublishersCancer
Breast and ovarian cancers remain highly malignant tumors among women, posing a serious threat to women's health worldwide. In this study, we aimed to investigate the mechanism underlying breast and ovarian cancer cell migration. Wound healing assays showed that MDA-MB-231 and C13* cells have higher migration potential than MCF-7 and OV2078 cells, along with changes in epithelial-mesenchymal transition (EMT) markers. We found that HDAC4 and HDAC7 mRNA are upregulated in MDA-MB-231 and C13* cells. Moreover, targeting HDAC4 and HDAC7 with TSA or shRNA blocks MDA-MB-231 and C13* migration. These results reveal a new link between HDACs and EMT in the regulation of breast and ovarian cancer migration.
- 13 putative driver genes were identified from TCGA somatic mutation data including TP53, CTNNB1, and AXIN1. Some genes were associated with clinical characteristics and survival outcomes.
- Multi-omic data from TCGA including RNA-seq, miRNA-seq, DNA methylation, and copy number variation were integrated using similarity network fusion and clustering. This identified 5 subtypes of HCC with different survival profiles.
- Predictive models were built for each subtype on individual omics datasets, with concordance ranging from 56% to 87%. Analysis of gene expression profiles revealed different active biological processes between the subtypes.
Gene expression mining for predicting survivability of patients in earlystage... ijbbjournal
After numerous breakthroughs in medicine, microbiology, and pathology in the past century, lung cancer remains a leading cause of cancer-related death, even in developed countries. Lung cancer accounts for roughly 30% of all cancer-related deaths in the world. Diagnosis and treatment are still based on traditional histopathology. It is of paramount importance to predict the survivability of patients in the early stages of lung cancer so that specific treatments can be sought. Nonetheless, previous studies have shown histopathology to be inadequate for predicting lung cancer development and clinical outcome. Microarray technology allows researchers to examine the expression of thousands of genes simultaneously. This paper describes a state-of-the-art machine learning approach called averaged one-dependence estimators with subsumption resolution to tackle the problem of predicting whether a patient in the early stages of lung cancer will survive by mining DNA microarray gene expression data. To lower the computational complexity, we employ an entropy-based gene selection approach to select relevant genes that are directly responsible for lung cancer survivability prognosis. The proposed system achieved an average accuracy of 92.31% in predicting lung cancer survivability over 2 independent datasets. The experimental results confirm that gene expression mining can be used to predict the survivability of patients in the early stages of lung cancer.
Translation of microarray data into clinically relevant cancer diagnostic tes... Tapan Baral
This study aimed to develop a simple and inexpensive diagnostic test to distinguish between malignant pleural mesothelioma (MPM) and lung adenocarcinoma (ADCA) based on gene expression ratios, as current methods can be challenging. The researchers used microarray data from 31 MPM and 150 ADCA samples to identify genes with highly correlated expression levels between the two cancer types. They tested the accuracy of diagnostic ratios formed from combinations of two or three of these genes in differentiating between MPM and ADCA in 149 additional samples. Using two or three gene expression ratios achieved 95% and 99% accurate differential diagnosis, respectively, demonstrating this approach may provide a clinically useful diagnostic tool.
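A ratio-based call of this kind could look like the sketch below. It assumes that each ratio divides the expression of a gene that is higher in MPM by that of a gene that is higher in ADCA, that a combined ratio above 1 is read as MPM, and that several ratios are combined by their geometric mean. The gene names and expression values are hypothetical, and none of these details are stated in the summary.

```python
import math


def ratio_call(expr, ratio_pairs):
    """Classify one sample from gene-pair expression ratios.

    expr: dict mapping gene name -> expression level (positive numbers).
    ratio_pairs: list of (mpm_high_gene, adca_high_gene) tuples.
    Returns (label, combined_ratio).
    """
    log_ratios = [math.log(expr[num] / expr[den]) for num, den in ratio_pairs]
    # Geometric mean of the individual ratios, computed in log space.
    combined = math.exp(sum(log_ratios) / len(log_ratios))
    return ("MPM" if combined > 1.0 else "ADCA"), combined


# Hypothetical sample with two gene pairs (ratios 8 and 2).
sample = {"GENE_A": 800.0, "GENE_B": 100.0, "GENE_C": 300.0, "GENE_D": 150.0}
pairs = [("GENE_A", "GENE_B"), ("GENE_C", "GENE_D")]
label, combined = ratio_call(sample, pairs)
print(label, round(combined, 3))
```

Working with ratios rather than absolute levels makes the call independent of any per-sample scaling factor, which is one reason such a test could be inexpensive to run.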
Michael's IUCRL Poster 2014 Close to Final with CDW edits Michael Araya
1) The study analyzed gene expression data from The Cancer Genome Atlas to identify genes that correlate with Amot in breast cancer and AmotL2 in thyroid cancer.
2) Pathway analysis revealed several cancer-related pathways that covaried with Amot levels in breast cancer, including Wnt signaling and TGF-beta signaling.
3) Genes that positively correlated with Amot in breast cancer and negatively correlated with AmotL2 in thyroid cancer were enriched for those involved in tumor growth, metastasis, and cancer stem cells.
This document describes a bioinformatics tool that predicts gene expression in cancer using copy number alterations and methylation data as predictors in a linear model. The tool was created using data from The Cancer Genome Atlas and the R programming language. It analyzes specific cancers to identify oncogenes and tumor suppressor genes that are relevant to those cancers, and determines which genetic or epigenetic factors drive the expression of those genes. The tool provides comprehensive analysis of individual genes and ranks their relevance to different cancers. It allows researchers to efficiently understand disease-specific gene expression and identify additional genes involved in cancer.
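The modelling idea can be illustrated with an ordinary least squares fit. The tool itself was written in R; the sketch below uses Python with NumPy purely for illustration, and the data are synthetic. The function name, the variable names, and the choice of a single gene with two predictors are assumptions, not the tool's actual interface.

```python
import numpy as np


def fit_expression_model(copy_number, methylation, expression):
    """Least squares fit of expression on copy number and methylation.

    Returns (coefficients, r_squared), where coefficients are
    [intercept, copy_number_effect, methylation_effect].
    """
    X = np.column_stack([np.ones(len(expression)), copy_number, methylation])
    coef, _, _, _ = np.linalg.lstsq(X, expression, rcond=None)
    residuals = expression - X @ coef
    centred = expression - expression.mean()
    r_squared = 1.0 - (residuals @ residuals) / (centred @ centred)
    return coef, r_squared


# Synthetic data for one hypothetical gene across 200 tumours.
rng = np.random.default_rng(0)
cna = rng.normal(0.0, 1.0, 200)    # stand-in for a log2 copy-number ratio
meth = rng.uniform(0.0, 1.0, 200)  # stand-in for a methylation beta value
expr = 5.0 + 1.2 * cna - 2.0 * meth + rng.normal(0.0, 0.3, 200)

coef, r2 = fit_expression_model(cna, meth, expr)
print(coef, r2)
```

Comparing the sizes and signs of the two fitted effects is one simple way to judge whether copy number or methylation is the stronger driver of a gene's expression.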
This study compared genomic data from 39 HNSCC cell lines to genomic findings from 106 HNSCC tumors. Amplification of eight genes and deletion of five genes were found in both cell lines and tumors. Seventeen genes were only mutated in cell lines, suggesting these mutations arose in tissue culture. Conversely, 11 genes were only mutated in over 10% of tumors. Several mutant genes in the EGFR pathway were shared between cell lines and tumors. Pharmacologic profiling of six cell lines suggested PIK3CA mutation may predict sensitivity to EGFR/PI3K pathway drugs. These findings suggest correlating gene mutations between cell lines and tumors can guide selection of preclinical models.
This document reports on a study that identified cysteine-rich protein 2 (CRP2) as a new component of breast cancer cell invadopodia that promotes invasion and metastasis. The study found that CRP2 expression is higher in mesenchymal/invasive breast cancer cells and its expression level correlates with increased risk of metastasis in basal-like breast cancer patients. CRP2 was shown to localize to the actin core of invadopodia, where it bundles actin filaments. Knockdown of CRP2 reduced breast cancer cell invasion, matrix degradation, and MMP-9 expression/secretion. Ectopic expression of CRP2 in less invasive cells increased invasion. Depletion of CR
Assessing the clinical utility of cancer genomic and proteomic data across tu... Gul Muneer
This document summarizes a study that used machine learning to predict cancer patient survival based on integrating multiple types of molecular and clinical data from The Cancer Genome Atlas. The study found that combining molecular data like gene expression, methylation, and mutations with clinical data significantly improved survival prediction for kidney, ovarian, and lung cancers compared to using single data types alone. Analyzing the models provided biological insights into molecular subtypes and markers correlated with survival outcomes. The results suggest that more comprehensive molecular profiling of tumors could help stratify patients and identify targets for personalized cancer treatment.
This document summarizes a study that developed an approach to extend cancer pathways based on biological network topology analysis. The approach calculates correlation values between genes in a pathway and the overall pathway to identify new candidate genes for inclusion. It was tested on the prostate cancer pathway, identifying top candidate genes with strong literature support for involvement in prostate cancer. The results demonstrate that the pathway extension approach can accurately predict new genes highly relevant to the cancer, improving understanding and prognosis potential.
Potentiality of a triple microRNA classifier: miR-193a-3p, miR-23a and miR-3... Enrique Moreno Gonzalez
MicroRNAs (miRNAs) are short, non-coding RNA molecules that act as regulators of gene expression. Circulating blood miRNAs offer great potential as cancer biomarkers. The objective of this study was to correlate the differential expression of miRNAs in tissue and blood in the identification of biomarkers for early detection of colorectal cancer (CRC).
Differences in microRNA expression during tumor development in the transition... Enrique Moreno Gonzalez
The prostate is divided into three glandular zones, the peripheral zone (PZ), the transition zone (TZ), and the central zone. Most prostate tumors arise in the peripheral zone (70-75%) and in the transition zone (20-25%) while only 10% arise in the central zone. The aim of this study was to investigate if differences in miRNA expression could be a possible explanation for the difference in propensity of tumors in the zones of the prostate.
This document discusses the identification of Necdin as a novel STAT3 target gene that is downregulated in human cancer. The researchers used microarray analysis to compare gene expression profiles between cells with constitutively active STAT3 and normal cells. They identified differentially expressed genes between cells expressing oncogenic v-Src or constitutively active STAT3-C. Genes common to both lists were most likely directly regulated by STAT3. Computational analysis identified Necdin, a negative growth regulator, as downregulated in cells with active STAT3. Experiments confirmed STAT3 directly binds to and regulates the Necdin promoter, and Necdin expression inversely correlates with STAT3 activity in cancer cell lines. This suggests STAT
Acute myeloid leukemia (AML) is a hematopoietic malignancy with a dismal outcome in the majority of cases. A detailed understanding of the genetic alterations and gene expression changes that contribute to its pathogenesis is important to improve prognostication, disease monitoring, and therapy. In this context, leukemia-associated misexpression of microRNAs (miRNAs) has been studied, but no coherent picture has emerged yet, thus warranting further investigations.
Cancer is caused by genetic and epigenetic changes that alter the cell genome. Cancer bioinformatics analyzes DNA, RNA, and protein sequences to better understand cancer mechanisms. It utilizes databases like CGAP, HCGP, and caBIG that integrate gene expression data from millions of tumor and normal tissues to determine cancer expression patterns. Methods in cancer bioinformatics include genomics to study whole genome changes, transcriptomics to analyze all gene transcripts, and proteomics to study protein expression, modifications, and interactions, with the goal of discovering new cancer diagnostics and therapeutics.
As an uncommon malignant tumor, hypopharyngeal cancer accounts for 3–5% of head and neck tumors [1]. Most pathological types of hypopharyngeal cancer are squamous cell carcinoma. Due to the occult anatomical location of hypopharyngeal cancer and poor surgical effect, local recurrence or distant metastasis often occurs in patients with hypopharyngeal cancer following surgery.
This systematic review and meta-analysis examines the prognostic value of microRNAs (miRNAs) in stage II colorectal cancer patients. Eighteen studies were included for analysis. The pooled hazard ratio for death in stage II colorectal cancer patients was 1.90, indicating upregulation of miRNAs is associated with worse prognosis. Subgroup analyses found individual miRNAs like miR21, miR215, miR143-5p and miR106a were also associated with worse prognosis when upregulated. The study aims to identify miRNAs that could serve as biomarkers to predict prognosis and guide treatment decisions like use of adjuvant chemotherapy in stage II colorectal cancer.
A physical sciences network characterization of non-tumorigenic and metastati... Shashaanka Ashili
To investigate the transition from non-cancerous to metastatic from a physical sciences perspective, the Physical Sciences–Oncology Centers (PS-OC) Network performed molecular and biophysical comparative studies of the non-tumorigenic MCF-10A and metastatic MDA-MB-231 breast epithelial cell lines, commonly used as models of cancer metastasis. Experiments were performed in 20 laboratories from 12 PS-OCs. Each laboratory was supplied with identical aliquots and common reagents and culture protocols. Analyses of these measurements revealed dramatic differences in their mechanics, migration, adhesion, oxygen response, and proteomic profiles. Model-based multi-omics approaches identified key differences between these cells’ regulatory networks involved in morphology and survival. These results provide a multifaceted description of cellular parameters of two widely used cell lines and demonstrate the value of the PS-OC Network approach for integration of diverse experimental observations to elucidate the phenotypes associated with cancer metastasis.
This document discusses the role of the cell polarity regulator PARD3 in lung squamous cell carcinomas (LSCC). The key points are:
1. PARD3 was found to be somatically and biallelically inactivated through mutations, deletions, and promoter hypermethylation in 8% of examined LSCC tumors and cell lines. Most alterations resulted in truncated or non-functional PARD3 proteins.
2. Reconstitution of normal PAR3 activity in vivo reduced the invasive and metastatic properties of tumors, suggesting PARD3 acts as a tumor suppressor in LSCC.
3. PARD3 alterations prevented the formation of tight junctions between cells and the downstream signaling of STAT
Overexpression of YAP 1 contributes to progressive features and poor prognosi... Enrique Moreno Gonzalez
Yes-associated protein 1 (YAP 1), the nuclear effector of the Hippo pathway, is a key regulator of organ size and a candidate human oncogene in multiple tumors. However, the expression dynamics of YAP 1 in urothelial carcinoma of the bladder (UCB) and its clinical/prognostic significance are unclear.
Cancer recognition from dna microarray gene expression data using averaged on... IJCI JOURNAL
Cancer is a leading cause of death, responsible for around 13% of all deaths worldwide, and its incidence is growing at an alarming rate. Although cancer is preventable and curable in its early stages, the vast majority of patients are diagnosed very late. Therefore, it is of paramount importance to prevent and detect cancer early. Nonetheless, conventional methods of detecting and diagnosing cancer rely solely on skilled physicians, with the help of medical imaging, to detect certain symptoms that usually appear in the late stages of cancer. Microarray gene expression technology is a promising technology that can detect cancerous cells in the early stages of cancer by analyzing gene expression in tissue samples. It allows researchers to examine the expression of thousands of genes simultaneously. This paper describes a state-of-the-art machine learning approach called averaged one-dependence estimators with subsumption resolution to tackle the problem of recognizing cancer from DNA microarray gene expression data. To lower the computational complexity and to increase the generalization capability of the system, we employ an entropy-based gene selection approach to select relevant genes that are directly responsible for cancer discrimination. The proposed system achieved an average accuracy of 98.94% in recognizing and classifying cancer over 11 benchmark cancer datasets. The experimental results demonstrate the efficacy of our framework.
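An entropy-based gene selection step can be sketched as follows. The abstract does not say how entropy is used, so this example makes its own assumptions: each gene is binarised at its median, genes are ranked by information gain, and the gene names and expression values are toy data. It covers only the selection step, not the averaged one-dependence estimators classifier.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction from splitting the samples at the median of one gene."""
    ordered = sorted(values)
    threshold = ordered[len(ordered) // 2]
    low = [y for v, y in zip(values, labels) if v < threshold]
    high = [y for v, y in zip(values, labels) if v >= threshold]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in (low, high) if part)
    return entropy(labels) - remainder


def select_genes(expression, labels, top_k):
    """Rank genes by information gain and keep the top_k gene names."""
    gains = {gene: information_gain(vals, labels) for gene, vals in expression.items()}
    return sorted(gains, key=gains.get, reverse=True)[:top_k]


# Toy data: three hypothetical genes measured in six samples.
labels = ["tumour", "tumour", "tumour", "normal", "normal", "normal"]
expression = {
    "GENE_A": [9.1, 8.7, 9.4, 2.0, 2.3, 1.8],  # separates the two classes
    "GENE_B": [5.0, 5.2, 4.9, 5.1, 5.0, 4.8],  # uninformative
    "GENE_C": [7.0, 3.0, 6.5, 3.2, 6.8, 2.9],  # weakly informative
}
print(select_genes(expression, labels, top_k=2))
```

Filtering genes this way before training shrinks the feature space from thousands of genes to a handful, which is where the reduction in computational cost comes from.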
Recently, a phase II clinical trial in hepatocellular carcinoma (HCC) has suggested that the combination of sorafenib and 5-fluorouracil (5-FU) is feasible and side effects are manageable. However, preclinical experimental data explaining the interaction mechanism(s) are lacking. Our objective is to investigate the anticancer efficacy and mechanism of combined sorafenib and 5-FU therapy in vitro in HCC cell lines MHCC97H and SMMC-7721.
Interrogating differences in expression of targeted gene sets to predict brea... Enrique Moreno Gonzalez
Genomics provides opportunities to develop precise tests for diagnostics, therapy selection and monitoring. From analyses of our studies and those of published results, 32 candidate genes were identified, whose expression appears related to clinical outcome of breast cancer. Expression of these genes was validated by qPCR and correlated with clinical follow-up to identify a gene subset for development of a prognostic test.
Clinical and experimental studies regarding the expression and diagnostic val... Enrique Moreno Gonzalez
Carcinoembryonic antigen-related cell adhesion molecule 1 (CEACAM1) is a multifunctional Ig-like cell adhesion molecule that has a wide range of biological functions. According to previous reports, serum CEACAM1 is dysregulated in different malignant tumours and associated with tumour progression. However, serum CEACAM1 expression in non-small-cell lung carcinoma (NSCLC) is unclear, and the expression ratio of the CEACAM1-S and CEACAM1-L isoforms has seldom been investigated in NSCLC. This research is intended to study serum CEACAM1 and the ratio of CEACAM1-S/L isoforms in NSCLC.
This study characterized the non-coding RNA landscape in head and neck squamous cell carcinoma (HNSCC) using RNA sequencing data from 422 HNSCC patients. 307 non-coding transcripts were found to be significantly correlated with patient survival, associated with mutations in cancer genes like TP53 and CDKN2A, and correlated with copy number variations in chromosomes 3, 5, 7, and 18. Experimental validation of 3 selected non-coding RNAs - lnc-JPH1-7, miR-654-3p, and piR-34736 - found their expression levels associated with tumor stage, HPV status, and other clinical characteristics. Modulation of lnc-JPH1-7 expression in cell lines
In Silico Prescription of Anticancer Drugs Reveals Targeting Opportunities Nuria Lopez-Bigas
Large efforts dedicated to sequence thousands of tumor genome/exomes are expected to lead to significant improvements of precision cancer medicine. However, high inter-tumor heterogeneity is a major obstacle in the road to develop an arsenal of targeted cancer drugs to treat most cancer patients. Therefore, it is critical to understand the current scope of anti-cancer targeted drugs for different tumor types in order to use them with the highest efficacy, and to define priorities for the development of new ones. We have developed a novel methodology to interpret the genomes of a cohort of tumor samples and to assess their therapeutic opportunities. Starting with somatic mutations detected across the cohort, the methodology identifies the driver genes, highlights those that dominate the clonal landscape of the tumors and determines their mode of action. It then does an in-silico prescription of approved and candidate targeted drugs to each patient in the cohort. The application of this approach to a cohort of 6795 cancer samples of 28 different tumor types showed that the fraction of patients that could benefit from prescribed FDA-approved drugs is strikingly small. Nevertheless, it improves significantly if repurposing opportunities are taken into consideration, with large differences between tumor types. In addition, we identify 80 therapeutically unexploited cancer genes, tightly bound by pre-clinical small molecules or potentially suitable for molecule binding. The resource created with this analysis is also intended to provide interpretation of newly sequenced cancer genomes and to design pan-cancer and tumor type specific sequencing panels for efficient early cancer detection and clinical insight.
More details at http://www.intogen.org
The document discusses topics related to practicing bioinformatics including:
- Installing and working with the TextPad text editor
- Regular expressions (regex), including patterns, quantifiers, anchors, grouping, alternation, and variable interpolation
- Using regex memory variables ($1, $2, etc.) to extract matched substrings
- The s/// substitution operator and tr/// translation operator
- Applying these skills to tasks like finding restriction enzyme cut sites in DNA sequences
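The restriction-site task in the last bullet can be sketched with a regular expression. The notation in the list above ($1, s///, tr///) is Perl-style; this example expresses the same ideas in Python instead. The function names and the test sequence are hypothetical. EcoRI is used because it recognises GAATTC and cuts between the G and the first A.

```python
import re

# The capture group plays the role of a memory variable such as $1.
ECORI_SITE = re.compile(r"G(AATTC)")

# Character-by-character translation table, comparable to tr/ACGT/TGCA/.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def find_cut_positions(dna):
    """Return 0-based positions at which EcoRI would cut the given strand."""
    # match.start(1) is where the captured AATTC begins, just after the G.
    return [m.start(1) for m in ECORI_SITE.finditer(dna.upper())]


def digest(dna):
    """Split the sequence into the fragments produced by cutting at every site."""
    cuts = [0] + find_cut_positions(dna) + [len(dna)]
    return [dna[a:b] for a, b in zip(cuts, cuts[1:])]


def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]


seq = "ttgaattcaggctagaattcgg"
print(find_cut_positions(seq))
print(digest(seq))
print(reverse_complement("GAATTC"))
```

Because the EcoRI site is palindromic, its reverse complement is the site itself, so scanning one strand is enough to find every site.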
This document summarizes a presentation on next generation epigenetic profiling. It introduces epigenetics and how epigenetic changes like DNA methylation are important in causing cancer in addition to genetic changes. It discusses using methyl-binding domain sequencing to discover genome-wide methylation patterns and biomarkers. Examples are given of specific genes like MGMT and BRCA1 that show methylation changes in cancer. Integrating deep sequencing data with other assays is described to better understand methylation patterns and their effects on gene expression and cancer. Developing targeted panels of cancer-related genes with known epigenetic alterations is discussed for clinical applications.
This systematic review and meta-analysis examines the prognostic value of microRNAs (miRNAs) in stage II colorectal cancer patients. Eighteen studies were included for analysis. The pooled hazard ratio for death in stage II colorectal cancer patients was 1.90, indicating upregulation of miRNAs is associated with worse prognosis. Subgroup analyses found individual miRNAs like miR21, miR215, miR143-5p and miR106a were also associated with worse prognosis when upregulated. The study aims to identify miRNAs that could serve as biomarkers to predict prognosis and guide treatment decisions like use of adjuvant chemotherapy in stage II colorectal cancer.
A physical sciences network characterization of non-tumorigenic and metastati...Shashaanka Ashili
To investigate the transition from non-cancerous to metastatic from a physical sciences perspective, the Physical Sciences–Oncology Centers (PS-OC) Network performed molecular and biophysical comparative studies of the non-tumorigenic MCF-10A and metastatic MDA-MB-231 breast epithelial cell lines, commonly used as models of cancer metastasis. Experiments were performed in 20 laboratories from 12 PS-OCs. Each laboratory was supplied with identical aliquots and common reagents and culture protocols. Analyses of these measurements revealed dramatic differences in their mechanics, migration, adhesion, oxygen response, and proteomic profiles. Model-based multi-omics approaches identified key differences between these cells’ regulatory networks involved in morphology and survival. These results provide a multifaceted description of cellular parameters of two widely used cell lines and demonstrate the value of the PS-OC Network approach for integration of diverse experimental observations to elucidate the phenotypes associated with cancer metastasis.
This document discusses the role of the cell polarity regulator PARD3 in lung squamous cell carcinomas (LSCC). The key points are:
1. PARD3 was found to be somatically and biallelically inactivated through mutations, deletions, and promoter hypermethylation in 8% of examined LSCC tumors and cell lines. Most alterations resulted in truncated or non-functional PARD3 proteins.
2. Reconstitution of normal PAR3 activity in vivo reduced the invasive and metastatic properties of tumors, suggesting PARD3 acts as a tumor suppressor in LSCC.
3. PARD3 alterations prevented the formation of tight junctions between cells and the downstream signaling of STAT
Overexpression of YAP 1 contributes to progressive features and poor prognosi...Enrique Moreno Gonzalez
Yes-associated protein 1 (YAP 1), the nuclear effector of the Hippo pathway, is a key regulator of organ size and a candidate human oncogene in multiple tumors. However, the expression dynamics of YAP 1 in urothelial carcinoma of the bladder (UCB) and its clinical/prognostic significance are unclear.
Cancer recognition from dna microarray gene expression data using averaged on...IJCI JOURNAL
Cancer is a leading cause of death, responsible for around 13% of all deaths worldwide. The cancer incidence rate is growing at an alarming rate. Although cancer is preventable and curable in its early stages, the vast majority of patients are diagnosed very late. Therefore, it is of paramount importance to prevent and detect cancer early. Nonetheless, conventional methods of detecting and diagnosing cancer rely solely on skilled physicians, with the help of medical imaging, to detect certain symptoms that usually appear in the late stages of cancer. Microarray gene expression technology is a promising technology that can detect cancerous cells in the early stages of cancer by analyzing gene expression in tissue samples. It allows researchers to examine the expression of thousands of genes simultaneously. This paper describes a state-of-the-art machine learning approach called averaged one-dependence estimators with subsumption resolution to tackle the problem of recognizing cancer from DNA microarray gene expression data. To lower the computational complexity and to increase the generalization capability of the system, we employ an entropy-based gene selection approach to select relevant genes that are directly responsible for cancer discrimination. The proposed system achieved an average accuracy of 98.94% in recognizing and classifying cancer over 11 benchmark cancer datasets. The experimental results demonstrate the efficacy of our framework.
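As a rough illustration of what entropy-based gene selection can look like, the sketch below scores a discretized gene by its information gain with respect to the class label. The function names and the toy data are hypothetical, and the abstract does not say which entropy criterion the authors actually use, so this should be read as one plausible instance rather than their method.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(gene_levels, labels):
    """Reduction in label entropy after splitting samples by a discretized gene."""
    n = len(labels)
    conditional = 0.0
    for level in set(gene_levels):
        subset = [lab for g, lab in zip(gene_levels, labels) if g == level]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional

# Toy data (hypothetical): four samples, two genes discretized to "hi"/"lo".
labels = ["cancer", "cancer", "normal", "normal"]
informative_gene = ["hi", "hi", "lo", "lo"]    # separates the classes perfectly
uninformative_gene = ["hi", "lo", "hi", "lo"]  # carries no class information

print(information_gain(informative_gene, labels))
print(information_gain(uninformative_gene, labels))
```

Genes would then be ranked by this score and only the top-scoring ones passed to the classifier, which is what reduces the computational cost.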
Recently, a phase II clinical trial in hepatocellular carcinoma (HCC) has suggested that the combination of sorafenib and 5-fluorouracil (5-FU) is feasible and side effects are manageable. However, preclinical experimental data explaining the interaction mechanism(s) are lacking. Our objective is to investigate the anticancer efficacy and mechanism of combined sorafenib and 5-FU therapy in vitro in HCC cell lines MHCC97H and SMMC-7721.
Interrogating differences in expression of targeted gene sets to predict brea...Enrique Moreno Gonzalez
Genomics provides opportunities to develop precise tests for diagnostics, therapy selection and monitoring. From analyses of our studies and those of published results, 32 candidate genes were identified, whose expression appears related to clinical outcome of breast cancer. Expression of these genes was validated by qPCR and correlated with clinical follow-up to identify a gene subset for development of a prognostic test.
Clinical and experimental studies regarding the expression and diagnostic val...Enrique Moreno Gonzalez
Carcinoembryonic antigen-related cell adhesion molecule 1 (CEACAM1) is a multifunctional Ig-like cell adhesion molecule that has a wide range of biological functions. According to previous reports, serum CEACAM1 is dysregulated in different malignant tumours and associated with tumour progression. However, the serum CEACAM1 expression in nonsmall-cell lung carcinomas (NSCLC) is unclear. The different expression ratio of CEACAM1-S and CEACAM1-L isoform has seldom been investigated in NSCLC. This research is intended to study the serum CEACAM1 and the ratio of CEACAM1-S/L isoforms in NSCLC.
This study characterized the non-coding RNA landscape in head and neck squamous cell carcinoma (HNSCC) using RNA sequencing data from 422 HNSCC patients. 307 non-coding transcripts were found to be significantly correlated with patient survival, associated with mutations in cancer genes like TP53 and CDKN2A, and correlated with copy number variations in chromosomes 3, 5, 7, and 18. Experimental validation of 3 selected non-coding RNAs - lnc-JPH1-7, miR-654-3p, and piR-34736 - found their expression levels associated with tumor stage, HPV status, and other clinical characteristics. Modulation of lnc-JPH1-7 expression in cell lines
In Silico Prescription of Anticancer Drugs Reveals Targeting OpportunitiesNuria Lopez-Bigas
Large efforts dedicated to sequence thousands of tumor genome/exomes are expected to lead to significant improvements of precision cancer medicine. However, high inter-tumor heterogeneity is a major obstacle in the road to develop an arsenal of targeted cancer drugs to treat most cancer patients. Therefore, it is critical to understand the current scope of anti-cancer targeted drugs for different tumor types in order to use them with the highest efficacy, and to define priorities for the development of new ones. We have developed a novel methodology to interpret the genomes of a cohort of tumor samples and to assess their therapeutic opportunities. Starting with somatic mutations detected across the cohort, the methodology identifies the driver genes, highlights those that dominate the clonal landscape of the tumors and determines their mode of action. It then does an in-silico prescription of approved and candidate targeted drugs to each patient in the cohort. The application of this approach to a cohort of 6795 cancer samples of 28 different tumor types showed that the fraction of patients that could benefit from prescribed FDA-approved drugs is strikingly small. Nevertheless, it improves significantly if repurposing opportunities are taken into consideration, with large differences between tumor types. In addition, we identify 80 therapeutically unexploited cancer genes, tightly bound by pre-clinical small molecules or potentially suitable for molecule binding. The resource created with this analysis is also intended to provide interpretation of newly sequenced cancer genomes and to design pan-cancer and tumor type specific sequencing panels for efficient early cancer detection and clinical insight.
More details at http://www.intogen.org
The document discusses topics related to practicing bioinformatics including:
- Installing and working with the TextPad text editor
- Regular expressions (regex), including patterns, quantifiers, anchors, grouping, alternation, and variable interpolation
- Using regex memory variables ($1, $2, etc.) to extract matched substrings
- The s/// substitution operator and tr/// translation operator
- Applying these skills to tasks like finding restriction enzyme cut sites in DNA sequences
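The summarized material works in Perl; the sketch below shows the same ideas with Python's standard `re` module instead, which is a substitution made here for illustration. It locates restriction sites with a pattern and mimics a `tr///`-style translation. The EcoRI recognition sequence GAATTC is used as the example, and the function names are hypothetical.

```python
import re

ECORI = "GAATTC"  # EcoRI recognition sequence, used here as an example

def find_sites(dna, site=ECORI):
    """Return 0-based start positions of every occurrence of site in dna."""
    # A lookahead lets finditer report overlapping occurrences as well.
    pattern = f"(?={re.escape(site)})"
    return [m.start() for m in re.finditer(pattern, dna.upper())]

def reverse_complement(dna):
    """Rough Python analogue of Perl's tr/ACGT/TGCA/ followed by reverse."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

seq = "ttGAATTCaaaGAATTCgg"
print(find_sites(seq))
print(reverse_complement("GAATTC"))
```

Because GAATTC is palindromic, its reverse complement is the same string, so one scan of a single strand finds the sites on both strands.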
This document summarizes a presentation on next generation epigenetic profiling. It introduces epigenetics and how epigenetic changes like DNA methylation are important in causing cancer in addition to genetic changes. It discusses using methyl-binding domain sequencing to discover genome-wide methylation patterns and biomarkers. Examples are given of specific genes like MGMT and BRCA1 that show methylation changes in cancer. Integrating deep sequencing data with other assays is described to better understand methylation patterns and their effects on gene expression and cancer. Developing targeted panels of cancer-related genes with known epigenetic alterations is discussed for clinical applications.
The document outlines the schedule and content for a bioinformatics course. It includes 10 lessons covering topics like biological databases, sequence alignments, database searching, phylogenetics, and protein structure. It also mentions that the final exam will include randomly generated images from a set of 713 images.
The document discusses BioPerl, an open source collection of Perl modules for bioinformatics tasks. It provides examples of using BioPerl to work with sequence objects, read sequences from files in different formats, and retrieve sequences from GenBank. Methods are demonstrated for looping through sequences, converting file formats, and calculating properties like isoelectric points. The most acidic and basic amino acids can be identified by isoelectric point, and there is a biological explanation for these results.
- The document outlines topics related to molecular biology databases including flat file sequence databases containing DNA, protein, and structure data, as well as relational databases.
- Examples of flat file formats like GenBank format are provided, showing how sequence data is stored in plain text files. Examples of sequence database resources like GenBank, EMBL, and DDBJ are also listed.
- The use of quality values assigned by Phred to base calls in DNA sequencing traces is discussed, and how these values are used by Phrap to determine consensus sequences from aligning multiple reads.
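For reference, a Phred quality value Q relates to the estimated base-calling error probability P by Q = -10 log10(P). The helper names below are hypothetical, and the sketch covers only this conversion, not how Phrap weighs the values when building a consensus.

```python
import math

def phred_to_error(q):
    """Estimated probability that a base call with quality q is wrong."""
    return 10 ** (-q / 10)

def error_to_phred(p):
    """Phred quality value corresponding to error probability p."""
    return -10 * math.log10(p)

for q in (10, 20, 30):
    print(q, phred_to_error(q))
```

A quality of 20 therefore corresponds to an estimated 1 in 100 chance of a wrong call, and 30 to 1 in 1000.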
This document provides an overview of bioinformatics databases and file formats for storing genetic sequence data. It discusses flat file databases like GenBank that store sequences in plain text formats. It also describes relational databases that allow querying across related data fields. Examples of biological relational databases and tools for working with sequence data files are also presented.
An expression meta-analysis of predicted microRNA targets identifies a diagno...Yu Liang
This study identifies a 17-gene expression signature that can accurately classify lung adenocarcinoma and squamous cell carcinoma, the two major histological subtypes of lung cancer. The signature was developed by predicting microRNA target genes, filtering them based on gene ontology terms and ability to classify cancers, and analyzing gene expression data from multiple datasets. When validated in independent datasets, the signature correctly classified 87% of adenocarcinoma and 82% of squamous cell carcinoma samples on average. Expression of the signature also showed potential for early lung cancer detection in bronchial epithelial cells from smokers.
Increased Expression of GNL3L is Associated with Aggressive Phenotypes in Col...JohnJulie1
This study investigated the role and clinical significance of GNL3L in colorectal cancer (CRC). Analysis of public gene expression databases showed that GNL3L is overexpressed in CRC tissues compared to normal tissues. Knockdown of GNL3L in CRC cell lines suppressed cell proliferation in vitro and tumor growth in mouse models. Further experiments revealed that GNL3L promotes CRC proliferation through the ERK/MAPK signaling pathway. This study suggests that high GNL3L expression may serve as a potential therapeutic target in CRC and that GNL3L promotes CRC by activating the ERK/MAPK pathway.
This study performed a genome-wide analysis of DNA methylation in colorectal carcinoma (CRC) tissue samples from 24 Bangladeshi patients. The researchers found a total of 627 differentially methylated loci covering 513 genes when comparing CRC tissue to normal adjacent tissue, with 535 loci covering 465 genes being newly identified. Gene set enrichment analysis showed hypermethylation in CRC of gene sets related to inhibition of adenylate cyclase activity, Rac guanyl-nucleotide exchange factor activity, regulation of retinoic acid receptor signaling, and estrogen receptor activity. Predictive models based on differentially methylated loci showed potential for CRC diagnosis with around 89% sensitivity and specificity.
This study profiled circulating microRNAs (miRNAs) in human plasma and serum samples using deep sequencing of small RNA libraries. The researchers detected placental-specific miRNAs in maternal and newborn circulation, quantifying their relative abundance. They also found that sequence variations in placental miRNA profiles could be traced to the specific placenta of origin. This deep sequencing approach provides a comprehensive characterization of the miRNA content in circulation and establishes its potential for biomarker discovery and noninvasive detection of diseases originating from solid tissues like tumors or the placenta during pregnancy.
Prognostic Value of LINC01600 and CASC15 as Competitive Endogenous RNAs in Lu...daranisaha
This document describes a study that analyzed genetic alteration profiles of long non-coding RNAs (lncRNAs) and microRNAs (miRNAs) in lung adenocarcinoma (LUAD) samples from The Cancer Genome Atlas database. It constructed a LUAD-related lncRNA-miRNA-mRNA competitive endogenous RNA network including 24 lncRNAs, 21 miRNAs and 142 mRNAs. Two lncRNAs, LINC01600 and CASC15, showed potential as prognostic biomarkers for LUAD patient outcomes like overall survival and recurrence risk.
This document summarizes research on analyzing differential miRNA-mRNA co-expression networks in colorectal cancer. The researchers analyzed paired expression data from cancer and normal tissues to identify changes in interactions. They found that cancer networks have decreased connectivity and identified differentially connected genes, including known cancer genes. Pathway analysis revealed an alteration in colorectal cancer tissues in the interplay between miRNAs and the eukaryotic translation initiation factor 3 complex, which is important for translation. Certain miRNAs were also identified as having many differentially co-expressed target mRNAs.
This document reviews the role of microRNAs (miRNAs) in cancer. It discusses how miRNAs are involved in regulating gene expression and cellular processes. Aberrant miRNA expression has been found in many cancer types and can influence cancer-related signaling pathways. The document summarizes the mechanisms that can lead to abnormal miRNA expression levels in cancer, including genetic alterations, epigenetic changes, and defects in the miRNA biogenesis pathway. It also discusses how miRNA profiling can be used for cancer diagnosis and how circulating miRNA levels in body fluids are being investigated as potential non-invasive diagnostic biomarkers.
Cancer is one of the deadliest diseases in the world and is responsible for around 13% of all deaths worldwide. The cancer incidence rate is growing at an alarming rate. Although cancer is preventable and curable in its early stages, the vast majority of patients are diagnosed very late. Furthermore, cancer commonly comes back after years of treatment. Therefore, it is of paramount importance to predict cancer recurrence so that specific treatments can be sought. Nonetheless, conventional methods of predicting cancer recurrence rely solely on histopathology, and the results are not very reliable. Microarray gene expression technology is a promising technology that could predict cancer recurrence by analyzing the gene expression of sample cells. It allows researchers to examine the expression of thousands of genes simultaneously. This paper describes a state-of-the-art machine learning approach called averaged one-dependence estimators with subsumption resolution to tackle the problem of predicting, from DNA microarray gene expression data, whether a particular cancer will recur within a specific timeframe, which is usually 5 years. To lower the computational complexity, we employ an entropy-based gene selection approach to select relevant prognostic genes that are directly responsible for recurrence prediction. The proposed system achieved an average accuracy of 98.9% in predicting cancer recurrence over 3 datasets. The experimental results demonstrate the efficacy of our framework.
1) The authors are using the CRISPR-Cas9 system to induce double-strand breaks near centromeres on chromosomes 3p and 8p in order to generate models of partial aneuploidy through homologous recombination.
2) They have successfully targeted and induced breaks on these chromosomes, and selected for cells that underwent recombination to replace the chromosome arm with an artificial telomere.
3) In the future, they aim to characterize the phenotypic and tumorigenic effects of specific chromosomal arm losses to further understand their role in cancer formation and progression.
This document discusses the use of artificial intelligence and machine learning techniques to integrate multiomics data for precision oncology. It defines multiomics data as including imaging radiomics and various types of molecular biomarkers obtained from omics technologies. The document outlines how multiomics data can be used for cancer diagnosis, prognosis, prediction, risk stratification, and treatment monitoring and adaptation. It also describes various data integration and machine learning modeling methods that can be applied to multiomics data, as well as challenges around data heterogeneity, availability, and clinical validation.
Construction and Validation of Prognostic Signature Model Based on Metastatic...daranisaha
Colorectal Cancer (CRC) is a common malignant cancer with a poor prognosis. Liver metastasis is the dominant cause of death in CRC patients, and it often involves changes in various gene expression profiling. This study proposed to construct and validate a risk model based on differentially expressed genes between the primary and liver metastatic tumors from CRC for prognostic prediction.
The document discusses the application of gene chips (microarrays) in various fields of medicine. It describes how microarray technology allows for the analysis of large numbers of gene expressions and samples to study cancers, oral lesions, antibiotic treatments, and more. Microarrays are being used in areas like cancer research, forensic analysis, toxicology and more. They provide major insights into gene functions and can help with early disease detection, prognosis, and pharmacological therapy development.
Forecasting clinical behavior and therapeutic response of human cancer currently utilizes a limited number of tumor markers in combination with characteristics of the patient and their disease. Although few tumor markers and molecular targets exist for evaluation, the wealth of information derived from recent sequencing advancements provides greater opportunities to develop more precise tests for diagnostics, prognostics, therapy selection and monitoring in the future. The objectives of this study are to study miRNA and mRNA expression profiles of laser capture microdissection (LCM)-procured tumor cells and intact serial sections of breast tissue samples using next generation sequencing (NGS) methods. Our hypothesis is that miRNA signatures discerned from specific tumor cell populations more precisely correlate with behavior than that provided by conventional biomarkers from intact tissue samples. Additionally, we hypothesize the data generated in this study will present mRNA signatures informative for breast tumor research and support our miRNA findings through suggesting relevant miRNA:mRNA target associations.
De-identified frozen research samples of primary invasive ductal tumors of known grade and biomarker status containing 35-70% tumor were selected from an IRB-approved Biorepository. Comparison of expressed miRNAs from intact tissue sections with those of cognate tumor cells procured by LCM revealed, in general, that smaller defined miRNA gene sets were expressed in LCM isolated populations of tumor cells. In addition to miRNA sequencing, targeted RNA sequencing with the Ion AmpliSeq™ Transcriptome Human Gene Expression Kit was used to capture mRNA expression information. Data presented here demonstrates high mapping rates for targeted mRNA (>91% of reads) and miRNA (> 88% of reads) libraries. We also demonstrate high technical reproducibility between multiple libraries from the same tumor sample for both mRNA (R>0.99) and miRNA (R>0.97) libraries. We also report suggested miRNA:mRNA target associations identified in our set of breast tumor research samples. These data provide insights into breast cancer biology that may lead to new molecular diagnostics and targets for drug design in the future as well as an improved understanding of the molecular basis of clinical behavior and potential therapeutic response.
CXCL1, CCL20, STAT1 was Identified and Validated as a Key Biomarker Related t...semualkaira
Growing evidence suggests a correlation between ulcerative colitis (UC) and immune markers. The pathogenesis of UC has not yet been clearly elucidated, and few studies on immune-related biomarkers have been published.
Presentation by Scott Woodman, MD, PhD. Presented at the 2018 Eyes on a Cure: Patient & Caregiver Symposium, hosted by the Melanoma Research Foundation's CURE OM initiative.
1) Researchers developed a blood-based test called TriMeth for detecting colorectal cancer (CRC) using three tumor-specific DNA methylation markers (C9orf50, KCNQ5, and CLIP4) measured in circulating cell-free DNA.
2) TriMeth was able to detect CRC with 85% sensitivity (218/256 patients) and 99% specificity (176/178 individuals) across two independent plasma cohorts, including 80% sensitivity for stage I CRC.
3) The three markers were identified through a biomarker discovery study involving DNA methylation profiling of over 5000 tumor and blood cell samples, and were validated to be hypermethylated in CRC and adenoma tissues but not in
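The sensitivity and specificity figures quoted above follow directly from the reported counts. The short check below reproduces the arithmetic; the helper name is hypothetical.

```python
def proportion(hits, total):
    """Fraction of correctly classified cases among all cases of one kind."""
    return hits / total

# Counts as reported in the summary above.
sensitivity = proportion(218, 256)  # CRC patients correctly detected
specificity = proportion(176, 178)  # individuals without CRC correctly called negative

print(f"sensitivity = {sensitivity:.1%}")
print(f"specificity = {specificity:.1%}")
```

Rounded to whole percentages, these give the 85% and 99% stated in the summary.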
A Review Of Databases Predicting The Effects Of SNPs In MiRNA Genes Or MiRNA-...Jennifer Daniel
This document reviews several databases that predict the effects of single nucleotide polymorphisms (SNPs) in microRNA (miRNA) genes and miRNA binding sites. It compares the core functionalities and features of databases like miRdSNP, MirSNP, PolymiRTS and miRNASNP. These databases store predictions of how SNPs may impact miRNA-target relationships by creating or deleting miRNA response elements (MREs) or changing binding affinity. The review evaluates these databases and outlines recommendations for future developments, highlighting the need for regularly updating SNP and miRNA data to keep predictions current.
Transcriptome Analysis of Spontaneous PDFJanaya Shelly
This document summarizes an analysis of gene expression in mouse lung tumors using RNA sequencing. Five mouse lung tumors - two spontaneous and three genetically engineered - were analyzed along with normal lung tissue. The top expressing genes in each tumor type were identified and their biological functions analyzed. Spontaneous tumors were associated with homeostatic processes while engineered tumors exhibited genes related to metastasis and immune response. The study aims to gain insights into lung cancer genetics using mouse models.
Three key points:
1. A kinome-centered synthetic lethality screen identified that suppression of the ERBB3 receptor tyrosine kinase sensitizes KRAS mutant lung and colon cancer cells to MEK inhibitors.
2. MEK inhibition results in MYC-dependent transcriptional upregulation of ERBB3, which is responsible for intrinsic drug resistance.
3. Drugs targeting both EGFR and ERBB2, each capable of forming hetero-dimers with ERBB3, can reverse unresponsiveness to MEK inhibition by decreasing inhibitory phosphorylation of the proapoptotic proteins BAD and BIM.
This document provides an overview of bioinformatics and biological databases. It discusses how bioinformatics draws from fields like biology, computer science, statistics, and machine learning. Biological databases are important resources for bioinformatics that can be searched and analyzed to answer questions, find similar sequences, locate patterns, and make predictions. The document also outlines common uses of biological databases, such as annotation searches, homology searches, pattern searches, and predictive analyses.
The document discusses the Rh blood group system and its clinical significance. It describes the key observations in 1939 that linked adverse reactions in mothers to stillborn fetuses and blood transfusions from fathers, indicating a relationship. This syndrome is now called hemolytic disease of the fetus and newborn. The Rh system was identified in 1940 through experiments immunizing animals with Rhesus macaque monkey red blood cells. The D antigen is the most important RBC antigen in transfusion practice, as those lacking it do not produce anti-D antibody unless exposed to D antigen through transfusion or pregnancy. Testing for D is routinely performed to ensure D-negative patients receive D-negative blood.
The document discusses views and materialized views in data warehousing and decision support systems. It covers three main points:
1) OLAP queries typically involve aggregate queries, so precomputation is essential for fast response times. Materialized views allow precomputing aggregates across multiple dimensions.
2) Warehouses can be thought of as collections of asynchronously replicated tables and periodically maintained views, renewing interest in efficient view maintenance.
3) Materialized views store the results of views in the database for fast access like a cache, but they require maintenance as underlying tables change. Incremental maintenance algorithms are ideal to efficiently update materialized views.
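To make the idea of incremental maintenance concrete, here is a minimal sketch of a materialized SUM-per-group view kept up to date as base rows are inserted and deleted, without rescanning the base table. The class and method names are hypothetical and do not correspond to any particular database system.

```python
from collections import defaultdict

class SumView:
    """Materialized 'SELECT key, SUM(amount) GROUP BY key', maintained incrementally."""

    def __init__(self):
        self.totals = defaultdict(int)
        self.counts = defaultdict(int)  # row count per group, to know when a group vanishes

    def insert(self, key, amount):
        # Apply only the delta contributed by the new base row.
        self.totals[key] += amount
        self.counts[key] += 1

    def delete(self, key, amount):
        # Undo the contribution of the deleted base row.
        self.totals[key] -= amount
        self.counts[key] -= 1
        if self.counts[key] == 0:
            del self.totals[key]
            del self.counts[key]

view = SumView()
view.insert("north", 10)
view.insert("north", 5)
view.insert("south", 7)
view.delete("north", 10)
print(dict(view.totals))
```

The per-group row count is kept alongside the sum so that a group can be dropped from the view once its last base row is deleted; aggregates such as MIN or MAX need more bookkeeping than this because a deletion can remove the current extreme.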
The document discusses various database concepts including normalization, which is used to design optimal relation schemas by removing redundant data. It also covers transaction processing, which involves executing logical database operations as transactions to maintain data integrity. Database systems use techniques like logging and concurrency control to prevent transaction anomalies and ensure failures can be recovered from.
This document contains a list of names, emails, and study programs of students. It includes their official student code, last name, first name, email, and educational program. There are 20 students listed with their details.
This document discusses the Biological Databases project being conducted by a group of students. The project involves using the video game Minecraft to visualize protein structures retrieved from the Protein Data Bank (PDB). Python scripts are used to import PDB data files and place blocks in Minecraft to represent atoms, with different block colors used to distinguish atom types. SPARQL queries are also employed to search the RDF version of the PDB for protein entries. The goal is to build 3D protein models inside Minecraft for educational and visualization purposes.
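The import step described above can be sketched as follows, assuming the standard fixed-column layout of PDB ATOM records (x, y, z in columns 31-54 and the element symbol in columns 77-78). The function names, the scale parameter, and the sample record are hypothetical, and the call that actually places a block in Minecraft is deliberately left out because the project's API is not described here.

```python
def parse_atom(line):
    """Return (element, x, y, z) from one PDB ATOM/HETATM record, else None."""
    if not line.startswith(("ATOM", "HETATM")):
        return None
    x = float(line[30:38])  # columns 31-38
    y = float(line[38:46])  # columns 39-46
    z = float(line[46:54])  # columns 47-54
    element = line[76:78].strip()  # columns 77-78
    return element, x, y, z

def to_block(x, y, z, scale=1.0):
    """Map angstrom coordinates to integer block coordinates (scale is an assumption)."""
    return round(x * scale), round(y * scale), round(z * scale)

# Hypothetical sample record, assembled field by field so the columns line up.
sample = (
    "ATOM  " + f"{1:5d}" + " " + f"{' N':<4s}" + " " + f"{'MET':3s}" + " " + "A"
    + f"{1:4d}" + " " + "   " + f"{27.340:8.3f}{24.430:8.3f}{2.614:8.3f}"
    + f"{1.00:6.2f}{9.67:6.2f}" + " " * 10 + f"{'N':>2s}"
)

element, x, y, z = parse_atom(sample)
print(element, to_block(x, y, z))
```

The element symbol is what a script would use to choose a block color per atom type.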
The document discusses various bioinformatics tools and algorithms for analyzing protein sequences, including Biopython for working with biological sequence data, the Kyte-Doolittle algorithm for predicting transmembrane regions, and the Chou-Fasman algorithm for predicting secondary structure from amino acid preferences for alpha helices, beta sheets, and random coils. It also provides examples of analyzing Swiss-Prot data to find properties of human proteins and applying these tools and libraries to extract insights from protein sequences.
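A sliding-window hydropathy profile using the Kyte-Doolittle scale can be sketched in plain Python as below. The function name and the default window length are assumptions made for illustration; the appropriate window and any threshold for calling a transmembrane segment depend on the analysis.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=9):
    """Mean hydropathy of each window of consecutive residues along seq."""
    scores = [KYTE_DOOLITTLE[aa] for aa in seq.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]

# Toy sequence (hypothetical): a hydrophobic stretch flanked by charged residues.
profile = hydropathy_profile("KKDDLLIVVLLAIIKKEE", window=5)
print(max(profile))
```

Windows with a high mean score mark hydrophobic stretches, which is the signal used when looking for candidate membrane-spanning regions.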
The document discusses various topics related to analyzing protein sequences using Python and Biopython. It provides examples of using Biopython to parse sequence data from UniProt, calculate lengths and translations of sequences. It also discusses analyzing properties of sequences like molecular weight, isoelectric point, transmembrane regions, and comparing sequences to find conserved motifs. Finally, it introduces hydropathy indices and tools for predicting properties like transmembrane helices from primary sequences.
This document discusses Python functions. It explains that there are built-in functions provided as part of Python and user-defined functions. User-defined functions are created using the def keyword and can take parameters and return values. The body of a function is indented and runs when the function is called. Functions allow code to be reused and organized in a modular way. Examples are provided to demonstrate defining and calling functions with different parameters and return values.
The document provides a recap of Python programming concepts like conditions and statements, while loops, for loops, break and continue statements, and working with strings. It also introduces regular expressions as a way to match patterns in strings using a formal language that can be interpreted by a regular expression processor.
[SUMMARY
This document discusses next generation DNA sequencing technologies. It begins by describing some of the limitations of traditional Sanger sequencing, such as read lengths of 500-1000 bases and throughput of 57,000 bases per run. It then introduces some key next generation sequencing technologies, such as 454 sequencing which uses emulsion PCR and pyrosequencing to achieve read lengths of 20-100 bases but higher throughput of 20-100 Mb per run. Illumina/Solexa sequencing is also discussed, which uses sequencing by synthesis with reversible terminators and laser-based detection. Finally, third generation sequencing technologies are mentioned, such as Pacific Biosciences' single molecule real time sequencing and nanopore sequencing. In summary, the document provides a high-level
The document provides an overview of the history and evolution of various programming languages. It discusses early languages like FORTRAN, LISP, PASCAL, C, and Java. It also covers scripting languages and their uses. The document explains what Python is as a programming language - that it is interpreted, object-oriented, and high-level. It was named after Monty Python and was created by Guido van Rossum. The document then gives examples of using Python to program Minecraft by importing protein data from PDB files and using coordinates to place blocks to visualize proteins in the game.
This document provides an introduction to bio-ontologies and the semantic web. It discusses what ontologies are and how they are used in the bio domain through initiatives like the OBO Foundry. It introduces key semantic web technologies like RDF, URIs, Turtle syntax, and SPARQL query language. It provides examples of ontologies like the Gene Ontology and how ontologies can be represented and queried using these semantic web standards.
This document provides an overview of NoSQL databases, including:
- Key-value stores store data as maps or hashmaps and are efficient for data access but limited in query capabilities.
- Column-oriented stores group attributes into column families and store data efficiently but are operationally challenging.
- Document databases store loosely structured data like JSON and allow retrieving documents by keys or contents.
- Graph databases are suited for interaction networks and path finding but are less suited for tabular data.
The document discusses creating a multicore database project. It recommends taking the following steps:
1. Define what the project is about, what it aims to achieve, and who it is for.
2. Identify information resources and develop a basic data model.
3. Design a user interface mockup without technical constraints, thinking creatively.
This document discusses biological databases and PHP. It begins with an overview of biological databases and examples using BIOSQL to load genetic data from GenBank into a MySQL database. It then provides examples of building a basic 3-tier model with Apache, PHP, and a MySQL backend database. The document also includes a brief introduction to PHP, covering its history, why it is commonly used, and basic syntax like conditional statements.
This document discusses biological databases and SQL. It provides an overview of primary and derived data in biological research, as well as different data levels. It then discusses direct querying of selected bioinformatics databases using SQL and provides examples of 3-tier database models. The document proceeds to discuss rationale for learning SQL to query biological databases and provides definitions and explanations of key SQL concepts like tables, records, queries, data types, keys, integrity rules and constraints.
This document discusses biological databases and bioinformatics. It begins with an overview of bioinformatics as an interdisciplinary field combining biology, computer science, and information technology. It then discusses different types of biological databases, including those focused on sequences, pathways, protein structures, and gene expression. The document outlines some common uses of biological databases, including searching for annotations, identifying similar sequences through homology, searching for patterns, and making predictions. It also briefly discusses comparing data across databases. The summary provides a high-level overview of the key topics and uses of biological databases covered in the document.
The document discusses several topics related to protein structure prediction using Python:
1. It introduces the Chou-Fasman algorithm for predicting protein secondary structure from amino acid sequence. The algorithm calculates preference parameters for each amino acid to be in alpha helices, beta sheets, or other structures.
2. It provides an example of calculating helical propensity.
3. It lists the preference parameters output by the Chou-Fasman algorithm for each amino acid.
4. It outlines the steps of applying the Chou-Fasman algorithm to predict secondary structure elements in a protein sequence.
The document provides information on various Python programming concepts including control structures, lists, dictionaries, regular expressions, exceptions, and biological applications using Biopython. It discusses if/else statements, while and for loops, list operations, dictionary usage, regex patterns, exception handling roles, and gives examples analyzing protein sequences and structures using Biopython.
ARTICLE
Received 9 Jul 2013 | Accepted 9 Oct 2013 | Published 13 Nov 2013
DOI: 10.1038/ncomms3730
OPEN
Identification of a pan-cancer oncogenic microRNA
superfamily anchored by a central core seed motif
Mark P. Hamilton1, Kimal Rajapakshe1, Sean M. Hartig1, Boris Reva2, Michael D. McLellan3, Cyriac Kandoth3,
Li Ding3,4,5, Travis I. Zack6, Preethi H. Gunaratne7,8, David A. Wheeler8, Cristian Coarfa1 & Sean E. McGuire1,9
MicroRNAs modulate tumorigenesis through suppression of specific genes. As many tumour
types rely on overlapping oncogenic pathways, a core set of microRNAs may exist, which
consistently drives or suppresses tumorigenesis in many cancer types. Here we integrate The
Cancer Genome Atlas (TCGA) pan-cancer data set with a microRNA target atlas composed
of publicly available Argonaute Crosslinking Immunoprecipitation (AGO-CLIP) data to
identify pan-tumour microRNA drivers of cancer. Through this analysis, we show a
pan-cancer, coregulated oncogenic microRNA ‘superfamily’ consisting of the miR-17, miR-19,
miR-130, miR-93, miR-18, miR-455 and miR-210 seed families, which cotargets critical
tumour suppressors via a central GUGC core motif. We subsequently define mutations in
microRNA target sites using the AGO-CLIP microRNA target atlas and TCGA exome-sequencing data. These combined analyses identify pan-cancer oncogenic cotargeting of the
phosphoinositide 3-kinase, TGFβ and p53 pathways by the miR-17-19-130 superfamily
members.
1 Department of Molecular and Cellular Biology, Baylor College of Medicine, 1 Baylor Plaza Houston M822, Houston, Texas 77030, USA. 2 Computational
Biology Center, Memorial Sloan Kettering Cancer Center, New York, New York 10065, USA. 3 The Genome Institute, Washington University, St Louis, Missouri
63108, USA. 4 Department of Genetics, Washington University, St Louis, Missouri 63110, USA. 5 Siteman Cancer Center, Washington University, St Louis,
Missouri 63110, USA. 6 The Eli and Edythe L Broad Institute of Massachusetts Institute of Technology and Harvard University, Cambridge, Massachusetts
02142, USA. 7 Department of Biology and Biochemistry, University of Houston, 4800 Calhoun, Houston 77204, Texas, USA. 8 The Human Genome
Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA. 9 Division of Radiation Oncology, The University of Texas MD Anderson Cancer
Center, Houston, Texas 77030, USA. Correspondence and requests for materials should be addressed to S.E.M. (email: sean.mcguire@bcm.edu).
NATURE COMMUNICATIONS | 4:2730 | DOI: 10.1038/ncomms3730 | www.nature.com/naturecommunications
© 2013 Macmillan Publishers Limited. All rights reserved.
MicroRNAs are single-stranded RNA molecules (~22
nucleotides) that repress messenger RNA translation1
and promote mRNA degradation2,3. MicroRNAs are
critical regulators of oncogenesis and their regulation of cancer
cell signalling is complex. Global microRNA expression is often
repressed in cancer4–7. However, some microRNAs are
oncogenic7–10, exhibiting amplified expression in many tumour
types.
Facilitated by Argonaute proteins, microRNAs bind target
mRNAs in the RNA-induced silencing complex. MicroRNA
target regulation is canonically mediated by nucleotides 2–8 on
the 5′-end of the microRNA strand, termed the microRNA
seed11. A minimum of six consecutive nucleotides is required to
pair the microRNA with its target mRNA11,12. This minimal
binding requirement allows a given microRNA to potentially bind
tens, hundreds or thousands of mRNA targets13.
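The seed-pairing rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the method used in the study: the function names are hypothetical, only perfect Watson-Crick pairing of seed nucleotides 2–7 is checked, and the example sequence is intended to be mature miR-17-5p but should be treated as illustrative input and checked against miRBase before reuse.

```python
def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA sequence (A/C/G/U only)."""
    pairs = {"A": "U", "U": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(rna.upper()))


def seed(mirna: str) -> str:
    """Nucleotides 2-8 (1-based) from the 5' end of the mature microRNA."""
    return mirna[1:8]


def seed_sites(mirna: str, target: str, min_len: int = 6) -> list[int]:
    """0-based start positions in `target` that are perfectly complementary
    to the first `min_len` seed nucleotides (positions 2-7 when min_len=6).
    Simplifying assumption: no wobble pairs, bulges or offset seeds.
    """
    site = reverse_complement(seed(mirna)[:min_len])
    target = target.upper()
    return [i for i in range(len(target) - len(site) + 1)
            if target[i:i + len(site)] == site]


# Illustrative input (assumed to be miR-17-5p) and a made-up target fragment.
MIR17 = "CAAAGUGCUUACAGUGCAGGUAG"
print(seed(MIR17))                       # AAAGUGC
print(seed_sites(MIR17, "AAGCACUUUAA"))  # [3]
```

Because a six-nucleotide match is so short, such a scan returns many candidate sites on any long transcript, which is the reason experimental evidence of binding is valuable.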
One difficulty in determining the functions of microRNAs in
tumours is the wide array of potential genes that any microRNA
might regulate. Established microRNA target prediction algorithms are based on inference, relying on evolutionary conservation of 3′-untranslated region (UTR) sequences complementary
to the microRNA seed and biochemical binding context to
determine putative microRNA binding sites14,15. Although these
algorithms are useful for predicting microRNA targets, especially
within the 3′-UTRs, they are not experimental demonstrations of
microRNA–target interactions and are often less able to
accurately predict microRNA binding within protein-coding
regions and non-coding RNAs (ncRNAs) because of reliance on
site-specific conservation11.
Argonaute Crosslinking Immunoprecipitation (AGO-CLIP)
data sets experimentally identify microRNA–target interactions
in a genome-wide manner through purification of Argonaute-protein-associated RNAs, which include bound microRNAs and
their respective targets16–18. In this study, to explore the
microRNA regulatory landscape across the TCGA Pan-Cancer
project19, which includes data from breast adenocarcinoma
(BRCA), lung adenocarcinoma (LUAD), lung squamous cell
carcinoma (LUSC), uterine corpus endometrioid carcinoma,
glioblastoma multiforme (GBM), head and neck squamous cell
carcinoma (HNSC), colon and rectal carcinoma (COAD, READ),
bladder urothelial carcinoma (BLCA), kidney renal clear cell
carcinoma (KIRC), ovarian serous cystadenocarcinoma (OV),
uterine corpus endometrial carcinoma (UCEC), and acute
myeloid leukemia (LAML), we compiled all publicly available
human AGO-CLIP data17,18,20–24 into a single unified atlas and
ranked individual microRNA target sites by total occurrences
across data sets. We integrated this substantial atlas of microRNA
target sites with TCGA pan-cancer microRNA, mRNA, copy
number variation (CNV) and exome-sequencing data sets to
discover common microRNA regulatory architecture across
tumour types. Finally, we developed an algorithm, miSNP, to
infer somatic mutations in these regulatory binding sites. Our
analysis represents integration of a new resource, the AGO-CLIP
atlas, and TCGA data, creating a method by which we were able
to understand microRNA regulatory architectures across multiple
tumour types. Collectively, this study identified a pan-cancer
oncogenic microRNA (oncomiR) network that cotargets multiple
potent tumour suppressors (TS) through a common core seed
motif.
Results
Global microRNA expression patterns in normal and tumour
tissue. The TCGA pan-cancer data set represents the single
largest compilation of microRNA-sequencing data in cancer
produced to date. Global analysis of microRNA expression
patterns in 4,186 tumours and 334 normal tissue samples
revealed that the top 30 microRNAs constitute, on average, ~90% of
all microRNA expression across heterogeneous normal tissues.
The same 30 microRNAs likewise comprise 80–90% of
microRNA expression in tumours (Fig. 1a,b, Supplementary
Tables S1 and S2).
miR-143 is the single most highly expressed microRNA in
normal tissue, and miR-21 is the most highly expressed
microRNA in cancer (Fig. 1b). MicroRNA expression patterns
undergo global population changes between cancer and normal,
primarily due to increased miR-21 expression (from 6.9 to 19% of
all microRNA detected) and decreased miR-143 expression (from
33 to 11.2% of detectable microRNA) across tumour types.
AGO-CLIP atlas identifies global microRNA binding events.
AGO-CLIP technology employs ultraviolet crosslinking of RNA
to protein followed by immunoprecipitation to determine
RNA species bound to the Argonaute protein (Fig. 2a). AGO
Photoactivatable-Ribonucleoside-Enhanced CLIP (AGO-PAR-CLIP)17 includes an added step where nucleotide analogues
such as 4-thiouridine are introduced before crosslinking. These
nucleotide analogues, when crosslinked, undergo T–C transitions
during the reverse-transcription step of the AGO-CLIP
experiment17, allowing more confident visualization of RNA–
protein interaction (Fig. 2b).
We began by generating a large atlas of microRNA binding
sites by compiling publicly available AGO-CLIP data
(Supplementary Data 1)17,18,20–24. This included 11 AGO-PAR-CLIP libraries and 3 unmodified AGO-CLIP libraries (also called
Argonaute High-Throughput Sequencing of RNA Isolated by
CLIP (AGO-HITS-CLIP))16. The AGO-CLIP atlas allowed
us to integrate experimentally defined microRNA–target interactions with TCGA data to create the most accurate prediction
of microRNA binding patterns across TCGA cancers. The AGO-CLIP seed atlas consists of 124,000 microRNA target clusters
that subsequently infer over 300,000 putative seed motifs within
those clusters. Individual seed sites were used as genomic
anchors to tabulate recurrent definition of a given seed across
14 AGO-CLIP data sets (Fig. 2c, Supplementary Data 1). Clusters
were then randomly permuted across the genome to determine
an exact binomial probability of cluster occurrence at a given seed
complement. False discovery rates (FDRs) for each seed-complementary target were calculated from their probability
of recurrence (Supplementary Data 1–3). We found that ≥3
occurrences of an AGO-CLIP peak on a given target site
corresponded to a significant event relative to a random
distribution of clusters (q<0.05 based on binomial P-value,
Supplementary Data 3). AGO-CLIP-defined cluster
localization by mRNA region is largely consistent with previous
reports16,17, with 60% of clusters mapping to the 3′-UTR, 24.7%
of clusters mapping to the coding region, 8.2% mapping to
the 5′-UTR and 7% mapping to ncRNAs (Fig. 2d).
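The recurrence statistics described above can be illustrated with a short sketch. It assumes that, under the permuted background, each of the 14 data sets independently places a cluster on a given seed site with some probability p, so that the chance of k or more occurrences is a binomial upper tail. The value of p below is hypothetical, and the Benjamini-Hochberg procedure is used only as a familiar stand-in, since the text does not state which FDR method was applied.

```python
from math import comb


def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted q-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


N_DATASETS = 14      # number of AGO-CLIP data sets in the atlas
P_BACKGROUND = 0.02  # hypothetical per-data-set chance of a random cluster

for k in range(1, 5):
    print(k, binomial_tail(k, N_DATASETS, P_BACKGROUND))
```

The tail probability falls steeply with each additional occurrence, which is the intuition behind treating recurrence across data sets as evidence of a genuine binding site.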
DICER1, MDM2 and the long ncRNA (lncRNA) Xist are
among the top ten most frequently targeted genes in the atlas,
suggesting that the microRNA functional roles may include
autoregulation, apoptotic sensitization through TP53 and lncRNA
function (Supplementary Data 2). Importantly, traditional
target prediction algorithms do not predict the high-frequency
interactions on both MDM2 and Xist, demonstrating the
strength of the unbiased AGO-CLIP platform. We also found
numerous interactions between the Argonaute proteins and
lncRNAs, small nucleolar RNAs and transfer RNAs in the AGO-CLIP atlas. These findings are consistent with recent evidence
suggesting that ncRNAs are microRNA targets or Argonaute
binding substrates25. Discovery of these interactions supports a
growing consensus that microRNA function extends beyond the
regulation of protein-coding genes.
[Figure 1 graphic: stacked bar charts of each microRNA's share of total microRNA expression (y axis 0–100%). Sample labels: Normal bladder, Normal breast, Normal H&N, Normal kidney, Normal lung (average LUAD), Normal lung (average LUSC), Normal uterus; BLCA, BRCA, HNSC, KIRC, LUSC, LUAD, UCEC, COAD, LAML, OV, READ. Legend: hsa-mir-143, hsa-mir-10b, hsa-mir-21, hsa-mir-22, hsa-mir-30a, hsa-mir-10a, hsa-mir-99b, hsa-mir-30a-star, hsa-mir-148a, hsa-mir-203, hsa-let-7b, hsa-mir-101-1, hsa-let-7a-2, hsa-mir-100, hsa-let-7a-3, hsa-mir-200c, hsa-mir-103-1, hsa-let-7f-2, hsa-mir-30d, hsa-mir-30e-star, hsa-mir-92a-2, hsa-mir-29a, hsa-let-7a-1, hsa-mir-375, hsa-mir-25, hsa-mir-30e, hsa-mir-182, hsa-let-7c, hsa-mir-28, hsa-mir-145, Other.]
Figure 1 | The landscape of microRNA expression in the TCGA pan-cancer data set. (a) Thirty microRNAs constitute 90% of microRNA expression
across all normal tissues. (b) Global microRNA expression change occurring in tumours is due principally to loss of miR-143 expression and gain
of miR-21 expression. MicroRNAs represented in columns from bottom to top listed left to right by row legend.
Analysis of TCGA microRNA expression data. To define which
microRNAs consistently change relative to matched normal tissue, we performed significance testing on tumour versus normal
microRNA expression levels across samples. Inspection of the
TCGA microRNA data set revealed that significance testing
(Fisher’s exact test) between tumour and normal samples on the
raw reads per million values generated by high-level processed
TCGA data produced significantly more increased microRNAs
than decreased microRNAs (Supplementary Fig. S1A). The
reason for this differential significance level between increased
and decreased microRNAs is due to loss of highly expressed
microRNAs in tumour samples, especially loss of miR-143, which
accounts for 35–70% of microRNA expression in normal tissue.
miR-143 expression often decreases by >50% in tumours
(Fig. 1b). As microRNA expression in sequenced samples is
expressed as a population value (reads per million microRNAs
mapped), and because total RNA between tumour and normal
samples used in sequencing experiments is constant, the loss of
miR-143 leads to reciprocal gains in the proportion of other
microRNAs. This relationship is demonstrated by the strong
association of the absolute number of significantly increased
microRNAs and the loss of miR-143 (R² = −0.86, P = 0.01,
Supplementary Fig. S1B). We corrected for these composition
differences using upper quartile and trimmed median of M-values
to normalize the data set26. These methods are designed to
compensate for differences between tissues (for example,
comparing the liver and the kidney) and are thus useful when
comparing tumour versus normal tissue microRNA values, because
they compensate for artefact generated by large expression changes
in the most prevalent microRNAs (Supplementary Fig. S1A–D;
Supplementary Data 4 contains detailed significance calculations
for microRNAs in all tumour types).
[Figure 2 graphic: schematic of the RISC complex (AGO, GW182, other proteins) with miRNA bound to an mRNA 3′-UTR; ultraviolet crosslink; pull-down of Argonaute with bound RNA; fragment RNA; purify and sequence bound RNA (bound miRNA, bound target). Read group annotation: AGO-PAR-CLIP data defines physical RNA–protein interaction through transition, which occurs at ultraviolet crosslinks. Clusters are defined at transitions based on a kernel density algorithm. Seed motifs are known microRNA complements inferred from clusters. Data set 1, Data set 2 and Data set 3 feed the target atlas. Bar charts: fraction of total clusters (0–0.7) for 5′-UTR, CDS, 3′-UTR and NC; clusters per kilobase (0–9) for all clusters and for 5′-UTR, CDS, 3′-UTR and NC cluster density.]
Figure 2 | Generation of the AGO-CLIP microRNA target atlas. (a) Model of AGO-CLIP-mediated purification of bound microRNAs and target.
(b) Example of PAR-CLIP-defined microRNA binding sites. De novo seed identification in this data set requires PAR-CLIP reads. Supplementary identification
of recurrent binding sites is allowed for HITS-CLIP data. (c) Construction of seed atlas from multiple data sets emphasizes seeds recurrent across multiple
data sets to define high-confidence microRNA active sites. (d) MicroRNA clusters are most frequently mapped to the 3′-UTR region, consistent
with previous observations. 3′-UTR, 3′-untranslated region; 5′-UTR, 5′-untranslated region; CDS, coding sequence; NC, non-coding RNA. Error bars
represent s.e.m. Data are taken from a compilation of 11 AGO-PAR-CLIP libraries used in this study and defined using the 12,449 UCSC known genes with at
least one AGO-CLIP cluster mapping to them.
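The composition effect described in this section, in which loss of one dominant microRNA inflates the reads-per-million values of all others, can be reproduced with a toy example. The abundances below are hypothetical, and only upper-quartile scaling is shown; it is a simplified sketch of the idea, not the normalization pipeline used in the study.

```python
from statistics import quantiles

# Hypothetical absolute abundances: only miR-143 differs between samples.
normal = {"miR-143": 500, "miR-21": 100, "miR-22": 80, "miR-10b": 60, "miR-30a": 40}
tumour = {**normal, "miR-143": 200}


def reads_per_million(counts):
    """Express each microRNA as a share of all mapped microRNA reads."""
    total = sum(counts.values())
    return {name: 1e6 * value / total for name, value in counts.items()}


def upper_quartile_scaled(counts):
    """Scale each value by the sample's 75th percentile instead of its total."""
    q3 = quantiles(counts.values(), n=4, method="inclusive")[2]
    return {name: value / q3 for name, value in counts.items()}


for name in ("miR-21", "miR-22"):
    rpm_ratio = reads_per_million(tumour)[name] / reads_per_million(normal)[name]
    uq_ratio = upper_quartile_scaled(tumour)[name] / upper_quartile_scaled(normal)[name]
    print(name, round(rpm_ratio, 3), round(uq_ratio, 3))
```

In this toy case the unchanged microRNAs appear 780/480 = 1.625-fold higher in the tumour by reads per million, while the upper-quartile-scaled ratio stays at 1. The correction works here only because the 75th-percentile microRNA is itself unchanged, which is the assumption such scaling methods rely on.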
Definition of microRNA–target interactions. We next defined
pan-cancer oncomiRs and miR suppressors (tumour-suppressing
microRNAs) based on consistent expression changes across
cancer types. Pan-cancer oncomiRs were defined by significant
expression gain (q<0.05, Fisher’s exact test) in at least six out of
seven pan-cancer tumour types containing tumour versus normal
microRNA-sequencing data. Pan-cancer miR suppressors were
similarly defined by significant expression loss in at least six of
seven tumour types (Fig. 3a; Supplementary Data 5 contains
detailed pan-cancer microRNA selection data). To ensure we
were observing target interactions with highly expressed microRNAs that had many conserved target sites in the 3′-UTR, we
focused on the dominant arms of the 87 broadly conserved
microRNA families with an Argonaute-bound read group corresponding to the microRNA in at least 3 of the 14 AGO-CLIP
data sets in our analysis.
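The selection rule for pan-cancer oncomiRs and miR suppressors can be written as a small classifier. The sketch below assumes each tumour type supplies a log2 fold change and a q-value for the microRNA; the function name, data layout and example numbers are hypothetical.

```python
def classify_pan_cancer(results, q_cutoff=0.05, min_types=6):
    """Classify one microRNA from per-tumour-type test results.

    results maps tumour type -> (log2 fold change tumour vs normal, q-value).
    """
    gains = sum(1 for lfc, q in results.values() if q < q_cutoff and lfc > 0)
    losses = sum(1 for lfc, q in results.values() if q < q_cutoff and lfc < 0)
    if gains >= min_types:
        return "oncomiR"
    if losses >= min_types:
        return "miR suppressor"
    return "neither"


# Hypothetical results across the seven tumour types with normal samples.
example = {
    "BLCA": (2.4, 0.001), "BRCA": (1.9, 0.004), "HNSC": (2.1, 0.002),
    "KIRC": (0.3, 0.400), "LUAD": (2.8, 0.001), "LUSC": (3.0, 0.001),
    "UCEC": (1.5, 0.010),
}
print(classify_pan_cancer(example))  # oncomiR
```

Requiring agreement in six of seven tumour types tolerates one discordant tissue while still demanding a consistent direction of change.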
We examined interactions between putative pan-cancer
oncomiRs or miR suppressors, and their driver targets, based
on the assumption that pan-cancer oncomiRs are enriched for
TS targets and pan-cancer miR suppressors are enriched
for oncogenic targets. We then performed integrative analysis
of pan-cancer TS and oncogenes (OCs) by using available
pan-cancer data, including exome-sequencing single-nucleotide
variation (SNV) scores (MuSiC27,28 and MSKCC29 algorithms),
CNV analysis (GISTIC30,31 algorithm) and mRNA expression
changes as building blocks for integrated gene nomination
(Fig. 3b). We generated a continuous scale of relevant pan-cancer genes that describes putative TS as increasingly negative
values and putative OCs as increasingly positive values, based on
SNV, CNV and mRNA expression changes across TCGA tumour
types (Supplementary Data 6 and Methods).
We tested four methods for calling microRNA–target interactions including the following: using all AGO-CLIP-defined
binding sites without considering site conservation (for example,
TargetScan); using only AGO-CLIP-defined sites with ≥3
occurrences (corresponding to a significant peak based on
random permutation) without considering TargetScan; using
TargetScan-only binding sites (without considering AGO-CLIP
data); and, finally, combining AGO-CLIP-defined target sites
with ≥3 occurrences, or ≥1 occurrences and a TargetScan call.
We found that combining AGO-CLIP and TargetScan results
(final method) was the only method that produced enrichments
of TS targets for pan-cancer oncomiRs and OC targets for pan-cancer miR suppressors (Fig. 3c, Supplementary Fig. S2A–D).
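The four calling strategies compared above reduce to simple predicates on two inputs per candidate site: the number of AGO-CLIP data sets in which the site occurs and whether TargetScan also predicts it. The function names below are hypothetical and the sketch ignores how those two inputs are computed.

```python
def call_clip_all(occurrences: int, targetscan: bool) -> bool:
    """Any AGO-CLIP-defined site, regardless of conservation."""
    return occurrences >= 1


def call_clip_recurrent(occurrences: int, targetscan: bool) -> bool:
    """AGO-CLIP sites seen in at least three data sets."""
    return occurrences >= 3


def call_targetscan_only(occurrences: int, targetscan: bool) -> bool:
    """TargetScan prediction, ignoring AGO-CLIP evidence."""
    return targetscan


def call_combined(occurrences: int, targetscan: bool) -> bool:
    """At least three occurrences, or at least one plus a TargetScan call."""
    return occurrences >= 3 or (occurrences >= 1 and targetscan)


# A site seen once in AGO-CLIP and also predicted by TargetScan.
print(call_combined(1, True), call_clip_recurrent(1, True))  # True False
```

The combined rule admits weakly recurrent sites only when conservation supports them, and admits unconserved sites only when the experimental evidence is strong.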
To gain insight into the differing enrichments, we explored
the target spectrums of AGO-CLIP-defined target sites and
TargetScan-defined target sites for the selected pan-cancer
microRNAs. We found that on average 25.6% of TargetScan
targets are also nominated by AGO-CLIP. Reciprocally, TargetScan also nominates 31.47% of AGO-CLIP targets. In total, 74.4%
of all TargetScan targets were not called by AGO-CLIP, and
TargetScan did not call 68.53% of AGO-CLIP targets. In the
case of AGO-CLIP, 34.61% of all targets were outside the
3′-UTR (coding region or 5′-UTR), leaving another 34.39% of all
AGO-CLIP-defined targets within the 3′-UTR but not called by
TargetScan (results are summarized in Supplementary Table S3).
[Figure 3 graphic: heat map of log2 expression change per tumour type (colour scale –2 to 2) with tumour-type labels BLCA, BRCA, HNSC, KIRC, LUAD, LUSC and UCEC and rows by miRNA_ID, listed in order: hsa-mir-183, hsa-mir-182, hsa-mir-96, hsa-mir-210, hsa-mir-425, hsa-mir-130b, hsa-mir-18a, hsa-mir-93, hsa-mir-17, hsa-mir-106a, hsa-mir-135b, hsa-mir-301b, hsa-mir-192, hsa-mir-142, hsa-mir-301a, hsa-mir-19a, hsa-mir-33b, hsa-mir-590, hsa-mir-196a-1, hsa-mir-7-1, hsa-mir-21, hsa-mir-455, hsa-mir-139, hsa-mir-101-1, hsa-mir-140, hsa-mir-143, hsa-mir-145, hsa-mir-27b, hsa-mir-100, hsa-mir-99a, hsa-mir-26a-2, hsa-mir-1-2, hsa-mir-133a-2, hsa-let-7c, hsa-mir-125b-1, hsa-mir-29a, hsa-mir-195, hsa-mir-10b, hsa-mir-101-2, hsa-mir-125a, hsa-mir-204, hsa-let-7a-2, hsa-let-7a-1, hsa-let-7a-3, hsa-mir-26b, hsa-let-7b, hsa-mir-451. Schematic linking microRNA pan-tumour expression change, TargetScan, AGO-CLIP, gene SNV (MuSiC, MSKCC), gene CNV (GISTIC) and mRNA expression change. Bar chart of log2 (average tumour suppressor targets/oncogene targets), y axis –0.6 to 0.8, for the top 100, 250, 500, 1,000, 1,500, 2,000, 2,500 and 3,000 genes; positive region labelled 'Tumour suppressor targets enrichment' and 'More tumour suppressing targets', negative region labelled 'Oncogene target enrichment' and 'More oncogenic targets'; legend: OncomiR, miR-suppressor.]
Figure 3 | Determination of pan-cancer microRNAs and their targets. (a) List of broadly conserved pan-cancer oncomiRs and miR suppressors reveals
microRNAs undergoing consistent expression changes across the pan-cancer data set. (b) A dual nomination strategy uses AGO-CLIP microRNA
target definitions to associate pan-cancer microRNAs to respective TS and OC targets to define relevant pan-tumour microRNA–target relationships. (c) TS
target versus oncomiR target enrichments for pan-cancer oncomiRs (blue bars) and miR suppressors (red bars) across the top 100–3,000 (~10% of
genes) TS and OCs. This graph represents the enrichments and significance levels of those enrichments of both pan-cancer oncomiRs and miR
suppressors. Individual bars in the graph represent the log2-based per cent TS over per cent OC targets. Positive bars demonstrate average total
enrichment for TS. Negative bars demonstrate average enrichment for OCs. An asterisk defines significant enrichments. Red box highlights enrichment of
pan-cancer oncomiRs with their top 250 targets used in subsequent analysis. Student’s t-test, *P<0.05, **P<0.005. N-values for enrichment reflect
the total number of pan-cancer oncomiRs (n = 22) and pan-cancer miR suppressors (n = 25).
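The enrichment measure plotted in Fig. 3c, the log2 ratio of the percentage of TS genes targeted to the percentage of OC genes targeted, can be illustrated as follows. The function name and the counts are hypothetical, and the sketch computes the ratio for a single microRNA, whereas the figure reports averages over all pan-cancer oncomiRs or miR suppressors.

```python
from math import log2


def log2_ts_over_oc(ts_hits: int, ts_total: int, oc_hits: int, oc_total: int) -> float:
    """log2 of (per cent of TS genes targeted / per cent of OC genes targeted).

    Positive values indicate more tumour-suppressor targets, negative values
    more oncogene targets. Raises an error if either hit count is zero.
    """
    pct_ts = 100.0 * ts_hits / ts_total
    pct_oc = 100.0 * oc_hits / oc_total
    return log2(pct_ts / pct_oc)


# Hypothetical microRNA targeting 30 of the top 250 TS and 15 of the top 250 OCs.
print(log2_ts_over_oc(30, 250, 15, 250))  # 1.0
```

On this scale a value of 1 means twice as large a share of tumour suppressors as oncogenes among the targets, and a value of –1 means the reverse.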
Calling interactions using the AGO-CLIP atlas alone produced
bias towards enrichment of OCs (Supplementary Fig. S2A,B),
whereas TargetScan alone produced bias towards TS
(Supplementary Fig. S2C). The AGO-CLIP data set produces
slight bias towards OCs, because the top 3,000 OCs have 31%
more AGO-CLIP clusters binding them than the top 3,000 TS.
This observation may reflect overexpression of these OCs in the
cell lines used to perform the AGO-CLIP analyses, or it may
reflect greater microRNA binding of OCs in general, which is
consistent with the tumour-suppressive function of many
microRNAs4–7. Notably, most of the targeting discrepancy
between TS and OCs is due to microRNA binding in the
coding region, with OCs having 66% more AGO-CLIP clusters
than TS (Supplementary Fig. S3A). TargetScan cannot predict
microRNA binding in coding regions of genes.
TargetScan may produce bias towards TS, because the top
3,000 TS have 40% larger 3′-UTR lengths than the top 3,000 OCs
(Supplementary Fig. S3B). The relative size of the 3′-UTR directly
determines the total number of predicted microRNA target sites
associated with that 3′-UTR, suggesting that TS undergo greater
3′-UTR-mediated cis-regulation in general. As many cell lines are
rapidly growing or are oncogenic, the relative lack of AGO-CLIP
clusters on the TS may reflect the cellular context of the AGO-CLIP experiments wherein these genes could have reduced
expression owing to culturing conditions, rather than representing a general phenomenon.
Ultimately, determining microRNA–target interactions using
AGO-CLIP-defined target sites with ≥3 occurrences, or ≥1
occurrences and a TargetScan call, was the only method to
generate expected enrichments in TS targets for oncomiRs and
OC targets for miR suppressors. This method had the added
utility of combining target site conservation with genomic
experimental validation. Discovering expected enrichments when
combining TargetScan and AGO-CLIP values may suggest that
combined microRNA target calling yields improved accuracy
over a single method by reducing the false negatives and false
positives inherent in each technology. As such, we chose to define
a microRNA–target interaction as ≥3 AGO-CLIP-defined
occurrences, or ≥1 occurrences and a TargetScan call, for subsequent
analysis of tumour-driving microRNA–target interactions.
Identification of a pan-cancer oncomiR network. We observed
that the strongest overall microRNA–target enrichment was for
oncomiRs targeting the top 250 ranking TS (Fig. 3c, red box).
Many of these interactions involved cotargeting of multiple
microRNAs on the same TS. As many top TS targets were
cotargeted by the pan-cancer oncomiRs, we analysed seed
sequences of pan-cancer oncomiRs to determine whether any
core sequences were common to the individual oncomiR seeds.
Intriguingly, 10 of 22 (45.4%, Fig. 4a) pan-cancer oncomiRs in
our analysis share similar sequence homology in their seed region
that aligns around a central GUGC motif defining a microRNA
seed ‘superfamily’. GUGC motifs occur in 36 of 187 (ref. 13)
microRNAs from broadly conserved seed families (5.19%),
enriching seed families with a GUGC motif in their seed region
8.7-fold among pan-cancer oncomiRs compared with all broadly
conserved microRNAs (P = 0.008, Wilcoxon rank-sum test,
Fig. 4a). None of the 25 pan-tumour miR suppressors identified
in our analysis contain a GUGC motif, making this motif significantly depleted among the identified miR suppressors
(P = 0.016, Wilcoxon rank-sum test). This motif is also enriched
when including all microRNAs meeting our significance threshold, and not just the microRNAs from broadly conserved families
(P = 5E−10, Wilcoxon rank-sum test; the GUGC motif is present in 4.6% of all dominant-arm miRBase microRNAs and 26%
of all microRNAs significantly increased in 6/7 TCGA tumours).
Several superfamily microRNAs (miR-17/106a, miR-210, miR-130b/301ab and miR-93/105) derive from the same seed family
(Fig. 4a). The miR-93/105 and miR-17/106a seed families have
virtually the same seeds, leading to highly similar predicted target
spectrums. miR-17, miR-106a, miR-18a and miR-19a are part of
the well-described miR-17~92 oncogenic cluster, also known as
oncomiR-1 (refs 9,10,32). MicroRNA seed similarity in the pan-cancer oncomiRs led us to hypothesize that these microRNAs
may undergo coordinate regulation to mutually cotarget and
suppress critical TS.
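
The motif scan and the fold enrichment quoted above can be illustrated with a short sketch. The seed sequences listed are supplied here as illustrative inputs (nucleotides 2-8 of the mature microRNA) and are assumptions of this example rather than values taken from the text; the enrichment uses only the percentages stated above.

```python
def has_motif(seed: str, motif: str = "GUGC") -> bool:
    """Return True if the seed contains the motif (DNA input is converted to RNA)."""
    return motif in seed.upper().replace("T", "U")


def fold_enrichment(hits: int, total: int, background_fraction: float) -> float:
    """Fraction of hits in the test set divided by the background fraction."""
    return (hits / total) / background_fraction


# Illustrative seeds only (assumed for this example).
example_seeds = {
    "miR-17": "AAAGUGC",
    "miR-19a": "GUGCAAA",
    "miR-130b": "AGUGCAA",
    "let-7a": "GAGGUAG",
}
with_motif = sorted(name for name, seed in example_seeds.items() if has_motif(seed))

# 10 of 22 pan-cancer oncomiRs versus a 5.19% background, as stated in the text.
enrichment = fold_enrichment(10, 22, 0.0519)
```

With these inputs the ratio falls between 8.7 and 8.8, in line with the 8.7-fold enrichment reported.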
To test this hypothesis, we defined the target spectrum of the
pan-cancer microRNA superfamily (Fig. 4b,c). We observed
oncogenic microRNA superfamily cotargeting of high-ranking
pan-cancer TS such as SMAD4, ZBTB4 and TGFBR2, often at a
single complementary seed-target site we termed a microRNA
‘super-seed’ target, where multiple families of microRNAs bind
and regulate a specific 3′-UTR (Fig. 4b). In most superfamily
oncomiRs, the majority of targets predicted in the top 3,000 TS
are shared with at least one other superfamily member, including
70.2% of miR-19 targets, 79.3% of miR-130/301ab targets, 39.2%
of miR-17/106a/93 targets, 42.3% of miR-18a targets, 75.7% of
miR-455 targets and 62.5% of miR-210 targets (Fig. 4c).
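
The shared-target percentages above are fractions of one family's targets that are also targeted by at least one other family. A minimal sketch of that calculation follows; the family and gene labels are placeholders invented for the example, not data from the study.

```python
def shared_fraction(family: str, targets_by_family: dict) -> float:
    """Fraction of a family's targets also targeted by at least one other family."""
    own = targets_by_family[family]
    if not own:
        return 0.0
    others = set().union(
        *(targets for name, targets in targets_by_family.items() if name != family)
    )
    return len(own & others) / len(own)


# Placeholder target sets (hypothetical).
toy_targets = {
    "famA": {"G1", "G2", "G3", "G4"},
    "famB": {"G1", "G2", "G5"},
    "famC": {"G3", "G6"},
}
shared = {family: shared_fraction(family, toy_targets) for family in toy_targets}
```

In the toy data, three of the four `famA` targets are shared with another family, giving 0.75.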
The entire miR-17-19-130-93-455-18-210 superfamily of pan-tumour oncomiRs identified in this analysis forms three separate
super-seed target sites; one consisting of the miR-17, miR-19 and
miR-130 families, one consisting of the miR-18 and miR-19
families, and one consisting of the miR-19 and miR-455 families
(Fig. 4b). The miR-17, miR-19 and miR-130 seed families
exhibited the majority of total TS cotargeting on high-ranking TS
in our data set. We thus focused further studies on tumour
regulation by this subset of oncomiRs.
To test possible coregulation of the miR-17, miR-19 and
miR-130 families, we correlated the expression levels of family
members across TCGA tumours and found strong positive
correlation of these microRNAs (average miR-17-19-130 family
member microRNA–microRNA correlate across 11 tumour types,
R² = 0.33, P < 1E−200 versus null distribution, Student’s t-test,
Fig. 4d). These data suggest that the miR-17-19-130 superfamily
members undergo coordinate regulation in tumours to mediate
silencing of TS genes in a synergistic manner.
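
The coexpression measure above is an average over pairwise microRNA–microRNA correlates. The sketch below shows one way to compute a mean pairwise R² from expression vectors; the function names and the toy values are assumptions, and the study's exact averaging across tumour types is not reproduced here.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_pairwise_r2(expression: dict) -> float:
    """Average squared Pearson correlation over all pairs of microRNAs."""
    pairs = list(combinations(sorted(expression), 2))
    return sum(pearson(expression[a], expression[b]) ** 2 for a, b in pairs) / len(pairs)


# Toy expression vectors across four samples (hypothetical values).
toy_expression = {
    "mirA": [1.0, 2.0, 3.0, 4.0],
    "mirB": [2.0, 4.0, 6.0, 8.0],
    "mirC": [4.0, 3.0, 2.0, 1.0],
}
toy_r2 = mean_pairwise_r2(toy_expression)
```

Squaring the correlate means that perfectly anticorrelated pairs also score 1, so the sign of each correlate should be inspected separately when the direction of coregulation matters.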
To demonstrate the potential for cotargeting of TS by the miR-17-19-130 superfamily members, we determined pan-cancer
correlates for all microRNA–target interactions in the top 250-ranked TS. Pan-cancer correlation of high-ranking TS targeted by
the miR-17-19-130 superfamily revealed strong negative correlation of the superfamily with many pan-cancer TS, including
PTEN, ZBTB4 and TGFBR2, across all tumour types. Figure 5a
shows correlations with the four highest-ranked TS
(PTEN, TGFBR2, ZBTB4 and SMAD4) that are targeted by all
three seed families. PTEN, TGFBR2 and ZBTB4 are all significantly
negatively correlated with the miR-17-19-130 family members
versus a null distribution of random microRNA–mRNA
correlates (P < 1E−15 for each, Student’s t-test). SMAD4
positively correlates with the superfamily in BLCA (P < 1E−10,
Student’s t-test), but otherwise shows no significant correlation,
potentially suggesting a role for the microRNAs in translational
repression of this target. A full correlate heat map for the
microRNA–target pairs in the top 250 TS versus all pan-cancer
oncomiRs is provided in Supplementary Data 7; a complementary
heat map for targets of pan-cancer miR suppressors paired with
the top 250 OCs is contained in Supplementary Data 8.
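
Comparing observed correlates against a null distribution can be sketched as a one-sample t statistic. This is one plausible reading of the test described above, not a confirmed reconstruction of it; the function name and all numeric values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(observed, null_values):
    """t statistic for the observed mean against the mean of a null sample."""
    n = len(observed)
    standard_error = stdev(observed) / sqrt(n)
    return (mean(observed) - mean(null_values)) / standard_error


# Hypothetical correlates for one TS against superfamily members, and a
# hypothetical null sample of random microRNA-mRNA correlates.
observed_correlates = [-0.30, -0.25, -0.35, -0.20, -0.28]
null_correlates = [0.02, -0.03, 0.01, 0.00, -0.01, 0.03]
t_stat = one_sample_t(observed_correlates, null_correlates)
```

A strongly negative statistic indicates that the observed correlates sit well below the centre of the null distribution.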
Next, we determined the ability of the miR-17-19-130 family to
suppress translation of the top cotargeted TS, PTEN, ZBTB4,
TGFBR2 and SMAD4, using 3′-UTR–luciferase fusions. We used
miR-17, -19a and -130b as representative members of each seed
family. We found cosuppressive capacity by pan-cancer oncomiRs on all four pan-cancer TS (Fig. 5b,c). In the case of ZBTB4,
PTEN and SMAD4, all miR-17-19-130 superfamily members were
able to bind to the 3′-UTR and significantly repress luciferase
activity. miR-19a did not significantly suppress the TGFBR2
3′-UTR (Fig. 5b).
The SMAD4 gene contains a single miR-17-19-130 super-seed
site that is highly conserved and few potential compensatory sites
able to bind the miR-17, -19 or -130 seeds. As such, we deleted
the central six nucleotides of the SMAD4 super-seed and
measured strong ablation of each microRNA seed family’s ability
to bind and regulate the SMAD4 3′-UTR (Fig. 5c). This finding
illustrates the ability of a single 3′-UTR binding site to undergo
coregulation by multiple microRNA families at a microRNA
‘super-seed’ target site.
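
The super-seed idea can be illustrated by locating overlapping seed complements in a sequence and then deleting the central six nucleotides. The sketch below uses an invented 3′-UTR fragment and seed sequences supplied as assumptions; it is not the SMAD4 sequence or the construct used in the study.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def target_site(seed_rna: str) -> str:
    """DNA-alphabet site complementary to an RNA seed (reverse complement)."""
    reverse_complement = seed_rna.upper().translate(COMPLEMENT)[::-1]
    return reverse_complement.replace("U", "T")


def delete_central(sequence: str, start: int, end: int, width: int = 6) -> str:
    """Remove `width` nucleotides centred within sequence[start:end]."""
    cut = start + ((end - start) - width) // 2
    return sequence[:cut] + sequence[cut + width:]


# Illustrative seeds (assumed) and a hypothetical super-seed site.
SEEDS = {"miR-17": "AAAGUGC", "miR-19": "GUGCAAA", "miR-130": "AGUGCAA"}
SUPER_SEED = "TTTGCACTTT"
utr = "ACGA" + SUPER_SEED + "GGCA"   # invented flanking sequence

start = utr.find(SUPER_SEED)
mutant = delete_central(utr, start, start + len(SUPER_SEED))

bound_wild_type = {name: target_site(seed) in utr for name, seed in SEEDS.items()}
bound_mutant = {name: target_site(seed) in mutant for name, seed in SEEDS.items()}
```

In this toy sequence all three seed complements overlap within ten nucleotides, so a single six-nucleotide deletion removes every match at once.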
PTEN is a conserved TS that regulates the oncogenic phosphoinositide 3-kinase pathway (ref. 33). TGFBR2 and SMAD4 are
tumour-suppressive components of the transforming growth
factor-β (TGFβ) pathway (ref. 34). ZBTB4 is described as a mediator of
the p53 response (ref. 35). Thus, the sum of this analysis suggests that the
pan-cancer oncomiR superfamily consisting of miR-17, miR-19
and miR-130 seed families coordinately targets multiple critical
tumour-suppressing pathways across tumour types (modelled in
Fig. 7). We focused our analysis on the highest-ranking TS targets
defined in an unbiased pan-cancer analysis of microRNA–target
interactions in this study. Many of the described interactions have
been defined previously (refs 9,36–38). However, this study defines these
pathway targets as significant across multiple tumour contexts,
based on an unbiased estimation of microRNA–target interactions
in the largest single data set of human tumours produced to
date. Hundreds of potential, novel interactions between these
microRNAs and other targets are defined in Supplementary Data 7.
[Figure 5 panels: (a) heat map of Pearson correlates (colour scale −0.3 to 0.3) between hsa-mir-17, hsa-mir-106a, hsa-mir-19a, hsa-mir-130b, hsa-mir-301a, hsa-mir-301b and hsa-mir-93 and the TS PTEN, TGFBR2, ZBTB4 and SMAD4, by tumour type; (b) PTEN, TGFBR2 and ZBTB4 3′-UTR luciferase activity/β-galactosidase activity (AU) for Control, miR-17, miR-19a, miR-130b and Mix; (c) SMAD4 3′-UTR luciferase activity/β-galactosidase activity (AU) for SMAD4 and SMAD4-ss-Mut (TCACGT → 6-nucleotide deletion).]
Figure 5 | The miR-17-19-130 pan-cancer oncomiR superfamily binds and suppresses potent pan-cancer suppressor genes. (a) miR-17-19-130
microRNA–mRNA target correlations across tumours reveal strong negative correlation between superfamily members and their top ranked TS targets
TGFBR2, PTEN and ZBTB4, but not SMAD4. (b) The miR-17-19-130 superfamily is able to coordinately bind and suppress expression of TS 3′-UTR–luciferase
reporter constructs, indicating powerful interaction potential. (c) Superfamily cotargeting on the SMAD4 3′-UTR occurs at a novel microRNA
super-seed locus where multiple microRNA seed families can bind, allowing for potential binding of more than individual microRNAs. Mix, an equimolar
mixture of miR-17, -19a and -130b to demonstrate the co-repressive capacity of the oncomiR superfamily as it would exist in the cellular context. *P < 0.05,
**P < 0.005, ***P < 0.0005, Student’s t-test. Luciferase assays were performed twice at 5 nM mimic and twice at 10 nM in quadruplicate. Results were
combined for final analysis (n = 16). Error bars are s.e.m. microRNA–mRNA correlate n-values are as follows: BLCA = 95, BRCA = 794, COAD = 177,
HNSC = 301, KIRC = 466, LAML = 173, LUAD = 313, LUSC = 193, ovarian carcinoma (OV) = 225, READ = 65 and uterine corpus endometrioid carcinoma
(UCEC) = 320. These numbers reflect the total number of TCGA tumour samples that are characterized with both mRNA and microRNA sequencing.
The AGO-CLIP atlas reveals mutations in microRNA targets.
To define additional novel mechanisms of microRNA regulation
in tumours, we next integrated the AGO-CLIP data set with
TCGA mutation data to identify somatic SNVs in microRNA
target sites across tumours. SNV analysis commonly involves
identification of relevant coding-region somatic mutations
through predicted functional impacts of amino acid changes
associated with the nucleotide variations (refs 27,39). This process, however,
remains imperfect. Many missense mutations are not characterized
as contributing significant functional impact on a gene.
Further, large percentages of coding region mutations are silent.
Finally, mutations outside the coding region of genes,
in regulatory regions (5′-UTR and 3′-UTR), outnumber coding-region
mutations. In our analysis of COAD whole-genome
sequencing (WGS) samples, 60% of mutations in coding mRNAs
are in the 3′-UTR, highlighting the potential importance of
mutations in these regions (Fig. 6a).
As disruption of microRNA target sites complementary to the
microRNA seed region directly interferes with microRNA
binding (ref. 40), any mutation in the microRNA target site complementary to the microRNA’s seed is likely to attenuate microRNA
control of that site. As such, analysing mutations intersecting
with microRNA seed-complementary sites has the potential to
greatly expand the search for relevant cancer mutations by
imbuing silent mutations and 3′-UTR mutations with functional
significance.
[Figure 6 panels: (a) fraction of transcriptome mutations in colorectal cancer by class (Splice_site, Silent, Nonsense_mutation, Missense_mutation, In_frame_ins, In_frame_del, Frame_shift_ins, Frame_shift_del, 5′UTR, 3′UTR); (b) miSNP work flow: input (tumour mRNA expression, AGO-CLIP microRNA active sites, microRNA seed matches, somatic mutations), computation (1, define active seeds; 2, define active-seed SNVs; 3, define active-seed SNVs with significantly altered mRNA levels) and output (list of active, mutated microRNA seeds); (c) luciferase activity/β-galactosidase activity (AU) for SKIL, HIST1H3B, ANP32E and FAM114A1 constructs and their mutants, each with Control versus the tested microRNA (miR-17, miR-9, Let-7 or miR-142); (d) mutation types shown: T deletion (3′-UTR), G→C SNV (coding region, silent), G→A SNV (coding region, missense) and T→A SNV (3′-UTR).]
Figure 6 | The miSNP algorithm defines mutations in microRNA seeds. (a) 3′-UTR mutations make up a disproportionate number of all mutations affecting
mRNAs. (b) Work flow of miSNP algorithm, which integrates TCGA mutation and mRNA expression analysis and AGO-CLIP seed nominations to
determine mutations in active microRNA seeds. (c) Validation of selected AGO-CLIP-defined seed SNVs demonstrates the ability for endogenous tumour
mutations to ablate microRNA regulation in a predictable, seed-dependent manner. (d) Mutations reproduced in each 3′-UTR construct matching
endogenous somatic mutations found in the TCGA pan-cancer data as they relate to the cognate microRNA seed. *P < 0.05, **P < 0.005, NS, not
significant; values measured with Student’s t-test. Assays were performed twice at 5 nM mimic and twice at 10 nM in quadruplicate. Results were
combined for final analysis (n = 16). Error bars are s.e.m.
[Figure 7 diagram labels: Upstream activation; miR-17-19-130 pan-cancer oncomiR superfamily; TGFβR2, SMAD4, SKIL/SnoN; PTEN, PI3K/AKT, mTOR; ZBTB4, P53; TGFβ response; P53 response; Proliferation/survival; miR-17 seed. Notes in the figure: miR-17 seed family autoregulation through SKIL-repressor targeting. SKIL 3′-UTR T deletion in the miR-17 seed attenuates autoregulation through disruption of miR-17 seed family binding.]
Figure 7 | A model of pan-cancer suppressor pathway regulation by
the miR-17-19-130 superfamily as defined by AGO-CLIP analysis. The
microRNA-17-19-130 superfamily heavily targets critical TS in multiple
pathways, including the TGFβ pathway, the phosphoinositide 3-kinase/AKT
pathway and the P53 pathway. Additional target site mutation analysis
reveals ablation of a miR-17-mediated negative feedback loop through
mutation of the miR-17 binding site on the SKIL OC 3′-UTR, demonstrating
a novel mechanism of tumour escape from microRNA regulation.
To perform microRNA seed-target mutation analysis, we
developed the miSNP algorithm to integrate AGO-CLIP data
with TCGA-defined cancer mutations. miSNP intersects microRNA seed targets with mutation data and retrieves mRNA
expression changes corresponding to mutations in these active
sites (Fig. 6b), providing a powerful method to examine interactions between features and search for subsequent changes in
mRNA associated with the interaction. Using the miSNP
algorithm, we defined thousands of putative microRNA target-site mutations (Supplementary Data 9). The majority of
TCGA pan-cancer SNV data derives from exome sequencing
focused solely on coding-region sites. Therefore, the majority of
microRNA target mutations we define from the 12 pan-cancer
tumours occur specifically in the coding region. Importantly, it is
only because we utilize the AGO-CLIP technology that we are
able to define these microRNA target-site mutations at all. In
addition, KIRC exome sequencing has a small number of 3′-UTR
mutations annotated and we incorporated 36 COAD WGS
samples that contain full 3′-UTR SNV annotations.
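
The intersection step described for miSNP can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: the `SeedSite` and `Snv` records, the half-open coordinate convention, the function names and all example values are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class SeedSite:
    gene: str
    chrom: str
    start: int          # inclusive
    end: int            # exclusive
    mirna_family: str


@dataclass(frozen=True)
class Snv:
    sample: str
    chrom: str
    pos: int


def seed_snvs(sites, snvs):
    """Pair each SNV with every active seed site that contains its position."""
    return [
        (site, snv)
        for snv in snvs
        for site in sites
        if site.chrom == snv.chrom and site.start <= snv.pos < site.end
    ]


def expression_shift(gene, mutated_samples, expression):
    """Mean expression in mutated samples minus mean expression in the rest."""
    values = expression[gene]
    mutated = [v for s, v in values.items() if s in mutated_samples]
    others = [v for s, v in values.items() if s not in mutated_samples]
    return mean(mutated) - mean(others)


# Hypothetical inputs.
sites = [SeedSite("GENE_X", "chr1", 100, 107, "famA")]
snvs = [Snv("s1", "chr1", 103), Snv("s2", "chr1", 200), Snv("s3", "chr2", 103)]
expression = {"GENE_X": {"s1": 8.0, "s2": 5.0, "s3": 6.0}}

hits = seed_snvs(sites, snvs)
shift = expression_shift("GENE_X", {snv.sample for _, snv in hits}, expression)
```

The nested scan is quadratic in the number of sites and SNVs; for genome-scale inputs an interval index or a sorted sweep would be the usual substitute.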
To demonstrate the ability of the AGO-CLIP technology and
the miSNP algorithm to detect relevant microRNA target-site
mutations, we selected six binding site mutations for experimental validation (Supplementary Data 10). These sites were
chosen based on the number of times the specific binding site was
identified in the AGO-CLIP atlas, the location of the seed in the
coding region or 3′-UTR, whether TargetScan called the site as
highly conserved and the relative location of the mutation within
the seed-complementary region of the binding site. This selection
included three mutations in 3′-UTR sequences and three
mutations in coding sequences to capture the diversity of the
miSNP analysis (Supplementary Data 10). Four of six tested
microRNA binding sites with corresponding somatic mutations
demonstrated strong evidence of microRNA binding and
regulation of the selected site, silencing luciferase expression by
>40% in each case and strongly suggesting our analysis identifies
functional microRNA binding sites (Fig. 6c).
For the four target sites with validated target repression, we
reproduced the endogenous tumour mutation in the 3′-UTR–
luciferase construct. The binding site SNVs corresponding to the
miR-17 seed on the SKIL 3′-UTR and the miR-9 seed on the
HIST1H3B 3′-UTR were able to ablate microRNA binding.
Mutations complementary to the Let-7 seed on ANP32E and
miR-142 seed on FAM114A1 variably reduced, but did not ablate,
the repressive ability of the microRNA on the luciferase reporter
(Fig. 6c). The ability of a mutation to ablate microRNA binding
was directly related to the relative position of the mutation within
the region complementary to the microRNA seed. Mutations in
the first or last nucleotide of the seed complement had a reduced
ability to ablate binding relative to mutations near the centre
of the seed complement (Fig. 6d). These observations are
consistent with established concepts of microRNA binding (ref. 40)
and demonstrate the ability to tier the probable functional impact
of microRNA binding site mutations based on their location within
the seed-complementary region of the mRNA target.
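
The positional tiering described above can be expressed as a simple classifier. The tier labels and the half-open coordinate convention in this sketch are assumptions made for illustration.

```python
def tier_seed_mutation(site_start: int, site_end: int, mutation_pos: int) -> str:
    """Classify a mutation by its position within a seed-complementary site.

    Coordinates are half-open: the site covers site_start .. site_end - 1.
    """
    if not site_start <= mutation_pos < site_end:
        return "outside"
    if mutation_pos in (site_start, site_end - 1):
        return "edge"       # first or last nucleotide: weaker expected effect
    return "central"        # near the centre: stronger expected effect
```

For a seven-nucleotide site starting at position 100, positions 100 and 106 fall in the edge tier and positions 101 to 105 in the central tier.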
One microRNA target site mutation validated in our study was
a deletion of a miR-17 seed family binding site in the SKI-like OC
(SKIL/SnoN) 3′-UTR. SKIL is a known OC and was ranked
in the top 7% of pan-cancer OCs in our pan-cancer mRNA driver
index due to expression gain and copy number amplification
(Supplementary Data 6). SKIL oncogenic function is known to
occur through direct repression of the TGFβ signalling pathway,
and part of the TGFβ signalling pathway activation involves
targeting and degradation of the SKIL protein (refs 41–43).
Targeting of the SKIL 3′-UTR by the miR-17 seed family
reveals potential autoregulatory feedback that can attenuate
silencing of the TGFβ pathway by the miR-17-19-130 superfamily.
Mutation of the miR-17 seed-family binding site on the SKIL
3′-UTR may represent a mechanism to allow escape from this
feedback regulation, allowing unregulated SKIL expression while
simultaneously enhancing the oncogenicity of the miR-17 seed
family and creating enhanced suppression of the TGFβ pathway
(Fig. 7). The miR-17 seed-family binding site on the SKIL 3′-UTR
is identified in 8 out of 14 AGO-CLIP data sets, indicating
that the site itself is highly active endogenously. Despite this
strong evidence of microRNA binding in the AGO-CLIP atlas,
the motif-calling algorithms TargetScan, PicTar and miRanda (refs 13,14,44,45)
do not nominate SKIL as a potential target of the miR-17 family,
again highlighting the value of unbiased genome-wide binding
assays as a useful method of determining active microRNA seeds.
Discussion
Combinatorial definition of high-confidence microRNA binding
sites using multiple transcriptome-wide AGO-CLIP data sets
generated clear evidence of endogenous microRNA binding at
specific locations on an mRNA strand. Points of microRNA
interaction tested in this study included multiple microRNA
binding sites, such as the miR-17-SKIL binding site and binding
sites in coding sequences of the mRNA, that are difficult to detect
through other means of microRNA target site prediction.
We found that 45% of broadly conserved pan-cancer oncomiRs
share strong homology in their seed motifs. Seed similarity leads
to redundant cotargeting, and therapeutic suppression of any one
of these microRNAs is likely to face compensation from other
members of the superfamily. The 3′-UTR–luciferase binding
assays and anticorrelates of microRNA–target pairs support the
possibility that these microRNAs redundantly cotarget important
TS across multiple tumour types.
The ability of super-seed target sites to bind multiple members
of the oncogenic superfamily makes them attractive therapeutic
candidates in the future, because they may effectively act as
microRNA ‘sponges’ (ref. 46) that can bind and titrate off multiple
superfamily members to restore normal cellular regulation in
cancer cells by de-repressing critical TS. This therapy may prove
more effective than targeting a single oncomiR family, because it
has the potential to concurrently sequester multiple oncogenic
microRNA seed families to disrupt redundant oncogenic
co-repression of TS.
Using the miSNP algorithm, we identified thousands of
mutations in microRNA binding sites. These mutations were
discovered in the exome-sequencing-defined coding-region
mutations available from the Pan-Cancer project, a small number
of 3′-UTR mutations available from KIRC exome sequencing and
whole-genome 3′-UTR mutations from 36 COAD WGS samples.
AGO-CLIP characterization of microRNA binding in additional
tissue types and integration of additional 3′-UTR mutations from
broader cohorts of WGS samples will improve the yield of
relevant microRNA target site mutations in the future.
In conclusion, we generated a novel resource, the AGO-CLIP
atlas, to integrate experimentally defined microRNA binding sites
with TCGA tumour data, creating a new method and framework
to understand microRNA regulation of cancer. This method
addresses the difficult question of determining accurate genome-wide microRNA–target interactions. Integration of the AGO-CLIP-defined microRNA-binding data with TCGA tumour data
revealed several novel insights into microRNA regulation of
human tumours, including the definition of a pan-cancer
oncomiR superfamily and genome-wide identification of microRNA-binding site mutations.
Methods
Argonaute Crosslinking Immunoprecipitation. AGO-CLIP sequence read
archive (SRR) files corresponding to all publicly available human AGO-CLIP
experiments were downloaded from the NIH sequence read archive (SRR codes:
SRR048973, SRR048974, SRR048975, SRR048976, SRR048977, SRR048978,
SRR048979, SRR048980, SRR048981, SRR359787, SRR189786, SRR189787,
SRR189784, SRR189785, SRR189782, SRR189783, SRR580362, SRR580363,
SRR580352, SRR580353, SRR580354, SRR580355, SRR580359, SRR580360,
SRR580361, SRR580356, SRR580357, SRR343336, SRR343337, SRR343334,
SRR343335, SRR592689, SRR592688, SRR592687, SRR592686, SRR592685; data
were downloaded on 11 January 2013). Files were individually pre-processed using
FASTX-Toolkit and cutadapt to remove adaptor sequences and control for sequence
quality. Data set quality was individually assessed using FastQC. Individual sequencing runs were tiered and grouped, based first on publishing group, then on
cell line used, then on individual treatment and finally on total reads. In this
manner, 14 independent AGO-CLIP data sets were defined (summarized in
Supplementary Data 1). Individual cell lines with corresponding AGO-CLIP
data include: 293T (4/14 experiments), hESC (1/14 experiments), BCBL1
(1/14 experiments), BC-3 (2/14 experiments), BC-1 (1/14 experiments), LCL35
(1/14 experiments), LCL-BAC (1/14 experiments), LCL-BAC-EBV-infected (2/14
experiments) and EF3D (1/14 experiments). Eleven of these data sets were AGO-PAR-CLIP experiments. Three were AGO-HITS-CLIP experiments. Reads from
both experiment types were mapped to hg19 using established Bowtie
parameters (ref. 47).
Both HITS-CLIP and PAR-CLIP methods utilize ultraviolet crosslinking of
RNA-binding proteins to their respective RNA partners, followed by protein
immunoprecipitation and high-throughput sequencing of the bound RNA.
The difference between the two methods lies in the use of photoactivatable
ribonucleoside analogues in PAR-CLIP data sets, which allow experimental
determination of physical interlinkage between protein–RNA pairs through a
mismatch repair defect initialized at crosslinked nucleoside analogues during
complementary DNA synthesis, leading to T-to-C transitions in the generated
cDNA. Of the two, the majority of human data sets are PAR-CLIP generated, and
an established pipeline exists for processing of these data forms (ref. 48). This pipeline
algorithm, termed PARalyzer, uses a kernel-density algorithm centred on crosslinks
to generate putative microRNA target sites. In the PARalyzer algorithm, reads are
first processed into read groups based on total number of reads. T-to-C transitions
in read groups are then used to define clusters based on the kernel-density
algorithm. Cluster sequences then undergo motif analysis for complements of
microRNA seeds to infer the identity of the microRNA binding partner. PAR-CLIP
Bowtie files were processed through the PARalyzer algorithm (ref. 47) to generate clusters
and seeds using established parameters (ref. 47).
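
The general idea of kernel-density cluster calling on crosslink positions can be sketched as below. This is a simplified illustration and not PARalyzer's implementation: the Gaussian kernel, the bandwidth, the fixed threshold and the function names are all assumptions of this example.

```python
from math import exp, pi, sqrt


def kernel_density(positions, length, bandwidth=3.0):
    """Gaussian kernel density of T-to-C conversion positions along a read group."""
    norm = 1.0 / (bandwidth * sqrt(2.0 * pi))
    return [
        sum(norm * exp(-0.5 * ((i - p) / bandwidth) ** 2) for p in positions)
        for i in range(length)
    ]


def call_clusters(density, threshold):
    """Return half-open (start, end) runs where the density meets the threshold."""
    clusters, start = [], None
    for i, value in enumerate(density):
        if value >= threshold and start is None:
            start = i
        elif value < threshold and start is not None:
            clusters.append((start, i))
            start = None
    if start is not None:
        clusters.append((start, len(density)))
    return clusters


# Hypothetical conversion positions: three adjacent events and one isolated event.
density = kernel_density([10, 11, 12, 40], length=60, bandwidth=2.0)
clusters = call_clusters(density, threshold=0.3)
```

With these toy inputs the three adjacent conversions reinforce one another and form a cluster, while the isolated conversion at position 40 stays below the threshold.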
AGO-HITS-CLIP Bowtie files were also processed through PARalyzer and
group data were isolated. AGO-HITS-CLIP data groups were superimposed over
microRNA target sites identified by the PAR-CLIP reads. In this way, AGO-HITS-CLIP
data sets were allowed to support the strength of a target site identified in the PAR-CLIP runs by contributing to site recurrence, but were not allowed to perform
de novo target site identification. Meta-analyses concerning the location of clusters
and their density on various segments of the transcriptome (3′-UTR, coding
sequence and 5′-UTR) were performed using only the 11 AGO-PAR-CLIP
libraries.
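
The support-only role of the HITS-CLIP data can be sketched as a recurrence count in which only PAR-CLIP data sets may introduce a site. Sites are reduced to hashable identifiers here for brevity, which is an assumption of this example; the study superimposed genomic intervals.

```python
def site_recurrence(par_clip_sets, hits_clip_sets):
    """Count data sets supporting each site; HITS-CLIP cannot add new sites."""
    counts = {}
    for dataset in par_clip_sets:
        for site in dataset:
            counts[site] = counts.get(site, 0) + 1
    for dataset in hits_clip_sets:
        for site in dataset:
            if site in counts:          # support existing sites only
                counts[site] += 1
    return counts


# Hypothetical site identifiers.
par_clip = [{"siteA", "siteB"}, {"siteA"}]
hits_clip = [{"siteA", "siteC"}]
recurrence = site_recurrence(par_clip, hits_clip)
```

Here `siteC` is seen only in HITS-CLIP data and is therefore never reported, while `siteA` gains one unit of recurrence from the HITS-CLIP data set.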
A lenient seed inference strategy was used, which included all miRBase seed
families. The purpose of this lenient mapping was to anchor redundant read
clusters to the genome to determine seed-site recurrence across all data sets. Using
this strategy, 99% of 123,752 PAR-CLIP clusters mapping to the UCSC known
gene transcriptome received at least one seed inference, although likely to be at the
expense of false positives in less-expressed microRNAs. From these cluster
sequences, 306,733 microRNA seed-complementary sequences were inferred.
Multiple seed complements may be inferred from a single cluster sequence and
these seeds often overlap a single site. These sites often highlight putative
microRNA super-seed targets, which are readily apparent in AGO-CLIP data. The
identity of the actual binding partner may be one of the complementary
microRNAs, all of them, or may represent a form of non-canonical binding that is
not currently considered in our motif analysis (refs 25,48).
Following target identification, the seed complements of each target were
grouped for recurrence across the 11 PAR-CLIP data sets. To this, 3 HITS-CLIP
read groups were intersected to combine data from 14 total AGO-CLIP sources.
PAR-CLIP clusters and HITS-CLIP groups were then permuted across the
genome 20 times using BEDTools (ref. 49) and analysis was performed to determine
the likelihood of a given target being recurrently identified by chance after
randomization. An FDR was assigned based on binomial P-values established from
the actual measured probability of a seed-complementary site recurring at
random based on random distribution of the target sites across the transcriptome.
We determined that target site recurrence of three or more corresponded to a
Q-value < 0.05.
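
The binomial P-value for site recurrence can be illustrated with the upper tail of a binomial distribution. The per-data-set chance probability used below is a made-up value; in the study it was measured from the permuted data, and the subsequent conversion to Q-values is not shown.

```python
from math import comb


def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Chance of a site recurring in >=3 of 14 data sets, for a hypothetical
# per-data-set chance probability of 0.01.
p_three_or_more = binomial_tail(3, 14, 0.01)
p_one_or_more = binomial_tail(1, 14, 0.01)
```

Raising the recurrence requirement from one to three data sets lowers the chance probability by several orders of magnitude at this assumed rate, which is the intuition behind the recurrence threshold.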
TCGA data acquisition. All TCGA data, except for microRNA expression, were
downloaded from the Synapse archive as part of the TCGA Pan-Cancer project and
correspond to TCGA pan-cancer whitelist files (originally downloaded 25 January
2013). The TCGA Pan-Cancer project consists of 12 available tumour types that
include COAD, READ, LUAD, LUSC, BLCA, BRCA, GBM, UCEC, KIRC, LAML,
OV and HNSC. Some components of available data are currently incomplete, such
as missing normal microRNA-sequencing samples for LAML, ovarian carcinoma,
GBM, COAD and READ.
MicroRNA data were compiled individually from the TCGA data portal to
analyse microRNA isoform data, which were not present on Synapse at the time of
analysis. MicroRNA data were processed directly from the TCGA data portal
isoform files for all whitelist tumours as of 20 November 2012. Multiple reads from
an individual isoform were collapsed into a single read count; the reads per million
microRNAs mapped data form was used, which establishes each microRNA read
count as a fraction of the total microRNA population. MicroRNA-sequencing data
used in this study are summarized in Supplementary Table S1. MicroRNA data
underwent upper-quartile normalization (ref. 50) using the edgeR software package (ref. 51),
which produced the best overall normalization results compared with reads per
million or trimmed mean of M-values (ref. 52) normalization, followed by
determination of significant differences between tumour and normal samples using
a Fisher’s exact test with Bonferroni correction to determine FDRs.
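
The two scalings mentioned above, reads per million and upper-quartile normalization, can be sketched in a few lines. This is a simplified illustration and does not reproduce edgeR's procedure, which differs in detail; the function names and counts are hypothetical.

```python
def reads_per_million(counts: dict) -> dict:
    """Express each microRNA count as reads per million mapped microRNA reads."""
    total = sum(counts.values())
    return {name: count * 1_000_000 / total for name, count in counts.items()}


def upper_quartile(values) -> float:
    """75th percentile by linear interpolation between order statistics."""
    ordered = sorted(values)
    rank = 0.75 * (len(ordered) - 1)
    low = int(rank)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (rank - low) * (ordered[high] - ordered[low])


def upper_quartile_scale(counts: dict) -> dict:
    """Divide each count by the upper quartile of the sample's non-zero counts."""
    quartile = upper_quartile([c for c in counts.values() if c > 0])
    return {name: count / quartile for name, count in counts.items()}


# Hypothetical read counts for one sample.
sample_counts = {"mirA": 100, "mirB": 300, "mirC": 600, "mirD": 0}
rpm = reads_per_million(sample_counts)
scaled = upper_quartile_scale(sample_counts)
```

Scaling by the upper quartile rather than the total makes the result less sensitive to a few very abundant microRNAs that dominate the library.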
As part of the Pan-Cancer project, some processed data were available for
second-line analysis. These data included pan-cancer CNV data processed through
the ABSOLUTE-GISTIC pipeline (ref. 31) and mutation data processed through the
MuSiC suite (ref. 28) and the MSKCC driver analytical pipeline, which were incorporated
into driver gene nominations. A list of data IDs in Synapse is provided in
Supplementary Table S4.
Pan-cancer oncomiR and miR-suppressor selection. Our goal in nominating
pan-cancer oncomiRs and miR suppressors was to determine microRNAs that
change consistently in the same direction across most cancers. Thus, we set a
stringent (q < 0.05, Fisher’s exact test with Bonferroni correction for multiple
testing) threshold comparing tumour versus normal microRNA expression, and
required pan-cancer oncomiRs to have increased expression in six out of seven
tumour types with available tumour versus normal data. Alternately, pan-cancer
miR suppressors were required to have decreased expression in six out of seven
tumour types. Finally, when determining microRNAs to analyse for microRNA–
target interactions, we additionally filtered for dominant isoforms in broadly
conserved microRNA families that have peaks identified in at least 3 of 14 AGO-CLIP data sets. This helped limit false positives in the microRNA–target analysis by
ensuring the interactions we observed consisted of conserved microRNA seeds
derived from microRNAs expressed in the AGO-CLIP cell lines.
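The direction-consistency rule can be sketched as follows. This is an illustration only: it assumes "six out of seven" means at least six, takes per-tumour fold changes and corrected q-values as already computed, and uses hypothetical tumour labels and function names.

```python
def classify_microrna(results, q_cutoff=0.05, min_tumours=6):
    """Classify one microRNA from per-tumour tumour-versus-normal results.

    results maps a tumour type to (log2 fold change, corrected q-value).
    """
    up = sum(1 for lfc, q in results.values() if q < q_cutoff and lfc > 0)
    down = sum(1 for lfc, q in results.values() if q < q_cutoff and lfc < 0)
    if up >= min_tumours:
        return "oncomiR"
    if down >= min_tumours:
        return "miR suppressor"
    return "neither"


if __name__ == "__main__":
    # Hypothetical microRNA: significantly increased in six of seven tumour types.
    example = {f"T{i}": (1.2, 0.001) for i in range(1, 7)}
    example["T7"] = (0.1, 0.8)
    print(classify_microrna(example))
```

The additional filters described above (dominant isoform, broadly conserved family, peaks in at least 3 of 14 AGO-CLIP data sets) would be applied afterwards and are not shown.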
TS and OC definitions. This analysis utilized three external data sets generated for
the purpose of pan-cancer analysis by the TCGA: MuSiC, MSKCC driver target
analysis and GISTIC. The strength of the MuSiC algorithm, developed at Washington
University25, is its ability to quality-control SNV samples, eliminate outliers (such as
certain hypermutated samples) and derive P-values to determine significantly
mutated genes versus the background mutation rate. We thus utilize MuSiC P-values
and mutation frequencies in our analysis. Specific mutations are likely to either
activate or inactivate genes, and definition of these mutation sites in a single gene can
ultimately define that gene as an OC or TS. The MSKCC algorithm29 creates a binary
definition of SNVs that we are able to use to stratify mutated genes as either OCs or
TS based on a functional impact score that weighs the probable impact of mutation at
a specific amino acid residue. Finally, GISTIC30, developed at the Broad Institute, is
an algorithm that controls for sample quality and low-amplitude copy number shifts
in CNV data derived from single-nucleotide polymorphism arrays. We utilized
processed GISTIC data to define CNV log ratios and set CNV thresholds. Our
mRNA q-values were calculated in house and set at a stringent, common threshold
value (q<0.005, Student’s t-test with Bonferroni correction).
A ranking system was developed, which integrates TCGA mRNA, CNV and
mutation data analysed by TCGA data available from the MuSiC, MSKCC driver
target and GISTIC algorithms. This system equally weighted CNV, mRNA
expression change and gene mutations as three orthogonal methods of identifying
TS and OCs across tumours. This method generated a continuous ranked list for
every gene, based on consistent changes across tumours, ranging from more
negative (TS) to more positive (OCs).
For mRNA data, +1 point was given for each of seven tumours with
microRNA-seq data available, in which there was a tumour versus normal
significant increase (Student’s t-test, q<0.005), and −1 point was assigned for
significant decreases in mRNA expression. For CNV data, GISTIC scores for
HUGO-gene-annotated locus copy-number changes were used. We set an
amplification/deletion threshold of 0.3 or −0.3 for each sample. For each whitelist
tumour in which a given gene achieved amplification or deletion in 30% of samples, +0.5
or −0.5 points were awarded accordingly. Finally, for mutation scores, MSKCC and
MuSiC mutation analyses were both integrated. Gene mutations were only
considered based on MuSiC-determined Fisher’s combined P-test FDR q<0.005.
Mutation frequency was multiplied by 100 and then by 1 or −1, based on
MSKCC driver analysis as a TS (×−1) or an OC (×1). Genes nominated as both TS
and OC are negated. Truncating mutations based on MSKCC truncation tabulation
were assigned additional significance, and the fraction of truncations/total
mutations was multiplied by −5 to attribute additional negative value to any gene
with high frequency of truncating mutations. All scores were then summed to
generate a final pan-cancer TS versus OC score. In sum, this analysis generated a
continuous negative-to-positive scale that ranked pan-cancer drivers based on
consistent mRNA, CNV or mutation changes across tumours. These three
values were scaled to have roughly equal weight to place equal emphasis on the
three orthogonal technologies used in the analysis. The scoring equation is
described below:
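$$
\begin{aligned}
\text{Final score} ={}& (\text{mRNA increases}) - (\text{mRNA decreases}) \\
&+ (0.5)(\text{CNV amplification}) - (0.5)(\text{CNV deletion}) \\
&+ (100)(\text{mutation frequency across all tumours})(\pm 1\ \text{MSKCC drivers}) \\
&- (5)(\text{truncation frequency})
\end{aligned}
\tag{1}
$$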
There are six tumours with tumour versus normal mRNA (BLCA, BRCA,
HNSC, KIRC, LUAD and LUSC). All tumours contain CNV data and mutation
data. Mutation frequency must be significant (q<0.005, Fisher’s exact test) to be
considered at all. The MSKCC mutation analysis assigns +1 or −1, based on
whether mutations in a given gene activate or inactivate the gene in question. Final
score is weighted so that each independent technology contributes equally to
overall scoring. Mutation score is highly dominant in several genes such as TP53
with very high mutation frequencies (~50% of all tumours).
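Equation (1) can be written as a small function to make the weighting explicit. This is a sketch under stated assumptions, not the authors' code: the argument names are hypothetical, the mutation and truncation terms are both assumed to be dropped when the MuSiC significance filter fails, and a gene nominated as both TS and OC is assumed to receive a sign of 0.

```python
def driver_score(mrna_up, mrna_down, cnv_amp, cnv_del,
                 mut_freq, mskcc_sign, trunc_frac, mut_significant=True):
    """Pan-cancer TS-versus-OC score following equation (1).

    mrna_up, mrna_down: number of tumour types with a significant mRNA increase or decrease.
    cnv_amp, cnv_del:   number of tumour types passing the amplification or deletion rule.
    mut_freq:           mutation frequency across all tumours (0 to 1).
    mskcc_sign:         +1 for OC, -1 for TS, 0 if nominated as both (assumption).
    trunc_frac:         truncating mutations divided by total mutations (0 to 1).
    """
    score = (mrna_up - mrna_down) + 0.5 * cnv_amp - 0.5 * cnv_del
    if mut_significant:  # assumption: mutation terms count only when q < 0.005
        score += 100 * mut_freq * mskcc_sign - 5 * trunc_frac
    return score


if __name__ == "__main__":
    # Hypothetical tumour suppressor: mutated in half of tumours, one fifth truncating.
    print(driver_score(mrna_up=0, mrna_down=1, cnv_amp=0, cnv_del=2,
                       mut_freq=0.5, mskcc_sign=-1, trunc_frac=0.2))
```

With these hypothetical inputs the mutation term contributes −50 of a total near −53, which illustrates how mutation frequency can dominate the score for a gene such as TP53.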
MicroRNA–target enrichment calculations. To define an optimal method of
determining microRNA–target interactions, we intersected pan-cancer oncomiRs
and miR suppressors with pan-cancer TS versus pan-cancer OCs based on four
different possible approaches that included the following: using all AGO-CLIP-defined binding sites without considering site conservation (for example,
TargetScan), using only AGO-CLIP-defined sites with ≥3 occurrences (corresponding to a significant peak based on random permutation) without considering
TargetScan, TargetScan-only binding sites, and finally by combining AGO-CLIP-defined target sites with ≥3 occurrences, or ≥1 occurrences and a TargetScan call
(Supplementary Fig. S2). Only well-conserved TargetScan calls were considered in
this analysis. To calculate enrichments, the per cent of total targets per microRNA
defined as TS was compared with the per cent of total targets per microRNA defined as
OCs for all genes in the top 100, 250, 500, 1,000, 1,500, 2,000, 2,500 and 3,000 TS
versus OCs (see Supplementary Fig. S2 for all levels of data and Supplementary
Data 6 for complete mRNA driver analysis).
The average per cent of TS versus OC targets was compared for oncomiRs and
miR suppressors using Student’s t-test based on the following equation:
For n OCs or TS ranked in the top 3,000 (Supplementary Data 6), where n = 1–3,000:
$$
\bar{x}\left(\frac{\text{TS targets per microRNA}}{\text{total targets per microRNA}}\right)
\ \text{versus (Student's } t\text{-test)}\
\bar{x}\left(\frac{\text{OC targets per microRNA}}{\text{total targets per microRNA}}\right)
\tag{2}
$$
Cotargeting representation was performed with the Venn Diagram package in R.
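The comparison in equation (2) can be sketched with the standard library alone. The block computes, for each microRNA, the fraction of its targets that fall in the TS and OC lists, and then a pooled-variance Student's t statistic between two groups of fractions. All names and the gene sets are hypothetical. The P-value is omitted because it needs a t-distribution; SciPy's `scipy.stats.ttest_ind` would provide it.

```python
from math import sqrt
from statistics import mean, variance


def target_fractions(targets, ts_genes, oc_genes):
    """Fraction of one microRNA's targets that are ranked TS or ranked OC."""
    total = len(targets)
    return len(targets & ts_genes) / total, len(targets & oc_genes) / total


def student_t(a, b):
    """Two-sample Student's t statistic with pooled variance."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))


if __name__ == "__main__":
    ts_genes = {"PTEN", "TGFBR2", "SMAD4"}   # hypothetical top-n TS list
    oc_genes = {"MYC"}                       # hypothetical top-n OC list
    mir_targets = {
        "miR-a": {"PTEN", "TGFBR2", "MYC", "GENE1"},
        "miR-b": {"SMAD4", "PTEN", "GENE2", "GENE3"},
        "miR-c": {"PTEN", "GENE4", "GENE5", "MYC"},
    }
    fracs = [target_fractions(t, ts_genes, oc_genes) for t in mir_targets.values()]
    ts_frac = [f[0] for f in fracs]
    oc_frac = [f[1] for f in fracs]
    print(student_t(ts_frac, oc_frac))
```

Repeating the calculation for each cutoff (top 100 up to top 3,000) would give one comparison per level of the ranked list.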
MicroRNA pan-cancer correlations. Two sets of correlations were used in this
study. The first was microRNA to microRNA correlation for miR-17-19-130
family members identified as pan-cancer oncomiRs. These correlates consist of a
simple microRNA-to-microRNA Pearson’s R² value. To generate a null distribution, 100 mature, dominant isoform microRNAs were randomly selected and
correlated to all randomly selected microRNAs across tumours. MicroRNAs that
were not expressed in a given tumour were filtered out of the analysis. This
generated a null distribution of random microRNA correlates (average Pearson’s
R² = 0.02, s.d. = 0.11) to which the miR-17-19-130 family correlates were
compared.
Similarly, for microRNA–mRNA targets, 100 random microRNAs were
correlated to 200 random genes. This created a null microRNA–mRNA correlation
(average Pearson’s R² = 0.005, s.d. = 0.10). The combined correlations between
the miR-17-19-130 pan-cancer microRNAs and the PTEN, TGFBR2, SMAD4 and
ZBTB4 genes were compared with this null distribution to establish P-values.
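A null distribution of this kind can be sketched as follows. The text does not specify how P-values were derived from the null, so the z-score comparison here is an assumption, as are the function names and the random-pair sampling scheme. Constant (unexpressed) profiles must be filtered out first, as described above, because their correlation is undefined.

```python
import random
from math import sqrt
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def null_correlations(profiles, n_pairs, seed=0):
    """Correlate randomly chosen pairs of profiles to build a null distribution."""
    rng = random.Random(seed)
    names = sorted(profiles)
    null = []
    for _ in range(n_pairs):
        a, b = rng.sample(names, 2)
        null.append(pearson_r(profiles[a], profiles[b]))
    return null


def z_score(observed, null):
    """Distance of an observed correlation from the null mean, in null s.d. units."""
    return (observed - mean(null)) / stdev(null)


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical expression profiles across 50 samples.
    profiles = {f"mir{i}": [rng.gauss(0, 1) for _ in range(50)] for i in range(100)}
    null = null_correlations(profiles, n_pairs=500)
    print(z_score(0.6, null))
```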
AGO-CLIP SNV intersection. We developed the AGO-CLIP SNV intersection
(miSNP) algorithm and software package to investigate the effects of microRNAs
on tumour samples by integrating exome-sequencing data, AGO PAR-CLIP
microRNA/mRNA binding results and RNA-seq gene expression data across
multiple TCGA data sets. miSNP performs two types of integration. First, by using
only the mutation data and the AGO PAR-CLIP microRNA/mRNA binding sites,
miSNP aggregates and reports at gene level both the microRNA binding sites and
the mutations. Data collected for a particular gene include the individual microRNAs targeting the gene, as well as the types of mutations in the microRNA
binding sites. Next, miSNP incorporates RNA-seq gene expression data to enable a
user to carry out a quantitative analysis of the effects of microRNAs and mutations
on gene expression across a TCGA data set. The algorithm operates on a gene-by-gene basis. It first partitions the tumour samples into those that contain mutations
in microRNA binding sites and those that do not; the algorithm can be customized to consider only
particular mutation types (for example, coding, silent). Next, it reports the gene
expression data for these genes and samples in a tabular format for further analysis.
Finally, it identifies genes for which the expression associates with the mutation
status in microRNA binding sites by comparing the gene expression distributions
of tumour samples with or without common sites using a two-tailed Welch’s t-test;
miSNP reports both the fold-change and the t-test P-value for each gene. miSNP
was developed in Python utilizing the NumPy and SciPy modules. In the current
paper, we analysed all AGO-CLIP-defined microRNA target sites for mutations to
define a global perspective of possible interactions, but selected sites for validation
only from interactions with ≥3 occurrences corresponding to a non-random
event.
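The per-gene comparison step can be illustrated with a minimal sketch. This is not the miSNP source code: the function names, data layout and toy values are hypothetical, and only the Welch t statistic and its Welch–Satterthwaite degrees of freedom are computed. A two-tailed P-value would require a t-distribution, for example via `scipy.stats.ttest_ind` with `equal_var=False`.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def compare_gene(expression, mutated_samples):
    """Partition samples by binding-site mutation status and compare expression.

    expression maps sample ID to the gene's expression value;
    mutated_samples is the set of sample IDs with a mutation in a binding site.
    """
    mut = [v for s, v in expression.items() if s in mutated_samples]
    wt = [v for s, v in expression.items() if s not in mutated_samples]
    fold_change = mean(mut) / mean(wt)
    t, df = welch_t(mut, wt)
    return fold_change, t, df


if __name__ == "__main__":
    expr = {"s1": 2.0, "s2": 4.0, "s3": 6.0, "s4": 8.0, "s5": 10.0, "s6": 12.0}
    print(compare_gene(expr, mutated_samples={"s1", "s2", "s3"}))
```

Each group needs at least two samples for the variance to be defined, so genes with a single mutated sample would have to be skipped or handled separately.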
The miSNP software package (Supplementary Software 1) is open source
(FreeBSD license) and is available for community download at: www.genboree.org/miSNP.
Luciferase assays. The PTEN 3′-UTR–luciferase reporter constructs were purchased from Addgene (Cambridge, MA)26,53. All other constructs were purchased
directly from Switchgear Genomics (Palo Alto, CA) and generated using
pLightSwitch_3′-UTR vectors. MicroRNA MirVana mimics and negative control
mimic were purchased from Ambion (Grand Island, NY). Luciferase assays were
performed using the LightSwitch assay kit according to the SwitchGear LightSwitch
luciferase assay kit protocols. Reporter plasmids and microRNA mimics were
transfected using Lipofectamine 2000 (Life Technologies, Grand Island, NY).
Luciferase activity was normalized to β-galactosidase activity. For coding-region
constructs, synthetic binding assays were generated by placing the coding region
downstream of the luciferase reporter in the UTR. All binding assays were
performed in confluent HEK293T cells for 24 h at 5 and 10 nM concentrations of
microRNA mimic and 40 ng of luciferase and β-galactosidase vector. Experiments
were replicated twice at 5 nM and twice at 10 nM experimental concentrations in
quadruplicate, and results were combined for final statistical analysis. Insert
sequences novel to this study are available in Supplementary Data 11. HEK293T
cells were a kind gift from Dr Weiwen Long.
References
1. Selbach, M. et al. Widespread changes in protein synthesis induced by
microRNAs. Nature 455, 58–63 (2008).
2. Mukherji, S. et al. MicroRNAs can generate thresholds in target gene
expression. Nat. Genet. 43, 854–859 (2011).
3. Guo, H., Ingolia, N. T., Weissman, J. S. & Bartel, D. P. Mammalian microRNAs
predominantly act to decrease target mRNA levels. Nature 466, 835–840
(2010).
4. Lu, J. et al. MicroRNA expression profiles classify human cancers. Nature 435,
834–838 (2005).
5. Martello, G. et al. A microRNA targeting dicer for metastasis control. Cell 141,
1195–1207 (2010).
6. Kumar, M. S., Lu, J., Mercer, K. L., Golub, T. R. & Jacks, T. Impaired
microRNA processing enhances cellular transformation and tumorigenesis.
Nat. Genet. 39, 673–677 (2007).
7. Volinia, S. et al. Reprogramming of miRNA networks in cancer and leukemia.
Genome Res. 20, 589–599 (2010).
8. Darido, C. et al. Targeting of the tumor suppressor GRHL3 by a miR-21-dependent proto-oncogenic network results in PTEN loss and tumorigenesis.
Cancer Cell 20, 635–648 (2011).
9. Olive, V. et al. miR-19 is a key oncogenic component of mir-17-92. Genes Dev.
23, 2839–2849 (2009).
10. Conkrite, K. et al. miR-17~92 cooperates with RB pathway mutations to
promote retinoblastoma. Genes Dev. 25, 1734–1745 (2011).
11. Bartel, D. P. MicroRNAs: target recognition and regulatory functions. Cell 136,
215–233 (2009).
12. Wang, Y. et al. Structure of an argonaute silencing complex with a
seed-containing guide DNA and target RNA duplex. Nature 456, 921–926
(2008).
13. Friedman, R. C., Farh, K. K., Burge, C. B. & Bartel, D. P. Most mammalian
mRNAs are conserved targets of microRNAs. Genome Res. 19, 92–105 (2009).
14. Lewis, B. P., Burge, C. B. & Bartel, D. P. Conserved seed pairing, often flanked
by adenosines, indicates that thousands of human genes are microRNA targets.
Cell 120, 15–20 (2005).
15. Grimson, A. et al. MicroRNA targeting specificity in mammals: determinants
beyond seed pairing. Mol. Cell 27, 91–105 (2007).
16. Chi, S. W., Zang, J. B., Mele, A. & Darnell, R. B. Argonaute HITS-CLIP
decodes microRNA-mRNA interaction maps. Nature 460, 479–486 (2009).
17. Hafner, M. et al. Transcriptome-wide identification of RNA-binding protein
and microRNA target sites by PAR-CLIP. Cell 141, 129–141 (2010).
18. Hafner, M., Lianoglou, S., Tuschl, T. & Betel, D. Genome-wide identification of
miRNA targets by PAR-CLIP. Methods 58, 94–105 (2012).
19. The Cancer Genome Atlas Research Network. The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet. 45, 1113–1120 (2013).
20. Skalsky, R. L. et al. The viral and cellular microRNA targetome in
lymphoblastoid cell lines. PLoS Pathog. 8, e1002484 (2012).
21. Lipchina, I. et al. Genome-wide identification of microRNA targets in human
ES cells reveals a role for miR-302 in modulating BMP response. Genes Dev. 25,
2173–2186 (2011).
22. Gottwein, E. et al. Viral microRNA targetome of KSHV-infected primary
effusion lymphoma cell lines. Cell Host Microbe 10, 515–526 (2011).
23. Kishore, S. et al. A quantitative analysis of CLIP methods for identifying
binding sites of RNA-binding proteins. Nat. Methods 8, 559–564 (2011).
24. Haecker, I. et al. Ago HITS-CLIP expands understanding of Kaposi’s sarcoma-associated herpesvirus miRNA function in primary effusion lymphomas. PLoS
Pathog. 8, e1002884 (2012).
25. Helwak, A., Kudla, G., Dudnakova, T. & Tollervey, D. Mapping the human
miRNA interactome by CLASH reveals frequent noncanonical binding. Cell
153, 654–665 (2013).
26. O’Donnell, K. A., Wentzel, E. A., Zeller, K. I., Dang, C. V. & Mendell, J. T.
c-Myc-regulated microRNAs modulate E2F1 expression. Nature 435, 839–843
(2005).
27. Dees, N. D. et al. MuSiC: identifying mutational significance in cancer
genomes. Genome Res. 22, 1589–1598 (2012).
28. Kandoth, C. et al. Mutational landscape and significance across 12 major cancer
types. Nature 502, 333–339 (2013).
29. Reva, B., Antipin, Y. & Sander, C. Predicting the functional impact of protein
mutations: application to cancer genomics. Nucleic Acids Res. 39, e118 (2011).
30. Mermel, C. H. et al. GISTIC2.0 facilitates sensitive and confident localization of
the targets of focal somatic copy-number alteration in human cancers. Genome
Biol. 12, R41 (2011).
31. Zack, T. I. et al. Pan-cancer patterns of somatic copy number alteration. Nat.
Genet. 45, 1134–1140 (2013).
32. Hong, L. et al. The miR-17-92 cluster of microRNAs confers tumorigenicity by
inhibiting oncogene-induced senescence. Cancer Res. 70, 8547–8557 (2010).
33. Song, M. S., Salmena, L. & Pandolfi, P. P. The functions and regulation of the
PTEN tumour suppressor. Nat. Rev. Mol. Cell Biol. 13, 283–296 (2012).
34. Massague, J. TGFbeta in cancer. Cell 134, 215–230 (2008).
35. Weber, A. et al. Zbtb4 represses transcription of P21CIP1 and controls the
cellular response to p53 activation. EMBO J. 27, 1563–1574 (2008).
36. Li, L., Shi, J. Y., Zhu, G. Q. & Shi, B. MiR-17-92 cluster regulates cell
proliferation and collagen synthesis by targeting TGFB pathway in mouse
palatal mesenchymal cells. J. Cell. Biochem. 113, 1235–1244 (2012).
37. Mestdagh, P. et al. The miR-17-92 microRNA cluster regulates multiple
components of the TGF-beta pathway in neuroblastoma. Mol. Cell 40, 762–773
(2010).
38. Kim, K. et al. Identification of oncogenic microRNA-17-92/ZBTB4/specificity
protein axis in breast cancer. Oncogene 31, 1034–1044 (2012).
39. Banerji, S. et al. Sequence analysis of mutations and translocations across breast
cancer subtypes. Nature 486, 405–409 (2012).
40. Doench, J. G. & Sharp, P. A. Specificity of microRNA target selection in
translational repression. Genes Dev. 18, 504–511 (2004).
41. Solomon, E., Li, H., Duhachek Muggy, S., Syta, E. & Zolkiewska, A. The role of
SnoN in transforming growth factor beta1-induced expression of
metalloprotease-disintegrin ADAM12. J. Biol. Chem. 285, 21969–21977 (2010).
42. Tecalco-Cruz, A. C. et al. Transforming growth factor-beta/SMAD Target gene
SKIL is negatively regulated by the transcriptional cofactor complex SNON-SMAD4. J. Biol. Chem. 287, 26764–26776 (2012).
43. Bonnie, S. et al. TGF-β induces assembly of a Smad2–Smurf2 ubiquitin
ligase complex that targets SnoN for degradation. Nat. Cell Biol. 3, 587–595
(2001).
44. Krek, A. et al. Combinatorial microRNA target predictions. Nat. Genet. 37,
495–500 (2005).
45. Betel, D., Wilson, M., Gabow, A., Marks, D. S. & Sander, C. The microRNA.org
resource: targets and expression. Nucleic Acids Res. 36, D149–D153 (2008).
46. Ebert, M. S., Neilson, J. R. & Sharp, P. A. MicroRNA sponges: competitive
inhibitors of small RNAs in mammalian cells. Nat. Methods 4, 721–726 (2007).
47. Corcoran, D. L. et al. PARalyzer: definition of RNA binding sites from PAR-CLIP short-read sequence data. Genome Biol. 12, R79 (2011).
48. Loeb, G. B. et al. Transcriptome-wide miR-155 binding map reveals widespread
noncanonical microRNA targeting. Mol. Cell 48, 760–770 (2012).
49. Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing
genomic features. Bioinformatics 26, 841–842 (2010).
50. Bullard, J. H., Purdom, E., Hansen, K. D. & Dudoit, S. Evaluation of statistical
methods for normalization and differential expression in mRNA-Seq
experiments. BMC Bioinformatics 11, 94 (2010).
51. Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor
package for differential expression analysis of digital gene expression data.
Bioinformatics 26, 139–140 (2010).
52. Robinson, M. D. & Oshlack, A. A scaling normalization method for differential
expression analysis of RNA-seq data. Genome Biol. 11, R25 (2010).
53. Burkhart, D. L. et al. Regulation of RB transcription in vivo by RB family
members. Mol. Cell Biol. 30, 1729–1745 (2010).
Acknowledgements
We gratefully acknowledge the contributions from the TCGA Research Network and its
TCGA Pan-Cancer Analysis Working Group (contributing consortium members are
listed in the Supplementary Note 1). The TCGA Pan-Cancer Analysis Working Group is
coordinated by J.M. Stuart, C. Sander and I. Shmulevich. This work was supported by the
Caroline Wiess Law Foundation (S.E.M); Dan L. Duncan Cancer Center Scholar Award
(S.E.M); S.E.M. is a member of the Dan L. Duncan Cancer Center supported by the
National Cancer Institute Cancer Center Support Grant P30CA125123. We acknowledge
the joint participation by the Diana Helis Henry Medical Research Foundation through its
direct engagement in the continuous active conduct of medical research in conjunction
with Baylor College of Medicine Baylor Research Advocates for Student Scientists
(BRASS) Foundation (M.P.H); the Robert and Janice McNair Foundation (M.P.H); and
NIH 1K01DK096093 (S.M.H.) with additional funding provided by the Diabetes and
Endocrinology Research Center (P30-DK079638) at Baylor College of Medicine. A special
thanks to Kat Harris and the Switchgear Genomics team for fast, efficient and kind
service. We thank Robb Moses, David Bader and Joel Neilson for editing contributions.
Author contributions
M.P.H. contributed to study design and interpretation, construct design, wet lab
experimentation, AGO-CLIP data set compilation and paper text. K.R. and C.C. helped
in generation of miSNP algorithm and paper text. S.M.H. contributed to study design
and paper editing. P.H.G., R.A.G. and D.A.W. contributed to study design. C.K., M.D.M.
and L.D. contributed to generation of MuSiC pan-cancer analysis. B.R. contributed to
generation of MSKCC pan-cancer functional driver predictions. T.Z. contributed to
generation of pan-cancer ABSOLUTE-GISTIC analysis. S.E.M is the senior contributing
author and helped in study design and interpretation, and paper editing.
Additional information
Supplementary Information accompanies this paper at http://www.nature.com/naturecommunications
Competing financial interests: The authors declare no competing financial interests.
Reprints and permission information is available online at http://npg.nature.com/reprintsandpermissions/
How to cite this article: Hamilton, M. P. et al. Identification of a pan-cancer oncogenic
microRNA superfamily anchored by a central core seed motif. Nat. Commun. 4:2730
doi: 10.1038/ncomms3730 (2013).
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. To view a copy of
this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/