Gene Expression Omnibus
INFO-703 Biological Data Management
Feb 6th, 2019
Presented by Thi Nguyen
Outline
• Background
• GEO database content
• Data types
• Schema
• 10 ways to use GEO resources
• Conclusion
• Demo
Gene Expression Omnibus (GEO): The basics
• http://www.ncbi.nlm.nih.gov/geo/
• GEO = earth, 72 nations
• global public repository for high-throughput gene expression data
(raw, processed, metadata)
• https://www.ncbi.nlm.nih.gov/geo/summary/
• 1995 saw the rise of microarray technology, which enabled scientists to assay
thousands of genes at the same time
• 2000: the National Center for Biotechnology Information (NCBI) launched the GEO database
• 2002: major journals started to require deposit of data in public repositories
• GEO provides free access to data and offers web-based tools to search,
analyze, and visualize data (directly on the website, no download needed)
From: NCBI GEO: archive for functional genomics data sets—update
Nucleic Acids Res. 2012;41(D1):D991-D995. doi:10.1093/nar/gks1193
Nucleic Acids Res | Published by Oxford University Press 2012. This is an Open Access article distributed under the terms of the
Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/3.0/), which permits non-commercial reuse,
distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact
journals.permissions@oup.com.
GEO Content
Gene Expression Omnibus (GEO): Data submissions
Gene Expression Omnibus (GEO): Data types
GEO’s 3 data entities:
1. platform = list of probes that defines which set of molecules can be detected
2. sample = description of the set of molecules being probed
3. series = organizes samples into the meaningful data sets that make up an experiment
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC99122/
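A minimal Python sketch of the three entities, for illustration only. It assumes the accession prefixes GPL (platform), GSM (sample) and GSE (series) described in the GEO paper, and the rule that each sample has exactly one parent platform. The class names, the helper classify_accession, and the accession numbers are hypothetical and are not part of any GEO API.

```python
import re
from dataclasses import dataclass, field

# Accession prefixes for the three GEO entities.
ENTITY_BY_PREFIX = {"GPL": "platform", "GSM": "sample", "GSE": "series"}
_ACCESSION_RE = re.compile(r"^(GPL|GSM|GSE)(\d+)$")


def classify_accession(accession: str) -> str:
    """Return 'platform', 'sample' or 'series' for a GEO accession string."""
    match = _ACCESSION_RE.match(accession.strip().upper())
    if match is None:
        raise ValueError(f"not a GPL/GSM/GSE accession: {accession!r}")
    return ENTITY_BY_PREFIX[match.group(1)]


@dataclass
class Sample:
    accession: str  # GSM...
    platform: str   # exactly one parent platform (GPL...)


@dataclass
class Series:
    accession: str  # GSE...
    samples: list = field(default_factory=list)

    def platforms(self) -> set:
        """Platforms used by the samples grouped in this series."""
        return {sample.platform for sample in self.samples}


# Hypothetical accession numbers, used only to show the relationships.
series = Series("GSE1", [Sample("GSM1", "GPL1"), Sample("GSM2", "GPL1")])
print(classify_accession(series.accession))
print(series.platforms())
```

A series can mix samples from several platforms, which is why platforms() returns a set rather than a single value.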
GEO schema
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC99122/
10 ways to use GEO resources
1. Retrieve a specific GEO record
2. Quick search using keywords
• GEO Datasets (experiment-centric) http://www.ncbi.nlm.nih.gov/gds/
• GEO Profiles (gene-centric) http://www.ncbi.nlm.nih.gov/geoprofiles/
3. Advanced searches using structured queries
https://www.ncbi.nlm.nih.gov/geo/info/qqtutorial.html
4. Search programmatically
https://www.ncbi.nlm.nih.gov/geo/info/geo_paccess.html
5. Query with a nucleotide sequence
https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&BLAST_SPEC=GeoBlast&PAGE_TYPE=BlastSearch
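Items 1, 3 and 4 can be sketched in Python as follows. This is a hedged example, not the only way to do it: it assumes the NCBI E-utilities esearch endpoint with db=gds for GEO DataSets, the acc.cgi accession viewer for single records, and the [Organism] and [Entry Type] field qualifiers (check the structured-query tutorial linked above for the current field list). The helper names are hypothetical. The code only builds URLs; it does not send any request.

```python
from urllib.parse import urlencode

# Assumed endpoints: NCBI E-utilities search and the GEO accession viewer.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
GEO_ACCESSION_VIEWER = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi"


def build_query(term, organism=None, entry_type=None):
    """Combine a free-text term with optional field-qualified terms."""
    parts = [f'"{term}"' if " " in term else term]
    if organism:
        parts.append(f'"{organism}"[Organism]')
    if entry_type:
        parts.append(f"{entry_type}[Entry Type]")
    return " AND ".join(parts)


def esearch_url(query, retmax=20):
    """URL for a programmatic search of the GEO DataSets database."""
    params = {"db": "gds", "term": query, "retmax": retmax}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


def record_url(accession):
    """URL for retrieving one GEO record by accession."""
    return f"{GEO_ACCESSION_VIEWER}?{urlencode({'acc': accession})}"


query = build_query("hepatocellular carcinoma",
                    organism="Homo sapiens", entry_type="gse")
print(query)
print(esearch_url(query))
print(record_url("GSE1"))  # hypothetical accession, for illustration
```

Replacing db=gds with db=geoprofiles would target the gene-centric GEO Profiles database instead.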
10 ways to use GEO resources
6. Data set analysis tools
• find genes
• compare 2 sets of samples
• Cluster heatmaps
• experiment design and value distribution
7. GEO profiles analysis tools
• Profile Neighbors
• Chromosome Neighbors
• Sequence Neighbors
• HomoloGene Neighbors
• Download Profile Data button
• Find Pathways button
8. Analyze with GEO2R
• immediate, web-based, user-driven interactive analysis tool
• employs the Bioconductor packages GEOquery and limma, with the Benjamini-Hochberg
FDR method as the default multiple-testing correction
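To make the multiple-testing correction concrete, here is a small Python sketch of the standard Benjamini-Hochberg adjustment. GEO2R itself runs R code via limma; this is an independent illustration of the same procedure, not GEO2R's implementation, and the function name and p-values are made up for the example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    For m tests with sorted p-values p(1) <= ... <= p(m), the adjusted
    value at rank i is the minimum of p(j) * m / j over all j >= i,
    capped at 1.
    """
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the running minimum
    # enforces monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
for p, q in zip(raw, benjamini_hochberg(raw)):
    print(f"p = {p:.3f}  adjusted = {q:.3f}")
```

Genes whose adjusted value falls below the chosen FDR threshold (for example 0.05) are reported as significant.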
10 ways to use GEO resources
9. Visualize data as genome track
10. Download GEO data
• SOFT (Simple Omnibus Format in Text) formatted family file: download a series with all its samples and platforms in 1 file
• MINiML: XML rendering of the SOFT format
• Series Matrix File: tab-delimited value matrix
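As a rough illustration of working with a downloaded Series Matrix File, the Python sketch below parses one from a string. It assumes the usual layout: metadata lines starting with "!", and a tab-delimited table between "!series_matrix_table_begin" and "!series_matrix_table_end" whose first column is ID_REF. The function name and the sample content (accessions, probe names, values) are invented for the example.

```python
import csv
import io


def parse_series_matrix(text):
    """Split a Series Matrix File into metadata, sample IDs and values."""
    metadata = {}
    table_lines = []
    in_table = False
    for line in text.splitlines():
        if line.startswith("!series_matrix_table_begin"):
            in_table = True
        elif line.startswith("!series_matrix_table_end"):
            in_table = False
        elif in_table:
            table_lines.append(line)
        elif line.startswith("!"):
            # Metadata: key, then one or more tab-separated quoted values.
            key, *raw_values = line.split("\t")
            cleaned = [value.strip('"') for value in raw_values]
            metadata.setdefault(key[1:], []).extend(cleaned)

    reader = csv.reader(io.StringIO("\n".join(table_lines)), delimiter="\t")
    header = next(reader, None)
    if header is None:  # no value table present
        return metadata, [], {}
    sample_ids = header[1:]
    values = {
        row[0]: [float(cell) if cell else None for cell in row[1:]]
        for row in reader
        if row
    }
    return metadata, sample_ids, values


# Invented file content, for illustration only.
SAMPLE = (
    '!Series_title\t"Toy experiment"\n'
    '!Series_geo_accession\t"GSE0000"\n'
    '!Sample_geo_accession\t"GSM0001"\t"GSM0002"\n'
    '!series_matrix_table_begin\n'
    '"ID_REF"\t"GSM0001"\t"GSM0002"\n'
    '"probe_a"\t7.5\t8.25\n'
    '"probe_b"\t3.0\t\n'
    '!series_matrix_table_end\n'
)

metadata, sample_ids, values = parse_series_matrix(SAMPLE)
print(metadata["Series_title"], sample_ids, values)
```

Empty cells are kept as None rather than dropped, so every row stays aligned with the sample columns.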
GEO for data mining / reuse of public data
• Fast, efficient accessioning of data in a robust, versatile, resource-rich database
• Free to the public
• Flexible query and download approaches
• Many free, user-friendly web-based tools, with no need to download and
process data
• GEO data have been reused and reanalyzed, resulting in thousands of publications:
(http://www.ncbi.nlm.nih.gov/geo/info/citations.html)
• Potential uses
   • test algorithms
   • create new subject-specific databases
   • find disease biomarkers
   • further characterize gene functions

Editor's Notes

  • #5 Figure 1. Distribution of the number and types of selected studies released by GEO each year since inception. Users can explore and download historical submission numbers using the ‘history’ page at http://www.ncbi.nlm.nih.gov/geo/summary/?type=history, as well as constructing GEO DataSet database queries for specific data types and date ranges using the ‘DataSet type’ and ‘publication date’ fields as described at http://www.ncbi.nlm.nih.gov/geo/info/qqtutorial.html.
  • #6 GEO segregates data into three principal components, platform, sample and series (Table 2), each of which is accessioned (i.e. given a unique and constant identifier) in a relational database.
  • #7 GEO segregates data into three principal components, platform, sample and series (Table 2), each of which is accessioned (i.e. given a unique and constant identifier) in a relational database. An instance of a platform is, essentially, a list of probes that define what set of molecules may be detected in any experiment utilizing that platform. For example, the platform data table may contain GEO-defined columns identifying the position and biological reagent contents of each probe (spot) such as a GenBank accession number, open reading frame (ORF) name and clone identifier, as well as submitter-defined columns. Platform accession numbers have a ‘GPL’ prefix. An instance of a sample describes the derivation of the set of molecules that are being probed and utilizes platforms to generate molecular abundance data. Each sample has one, and only one, parent platform which must be previously defined. For example, a sample data table may contain columns indicating the final, relevant abundance value of the corresponding spot defined in its platform, as well as any other GEO-defined (e.g. raw signal, background signal) and submitter-defined columns. Sample accession numbers have a ‘GSM’ prefix. An instance of a series organizes samples into the meaningful data sets which make up an experiment and which are bound together by a common attribute. Series accession numbers have a ‘GSE’ prefix.
  • #9 “hepatocellular carcinoma” into the GEO DataSets search box to retrieve all the DataSet, Series, and Sample records that mention that term. Similarly, if a user is studying the gene CREB5, it is only necessary to type “CREB5” into the GEO Profiles search box to retrieve all gene expression profile records for that gene across all DataSets. The GEO BLAST search function provides the opportunity to perform a sequence-based search against GEO using either a nucleotide sequence or a GenBank accession number. 
  • #10 The GEO BLAST search function provides the opportunity to perform a sequence-based search against GEO using either a nucleotide sequence or a GenBank accession number.