Automated Data Pipelines for Loading, Integrating, Annotating and Quality Control of Data at the Rat Genome Database

Marek Tutaj, Mary Shimoyama, Elizabeth A. Worthey, Jennifer Smith, Rajni Nigam, Victoria Petri, Stan Laulederkind, Timothy F. Lowry, Tom Hayman, Shur-Jen Wang, Jeff De Pons, Pushkala Jayaraman, Weisong Liu, Diane Munzenmaier, Melinda Dwinell, Simon Twigger, Howard Jacob
Rat Genome Database, Human and Molecular Genetics Center, Medical College of Wisconsin

Abstract

As the richness and diversity of biological data increase, model organism databases are confronted with the problem of quickly and efficiently populating their databases as well as providing timely updates to the information that they store. The Rat Genome Database (RGD, http://rgd.mcw.edu) provides comprehensive rat genetic, genomic and biological data through both manual and automated curation processes. A series of automated data pipelines has been implemented to acquire various data types from multiple sources, integrate them with existing data and provide comprehensive quality control data. The goal is to maximize data coverage and reserve manual processes for targeted curation projects covering data that are unavailable anywhere except the literature.

Data acquired through these pipelines include:
1) basic genomic elements such as genes and accompanying map, sequence and external database identifiers, protein information, and genomic positions of exons and coding regions;
2) orthologs and ortholog relationships;
3) nomenclature alerts and reviews;
4) Gene Ontology annotations for human and mouse orthologs stored in RGD, as well as appropriate annotations to rat genes;
5) ontology terms and relationships for GO, Mammalian Phenotype Ontology and Pathway Ontology.

The pipelines at RGD use either incremental updates or delete-and-reload mechanisms and are run weekly to keep data up to date and synchronized with the originating data sources. Pipeline mechanisms, quality control measures, and the methods to time and synchronize multiple pipelines will be presented, along with the data types acquired and integrated and the process for resolving data errors and conflicts discovered during the QC processes.

Fig 1. Synchronization of multiple pipelines at RGD assures data consistency, while weekly runs provide timely updates.
Weekly schedule (Monday through Sunday), figure labels: Ontology Loading Pipeline; RGD Terms Reindexing (for search engine); GOC Annotations FTP Extract; Entrezgene Pipeline (rat, human, mouse); Ortholog Loading; UniProtKB sprot/trembl; Process GOC Annotations; GO Annotation Pipeline; Mouse and Human GO Annotation; UniSTS Pipeline; Data Release; Data Release (ctd.)

Fig 2. RGD pages are populated by multiple automated pipelines for various data types.
Figure labels: Nomenclature Pipeline, Manual Curation; GOA and Mouse and Human GO Annotation Pipelines, Manual Curation; Ortholog Relationship Pipeline; Entrez Gene Pipelines; Entrez Gene and UniProtKB Pipelines; Nomenclature Pipeline

1. Gene QC at RGD

Fig 3. Entrez Gene Pipelines automatically query the NCBI databases for gene records that have been modified during the last week and make the necessary updates to RGD genes.

Fig 4. Pipeline web report pages show the results and summaries from the last runs. A set of flags is assigned to every record (gene) processed, so a curator can quickly jump to a group of records of interest.

Fig 5. The curator can browse through a particular class of conflicting records. The full hyperlinked XML representation of the incoming record is provided to allow for faster conflict resolution.

2. Orthologs and Nomenclature QC at RGD

Fig 6. The RGD Nomenclature Pipeline ensures nomenclature QC by proposing necessary nomenclature changes every time the tool is opened.

Fig 7. RGD Nomenclature Curation Software

3. Assuring Gene Identity in Multiple RGD Pipelines

Fig 8. The pipelines at RGD use cross-referencing identifiers, which expedites data exchange between different resources and ensures gene identity.
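
The two load mechanisms named in the abstract (incremental update and delete-and-reload), together with the per-record flags described for the pipeline report pages, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the GeneRecord fields, the flag names, the rule that a changed symbol is held for curator review, and the function names are hypothetical and do not reproduce RGD's actual schema or pipeline code. Records are matched on a cross-referenced external identifier, as the poster describes for assuring gene identity.

```python
from dataclasses import dataclass


@dataclass
class GeneRecord:
    """Hypothetical, simplified gene record (not RGD's actual schema)."""
    entrez_id: str   # cross-referenced external identifier used for matching
    symbol: str
    name: str = ""


def incremental_update(store, incoming):
    """Merge incoming records into `store` (a dict keyed by entrez_id).

    Returns a dict mapping each processed identifier to an illustrative
    QC flag, so a report page could group records by flag.
    """
    flags = {}
    for rec in incoming:
        existing = store.get(rec.entrez_id)
        if existing is None:
            # No record with this identifier yet: load it.
            store[rec.entrez_id] = rec
            flags[rec.entrez_id] = "INSERTED"
        elif existing == rec:
            flags[rec.entrez_id] = "UNCHANGED"
        elif existing.symbol != rec.symbol:
            # Assumption: symbol changes are not applied automatically
            # but flagged for a curator to review.
            flags[rec.entrez_id] = "SYMBOL_CONFLICT"
        else:
            # Same symbol, other fields differ: overwrite in place.
            store[rec.entrez_id] = rec
            flags[rec.entrez_id] = "UPDATED"
    return flags


def delete_and_reload(incoming):
    """Discard the old contents and rebuild the store from the source."""
    return {rec.entrez_id: rec for rec in incoming}


if __name__ == "__main__":
    # Made-up identifiers and symbols, for demonstration only.
    store = {"1001": GeneRecord("1001", "Abc1", "old name")}
    incoming = [
        GeneRecord("1001", "Abc1", "new name"),
        GeneRecord("1002", "Def2"),
    ]
    for gene_id, flag in incremental_update(store, incoming).items():
        print(gene_id, flag)
```

The incremental path touches only records that differ and leaves a flag trail for review, while the reload path is simpler and guarantees that the store mirrors the source, at the cost of rewriting everything on each run.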
