Tripal within the Arabidopsis Information Portal - PAG XXIII



Araport plans to implement a Chado-backed data warehouse, fronted by Tripal, to serve as our core database. It will track multiple versions of genome annotation (TAIR10, Araport11, etc.), evidentiary data used by our annotation update pipeline, metadata such as publications collated from multiple sources (TAIR, NCBI PubMed, and UniProtKB, both curated and unreviewed), and stock/germplasm data linked to AGI loci via their associated polymorphisms.

  1. 1. Tripal within the Arabidopsis Information Portal •  Vivek Krishnakumar, J. Craig Venter Institute •  01/11/2015 •  Tripal Database Network and Initiatives, PAG XXIII, San Diego, CA
  2. 2. Overview •  About Araport •  Current architecture •  Planned implementation – Leverage Chado schema – Accommodate inherited data – Serve as point of integration – Facilitate data sharing via web services
  3. 3. About Araport •  Objectives –  Develop community web interface •  sustainable, fundable and community-extensible •  hosts analysis modules, visualization tools, user data spaces –  Practice data federation •  integrate diverse data sets from distributed sources •  consume and expose data via RESTful web services –  Maintain “gold standard” Col-0 annotation •  assemble tissue-specific transcripts from publicly available RNA-seq datasets •  incorporate novel coding and non-coding genes
  4. 4. Araport •  Explore data •  ThaleMine •  JBrowse •  Science Apps •  Search data •  Quick Search •  BLAST •  Raw data downloads •  Community •  News & Events •  Ask a question •  Job Postings •  Useful Links
  5. 5. Araport Architecture [Architecture diagram; recoverable labels: External programs, Portal, API, Agave Core, Apps, Jobs, Systems, meta data, user profile, ADAMA, service enroll, service manage, Authentication, metering, logging, versioning, HTTPS, CORS, Computing, Storage, Databases, ThaleMine, JBrowse, InterMine, Tripal, Others, REST, SOAP, CGI, Science Apps]
  6. 6. Current implementation: Araport data mart •  Combination of flat-files and databases –  TAIR datasets –  Ontologies (GO, PSI) –  Interactions (BAR) –  Orthologs (Panther) •  Data Mart –  InterMine schema, PostgreSQL DB –  Indexed and flattened for speed –  Rebuilt periodically •  Outputs (published) –  ThaleMine WebApp –  ThaleMine web services •  Live calls to… –  UniProt web services –  PubMed web services [Other diagram labels: Araport warehouse, Web services, InterMine loader, publish]
  7. 7. Planned implementation: Araport warehouse and Araport data mart •  Warehouse –  Chado schema, PostgreSQL DB –  General purpose but slow –  Permanent host for core genomic datasets (assembly, annotation, metadata, etc.) •  Inputs –  Genome annotation pipeline –  Community curation data •  Data Mart –  InterMine schema, PostgreSQL DB –  Indexed and flattened for speed –  Rebuilt periodically •  Outputs (published) –  ThaleMine WebApp –  ThaleMine web services
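The split on slide 7 between a general-purpose warehouse and a flattened, periodically rebuilt data mart can be illustrated with a small sketch. This is not Araport's build process: the real mart uses the InterMine schema on PostgreSQL, whereas the snippet below assumes an in-memory SQLite database, a heavily simplified stand-in for three Chado tables (`feature`, `featureprop`, `cvterm`), and a hypothetical `gene_flat` table. The loci and property values are placeholders.

```python
import sqlite3

# Simplified stand-in for a few Chado tables (the real tables have more columns).
WAREHOUSE_DDL = """
CREATE TABLE cvterm (cvterm_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, uniquename TEXT NOT NULL);
CREATE TABLE featureprop (
    featureprop_id INTEGER PRIMARY KEY,
    feature_id INTEGER NOT NULL REFERENCES feature (feature_id),
    type_id INTEGER NOT NULL REFERENCES cvterm (cvterm_id),
    value TEXT
);
"""

# Hypothetical flattened table: one row per locus, one column per property.
REBUILD_MART_SQL = """
DROP TABLE IF EXISTS gene_flat;
CREATE TABLE gene_flat AS
SELECT f.uniquename AS locus,
       MAX(CASE WHEN t.name = 'symbol' THEN p.value END) AS symbol,
       MAX(CASE WHEN t.name = 'description' THEN p.value END) AS description
FROM feature AS f
LEFT JOIN featureprop AS p ON p.feature_id = f.feature_id
LEFT JOIN cvterm AS t ON t.cvterm_id = p.type_id
GROUP BY f.uniquename;
CREATE INDEX gene_flat_locus_idx ON gene_flat (locus);
"""


def rebuild_mart(conn):
    """Drop and recreate the flattened table from the normalized tables."""
    conn.executescript(REBUILD_MART_SQL)


def demo_warehouse():
    """Create an in-memory warehouse holding placeholder data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(WAREHOUSE_DDL)
    conn.executemany("INSERT INTO cvterm VALUES (?, ?)",
                     [(1, "symbol"), (2, "description")])
    conn.executemany("INSERT INTO feature VALUES (?, ?)",
                     [(1, "LOCUS_A"), (2, "LOCUS_B")])
    conn.executemany("INSERT INTO featureprop VALUES (?, ?, ?, ?)",
                     [(1, 1, 1, "SYM_A"),
                      (2, 1, 2, "placeholder description"),
                      (3, 2, 1, "SYM_B")])
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = demo_warehouse()
    rebuild_mart(conn)
    query = "SELECT locus, symbol, description FROM gene_flat ORDER BY locus"
    for row in conn.execute(query):
        print(row)
```

Reading a gene's symbol and description from `gene_flat` needs no joins, which is the trade-off the slide describes: the mart is fast to query but has to be rebuilt whenever the warehouse changes.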
  8. 8. Chado •  Functions as our low-level (core) Araport data warehouse –  Preserve legacy datasets with appropriate attributions –  Track any new datasets generated (annotation updates, community contributions) –  Serve as point of integration and de-duplication of certain data types –  Integrate with planned community curation interface •  Supports our pursuit of being open-source (and future-proof)
  9. 9. Tripal •  Drupal CMS-based modular framework exposing a user-friendly interface to Chado –  provides standardized loaders for genomic datasets (FASTA, GFF3, GenBank, BLAST, GO, InterProScan, KEGG) –  supports building custom templates and materialized views –  exposes a well-documented API
  10. 10. Integrate data inherited from TAIR •  Currently a combination of flat-files and TAIR’s Oracle database –  Genome Assembly (TAIR9) –  Genome Annotation (TAIR10): genes, pseudogenes, transposons, ncRNAs –  Annotation properties: gene symbols, confidence ranking, functional descriptions, curator summary –  GO Annotations (TAIR curated data at –  Publications (curated gene → publication relationships) –  Variation data: Genetic markers, Polymorphisms (SNPs, TILLING) and T-DNA Insertions –  Stock data (lines, clones, germplasm) •  Chado-backed Tripal will serve as the core repository for this data
  11. 11. Integrate with planned Community Curation Interface
  12. 12. Integrate publication data •  Existing sources for publication data –  TAIR locus to PubMed ID mapping –  NCBI gene2pubmed mapping –  UniProt curated Protein to PubMed ID mapping –  Publications missing PMIDs and/or DOIs •  Chado will act as point of integration –  Combine and de-duplicate publication data from 3 sources (more in the future) –  Collect and store metadata for publications with and without PMID and/or DOIs
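The combine-and-de-duplicate step on slide 12 could look roughly like the sketch below. It is an illustration under stated assumptions, not Araport's loader: the `Publication` record, the `pub_key` and `merge_publications` helpers, and the matching rule (PMID first, then DOI, then a normalized title for records lacking both) are all hypothetical, and the identifiers in the usage example are placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Set, Tuple


@dataclass
class Publication:
    """Hypothetical publication record; not a Chado table definition."""
    title: str
    pmid: Optional[str] = None
    doi: Optional[str] = None
    sources: Set[str] = field(default_factory=set)


def pub_key(pub: Publication) -> Tuple[str, str]:
    """Pick the strongest available identifier as the de-duplication key."""
    if pub.pmid:
        return ("pmid", pub.pmid.strip())
    if pub.doi:
        return ("doi", pub.doi.strip().lower())
    # Fallback for records without PMID or DOI: case/whitespace-normalized title.
    return ("title", " ".join(pub.title.lower().split()))


def merge_publications(records: Iterable[Publication]) -> List[Publication]:
    """Merge records sharing a key, pooling sources and filling a missing DOI."""
    merged: Dict[Tuple[str, str], Publication] = {}
    for rec in records:
        key = pub_key(rec)
        if key in merged:
            existing = merged[key]
            existing.sources |= rec.sources
            existing.doi = existing.doi or rec.doi
        else:
            # Copy the record so the caller's input objects are not mutated.
            merged[key] = Publication(rec.title, rec.pmid, rec.doi, set(rec.sources))
    return list(merged.values())


if __name__ == "__main__":
    records = [
        Publication("A paper", pmid="111", sources={"TAIR"}),
        Publication("A paper", pmid="111", doi="10.0000/example", sources={"NCBI"}),
        Publication("No  IDs paper", sources={"UniProt"}),
    ]
    for pub in merge_publications(records):
        print(pub)
```

One limitation of this keying scheme: a record carrying only a DOI will not be matched to a record for the same paper that carries only a PMID. A real pipeline would need a second pass, or an external ID mapping, to reconcile those cases.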
  13. 13. Integrate Stock data •  TAIR stock related tables mapped to corresponding Chado counterpart •  Custom loaders developed to perform bulk update of Stock information, Phenotypes, Polymorphism data and mappings to AGI locus
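A table-to-table mapping followed by a bulk update, as described on slide 13, might be sketched as follows. The source column names on the left of `COLUMN_MAP` are invented for illustration and are not TAIR's actual schema. The target columns (`uniquename`, `name`, `description`) do exist in Chado's `stock` table, but the table is cut down to three columns here. SQLite's `INSERT OR REPLACE` stands in for whatever upsert logic the real PostgreSQL loaders use.

```python
import sqlite3

# Hypothetical source column -> Chado `stock` column mapping.
# Left-hand names are invented; right-hand names are real `stock` columns.
COLUMN_MAP = {
    "stock_accession": "uniquename",
    "stock_name": "name",
    "stock_description": "description",
}

# Cut-down stand-in for Chado's `stock` table (the real one has more columns).
STOCK_DDL = """
CREATE TABLE stock (
    uniquename TEXT PRIMARY KEY,
    name TEXT,
    description TEXT
);
"""


def to_chado_row(source_row):
    """Rename source columns to their Chado counterparts."""
    return {chado_col: source_row.get(src_col)
            for src_col, chado_col in COLUMN_MAP.items()}


def bulk_upsert(conn, source_rows):
    """Insert new stocks and overwrite existing ones with the same uniquename."""
    rows = [to_chado_row(row) for row in source_rows]
    conn.executemany(
        "INSERT OR REPLACE INTO stock (uniquename, name, description) "
        "VALUES (:uniquename, :name, :description)",
        rows,
    )
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(STOCK_DDL)
    bulk_upsert(conn, [
        {"stock_accession": "STOCK_1", "stock_name": "line 1",
         "stock_description": "placeholder"},
        {"stock_accession": "STOCK_2", "stock_name": "line 2",
         "stock_description": None},
    ])
    # Re-loading an existing accession updates it in place.
    bulk_upsert(conn, [
        {"stock_accession": "STOCK_1", "stock_name": "line 1",
         "stock_description": "updated placeholder"},
    ])
    print(conn.execute("SELECT * FROM stock ORDER BY uniquename").fetchall())
```

Keeping the column mapping in a plain dictionary separates the schema correspondence from the loading code, so a change in either schema only touches the mapping.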
  14. 14. Role of Tripal within Araport •  Tripal is under active development, with plans in place to begin developing rational web services (WS) as well as support interoperability •  Araport plans to be involved in this working group to satisfy the following needs of our project: –  Expose live data from future annotation update pipelines to the community directly via WS –  Expose stock data via WS in a standardized manner to Arabidopsis stock centers (both ABRC and NASC) to aid data synchronization –  Embrace and support other open-source initiatives
  15. 15. Araport on GitHub •  GitHub organization: •  Relevant repositories: –  tair-chado-batchflow –  chado_pub_loader –  pasa-chado-hook –  GMOD/Apollo (fork)
  16. 16. Acknowledgements •  JCVI Developers –  Maria Kim –  Irina Belyaeva –  Svetlana Karamycheva •  Tripal co-PI Stephen Ficklin and development community •  TAIR/Phoenix Bio: assistance with data migration •  Funding Agencies
  17. 17. Araport Team •  Chris Town, PI •  Lisa McDonald, Education and Outreach Coordinator •  Chris Nelson, Project Manager •  Jason Miller, Co-PI, JCVI Technical Lead •  Erik Ferlanti, Software Engineer •  Vivek Krishnakumar, Bioinf. Engineer •  Svetlana Karamycheva, Bioinf Engineer •  Eva Huala, Project lead, TAIR •  Bob Muller, Technical lead, TAIR •  Gos Micklem, co-PI •  Sergio Contrino, Software Engineer •  Matt Vaughn, co-PI •  Steve Mock, Advanced Computing Interfaces •  Rion Dooley, Web and Cloud Services •  Matt Hanlon, Web and Mobile Applications •  Maria Kim, Bioinf Engineer •  Ben Rosen, Bioinf Analyst •  Joe Stubbs, API Developer Platform •  Walter Moreira, API Developer Federation •  Chris Jordan, Database Manager •  Eleanor Pence, Intern •  Chia-Yi Cheng, Bioinf Analyst •  Seth Schobel, Bioinf. Engineer •  Irina Belyaeva, Software Engineer
  18. 18. THANK YOU!
  19. 19. Araport @ PAG XXIII (Session; Details; Topic(s); Presenter(s)) •  Tripal Database Network and Initiatives; Sunday, January 11, 2015, 5:30 PM-5:45 PM, California –  W876: Tripal within the Arabidopsis Information Portal; Vivek Krishnakumar •  Arabidopsis Information Portal & IAIC Workshop; Monday, January 12, 2015, 12:50 PM-3:00 PM, Pacific Salon 6-7 (2nd Floor) –  W059: Walkthrough the Araport Web Site –  W061: Exposing Web Services for Araport –  W062: Developing applications for Araport –  Presenters: Chia-Yi Cheng, Jason Miller, Matt Vaughn •  Computer Demo 2; Tuesday, January 13, 2015, 12:30 PM, California –  C23: Using the Arabidopsis Information Portal; Jason Miller •  GMOD; Wednesday, January 14, 2015, 11:30 AM, Golden West –  W410: JBrowse within the Arabidopsis Information Portal; Vivek Krishnakumar •  Poster Session – Even; Monday, January 12, 2015, 10:00 AM-11:30 AM, Grand Exhibit Hall –  P0790: Data Integration for the Plant Research Community: Araport –  P0792: Developing Content for the Arabidopsis Information Portal –  Presenters: Chia-Yi Cheng, Matt Vaughn