Tripal within the Arabidopsis Information Portal - PAG XXIII

araport.org@araport
Tripal within the Arabidopsis
Information Portal
Vivek Krishnakumar
J. Craig Venter Institute
01/11/2015
Tripal Database Network and Initiatives
PAG XXIII, San Diego, CA

Overview
•  About Araport
•  Current architecture
•  Planned implementation
– Leverage Chado schema
– Accommodate inherited data
– Serve as point of integration
– Facilitate data sharing via web services

About Araport
•  Objectives
–  Develop community web interface
•  sustainable, fundable and community-extensible
•  hosts analysis modules, visualization tools, user data
spaces
–  Practice data federation
•  integrate diverse data sets from distributed sources
•  consume and expose data via RESTful web services
–  Maintain “gold standard” Col-0 annotation
•  assemble tissue-specific transcripts from publicly available
RNA-seq datasets
•  incorporate novel coding and non-coding genes

Araport
https://www.araport.org
•  Explore data
•  ThaleMine
•  JBrowse
•  Science Apps
•  Search data
•  Quick Search
•  BLAST
•  Raw data downloads
•  Community
•  News & Events
•  Ask a question
•  Job Postings
•  Useful Links

Araport Architecture
[Figure: Araport architecture diagram. Recoverable labels:]
•  Clients: External programs; Portal (www.araport.org)
•  API (api.araport.org): authentication, metering, logging, versioning, HTTPS, CORS
•  Agave Core: apps, jobs, systems, metadata, user profile
•  ADAMA: service management, service enrollment
•  Resources: computing, storage, databases
•  Science Apps (a, b, c, d, e, f); ThaleMine; JBrowse
•  Federated sources: InterMine, Tripal and others, reached via CGI, SOAP and REST

Current implementation
Araport warehouse
Combination of flat-files and databases
•  TAIR datasets
•  Ontologies (GO, PSI)
•  Interactions (BAR)
•  Orthologs (Panther)
Web services
InterMine loader live calls to…
•  UniProt web services
•  PubMed web services
publish
Araport data mart
•  InterMine schema, PostgreSQL DB
•  Indexed and flattened for speed
•  Rebuilt periodically
publish
Outputs
•  ThaleMine WebApp
•  ThaleMine web services

Planned implementation
Inputs
•  Genome annotation pipeline
•  Community curation data
Araport warehouse
•  Chado schema, PostgreSQL DB
•  General purpose but slow
•  Permanent host for core genomic datasets (assembly, annotation,
metadata, etc.)
Araport data mart
•  InterMine schema, PostgreSQL DB
•  Indexed and flattened for speed
•  Rebuilt periodically
publish
Outputs
•  ThaleMine WebApp
•  ThaleMine web services
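The split between a general-purpose warehouse and a flattened, indexed data mart can be sketched as follows. This is only an illustration, not the Araport build process: it uses an in-memory SQLite database as a stand-in for PostgreSQL, a heavily trimmed subset of the Chado `feature`, `featureprop` and `cvterm` tables, and a hypothetical `mart_gene` table in place of the InterMine schema. The demo rows are illustrative.

```python
import sqlite3


def make_demo_warehouse():
    """Create a tiny normalized 'warehouse' (trimmed Chado-like tables)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE feature (
            feature_id INTEGER PRIMARY KEY,
            uniquename TEXT NOT NULL
        );
        CREATE TABLE cvterm (
            cvterm_id INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE featureprop (
            featureprop_id INTEGER PRIMARY KEY,
            feature_id INTEGER NOT NULL REFERENCES feature(feature_id),
            type_id INTEGER NOT NULL REFERENCES cvterm(cvterm_id),
            value TEXT
        );
        INSERT INTO feature VALUES (1, 'AT1G01010'), (2, 'AT1G01020');
        INSERT INTO cvterm VALUES (1, 'symbol'), (2, 'description');
        INSERT INTO featureprop VALUES
            (1, 1, 1, 'NAC001'),
            (2, 1, 2, 'NAC domain containing protein 1'),
            (3, 2, 1, 'ARV1');
    """)
    return conn


def build_mart(conn):
    """Rebuild the flattened table from scratch (hypothetical mart_gene)."""
    cur = conn.cursor()
    cur.execute("DROP TABLE IF EXISTS mart_gene")
    # Pivot one row per property into one wide row per gene.
    cur.execute("""
        CREATE TABLE mart_gene AS
        SELECT f.uniquename AS locus,
               MAX(CASE WHEN t.name = 'symbol' THEN p.value END) AS symbol,
               MAX(CASE WHEN t.name = 'description' THEN p.value END) AS description
        FROM feature f
        LEFT JOIN featureprop p ON p.feature_id = f.feature_id
        LEFT JOIN cvterm t ON t.cvterm_id = p.type_id
        GROUP BY f.feature_id, f.uniquename
    """)
    # Index the column that lookups are expected to use.
    cur.execute("CREATE INDEX mart_gene_locus_idx ON mart_gene (locus)")
    conn.commit()


if __name__ == "__main__":
    conn = make_demo_warehouse()
    build_mart(conn)
    for row in conn.execute("SELECT * FROM mart_gene ORDER BY locus"):
        print(row)
```

Dropping and recreating the flattened table mirrors the "rebuilt periodically" point: the mart is treated as disposable, while the normalized tables remain the permanent record.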

Chado
•  Functions as our low-level (core) Araport data
warehouse
–  Preserve legacy datasets with appropriate attributions
–  Track any new datasets generated (annotation updates,
community contributions)
–  Serve as point of integration and de-duplication of
certain data types
–  Integrate with planned community curation interface
•  Supports our pursuit of being open-source (and
future-proof)
http://gmod.org/wiki/Chado

Tripal
•  Modular framework built on the Drupal CMS,
exposing a user-friendly interface to Chado
– provides standardized loaders for genomic
datasets (FASTA, GFF3, GenBank, BLAST,
GO, InterProScan, KEGG)
– supports building custom templates and
materialized views
– exposes a well-documented API
http://tripal.info

Integrate data inherited from TAIR
•  Currently a combination of flat-files and TAIR’s Oracle database
–  Genome Assembly (TAIR9)
–  Genome Annotation (TAIR10): genes, pseudogenes, transposons,
ncRNAs
–  Annotation properties: gene symbols, confidence ranking, functional
descriptions, curator summary
–  GO Annotations (TAIR curated data at geneontology.org)
–  Publications (curated gene → publication relationships)
–  Variation data: Genetic markers, Polymorphisms (SNPs, TILLING) and
T-DNA Insertions
–  Stock data (lines, clones, germplasm)
•  Chado-backed Tripal will serve as the core repository for this data
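One detail that matters when moving GFF3 annotation into Chado is the coordinate convention: GFF3 uses 1-based, inclusive start and end positions, while Chado's `featureloc` table stores interbase coordinates (`fmin` is 0-based, `fmax` is the GFF3 end) and encodes strand as 1, -1 or 0. The sketch below illustrates only that conversion. It is not the Tripal GFF3 loader, the function name and the returned dictionary layout are hypothetical, and it does not decode URL-escaped attribute values.

```python
def gff3_to_featureloc(line):
    """Convert one GFF3 feature line to Chado-style interbase coordinates."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated GFF3 columns")
    seqid, _source, ftype, start, end, _score, strand, _phase, attrs = cols

    # Column 9 is a ';'-separated list of key=value pairs.
    attributes = dict(
        pair.split("=", 1) for pair in attrs.split(";") if "=" in pair
    )

    strand_map = {"+": 1, "-": -1}  # anything else is stored as 0
    return {
        "uniquename": attributes.get("ID"),
        "type": ftype,
        "srcfeature": seqid,
        "fmin": int(start) - 1,  # 1-based inclusive -> 0-based interbase
        "fmax": int(end),        # end is unchanged in interbase coordinates
        "strand": strand_map.get(strand, 0),
    }


if __name__ == "__main__":
    example = (
        "Chr1\tTAIR10\tgene\t3631\t5899\t.\t+\t.\t"
        "ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010"
    )
    print(gff3_to_featureloc(example))
```

With interbase coordinates the feature length is simply `fmax - fmin`, which avoids off-by-one errors when features from different sources are compared.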

Integrate with planned Community
Curation Interface

Integrate publication data
•  Existing sources for publication data
–  TAIR locus to PubMed ID mapping
–  NCBI gene2pubmed mapping
–  UniProt curated Protein to PubMed ID mapping
–  Publications missing PMIDs and/or DOIs
•  Chado will act as point of integration
–  Combine and de-duplicate publication data from 3
sources (more in the future)
–  Collect and store metadata for publications with and
without PMID and/or DOIs

Integrate Stock data
•  TAIR stock related tables mapped to corresponding Chado counterpart
•  Custom loaders developed to perform bulk update of Stock information,
Phenotypes, Polymorphism data and mappings to AGI locus
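The table-to-table mapping described above can be pictured with a small sketch. The target keys are real columns of the Chado `stock` table (`uniquename`, `name`, `description`, `type_id`, `organism_id`, `is_obsolete`), but the TAIR-side field names (`stock_name`, `stock_type`, `description`, `is_obsolete` as a T/F flag), the type names and both functions are hypothetical. This is not the code in the Araport loaders.

```python
def tair_stock_to_chado(row, organism_id, type_ids):
    """Map one TAIR-style stock row (hypothetical fields) to Chado stock columns."""
    name = (row.get("stock_name") or "").strip()
    if not name:
        raise ValueError("stock row has no name")
    stock_type = row.get("stock_type")
    if stock_type not in type_ids:
        raise ValueError(f"unmapped stock type: {stock_type!r}")
    return {
        "uniquename": name,
        "name": name,
        "description": (row.get("description") or "").strip() or None,
        "type_id": type_ids[stock_type],   # cvterm_id for the stock type
        "organism_id": organism_id,
        "is_obsolete": row.get("is_obsolete") == "T",  # assumed T/F flag
    }


def convert_stocks(rows, organism_id, type_ids):
    """Convert rows in bulk, collecting rejects instead of aborting the batch."""
    records, rejected = [], []
    for row in rows:
        try:
            records.append(tair_stock_to_chado(row, organism_id, type_ids))
        except ValueError as err:
            rejected.append((row, str(err)))
    return records, rejected


if __name__ == "__main__":
    rows = [
        {"stock_name": "CS60000", "stock_type": "germplasm",
         "description": "Col-0", "is_obsolete": "F"},
        {"stock_name": "", "stock_type": "clone"},
    ]
    records, rejected = convert_stocks(rows, 1, {"germplasm": 10, "clone": 11})
    print(records)
    print(rejected)
```

Collecting rejected rows rather than raising on the first failure suits bulk updates, where a handful of malformed legacy rows should not block the rest of the load.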

Role of Tripal within Araport
•  Tripal is under active development, with plans in
place to begin developing rational web services
(WS) as well as support interoperability
•  Araport plans to be involved in this working
group to satisfy the following needs of our
project:
–  Expose live data from future annotation update
pipelines to the community directly via WS
–  Expose stock data via WS in a standardized manner
to Arabidopsis stock centers (both ABRC and NASC)
to aid data synchronization
–  Embrace and support other open-source initiatives

Araport on GitHub
•  GitHub organization:
https://www.github.com/Arabidopsis-Information-Portal
•  Relevant repositories:
–  tair-chado-batchflow
–  chado_pub_loader
–  pasa-chado-hook
–  GMOD/Apollo (fork)

Acknowledgements
•  JCVI Developers
–  Maria Kim
–  Irina Belyaeva
–  Svetlana Karamycheva
•  Tripal co-PI Stephen Ficklin and development
community
•  TAIR/Phoenix Bio: assistance with data
migration
•  Funding Agencies

Araport Team
Chris Town, PI
Lisa McDonald, Education and Outreach Coordinator
Chris Nelson, Project Manager
Jason Miller, Co-PI, JCVI Technical Lead
Erik Ferlanti, Software Engineer
Vivek Krishnakumar, Bioinf. Engineer
Svetlana Karamycheva, Bioinf. Engineer
Irina Belyaeva, Software Engineer
Maria Kim, Bioinf. Engineer
Ben Rosen, Bioinf. Analyst
Chia-Yi Cheng, Bioinf. Analyst
Seth Schobel, Bioinf. Engineer
Eleanor Pence, Intern
Eva Huala, Project lead, TAIR
Bob Muller, Technical lead, TAIR
Gos Micklem, co-PI
Sergio Contrino, Software Engineer
Matt Vaughn, co-PI
Steve Mock, Advanced Computing Interfaces
Rion Dooley, Web and Cloud Services
Matt Hanlon, Web and Mobile Applications
Joe Stubbs, API Developer, Platform
Walter Moreira, API Developer, Federation
Chris Jordan, Database Manager

THANK YOU!

Araport @ PAG XXIII
•  Tripal Database Network and Initiatives: Sunday, January 11, 2015, 5:30 PM-5:45 PM, California
–  W876: Tripal within the Arabidopsis Information Portal (Vivek Krishnakumar)
•  Arabidopsis Information Portal & IAIC Workshop: Monday, January 12, 2015, 12:50 PM-3:00 PM, Pacific Salon 6-7 (2nd Floor)
–  W059: Walkthrough the Araport Web Site (Chia-Yi Cheng)
–  W061: Exposing Web Services for Araport (Jason Miller)
–  W062: Developing applications for Araport (Matt Vaughn)
•  Computer Demo 2: Tuesday, January 13, 2015, 12:30 PM, California
–  C23: Using the Arabidopsis Information Portal (Jason Miller)
•  GMOD: Wednesday, January 14, 2015, 11:30 AM, Golden West
–  W410: JBrowse within the Arabidopsis Information Portal (Vivek Krishnakumar)
•  Poster Session – Even: Monday, January 12, 2015, 10:00 AM-11:30 AM, Grand Exhibit Hall
–  P0790: Data Integration for the Plant Research Community: Araport (Chia-Yi Cheng)
–  P0792: Developing Content for the Arabidopsis Information Portal (Matt Vaughn)
