An overview of the design, technical decisions, and implementation of the Arabidopsis Information Portal community-extensible data sharing and analytics platform.
Data Ingest Self Service and Management using NiFi and Kafka (DataWorks Summit)
We’re feeling the growing pains of maintaining a large data platform. Last year we went from 50 to 150 unique data feeds, adding each one by hand. In this talk we will share the best practices we developed to handle this tripling of feeds through self-service. Self-service capabilities will increase your team’s velocity and decrease your time to value and insight.
* Self-service data feed design and ingest
* Configuration management
* Automatic debugging
* Lightweight data governance
Accelerating query processing with materialized views in Apache Hive (DataWorks Summit)
Over the last few years, the Apache Hive community has been working on advancements that enable a whole new range of use cases for the project, moving from its batch-processing roots toward an interactive SQL query-answering platform. Traditionally, one of the most powerful techniques used to accelerate query processing in data warehouses is the precomputation of relevant summaries, or materialized views.
This talk presents our work on introducing materialized views and automatic query rewriting based on those materializations in Apache Hive. In particular, materialized views can be stored natively in Hive or in other systems such as Druid using custom storage handlers, and they can seamlessly exploit exciting new Hive features such as LLAP acceleration. The optimizer relies on Apache Calcite to automatically produce full and partial rewritings for a large set of query expressions comprising projections, filters, joins, and aggregation operations. We describe the current coverage of the rewriting algorithm, how Hive controls important aspects of the life cycle of materialized views such as the freshness of their data, and outline interesting directions for future improvements. We include an experimental evaluation highlighting the benefits that the use of materialized views can bring to the execution of Hive workloads.
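As a rough illustration of the idea behind the talk (plain Python, not Hive; the table, view, and rewrite step are invented stand-ins), a precomputed summary can answer a matching aggregate query without touching the base rows:

```python
# Conceptual sketch: a precomputed summary standing in for a materialized
# view, plus a naive "rewrite" step that answers a matching aggregate query
# from the summary instead of the base data.
from collections import defaultdict

# Base "fact table": (store, amount) rows.
sales = [("a", 10), ("b", 5), ("a", 7), ("c", 3), ("b", 2)]

# Analogous to: CREATE MATERIALIZED VIEW mv AS
#               SELECT store, SUM(amount) FROM sales GROUP BY store
mv = defaultdict(int)
for store, amount in sales:
    mv[store] += amount

def total_for(store):
    # The optimizer's job in a nutshell: the query matches the view
    # definition, so serve it from the precomputation, not the base rows.
    return mv[store]

print(total_for("a"))  # 17
```

Hive's actual rewriting, of course, happens inside the Calcite-based optimizer rather than in user code; this only sketches the precompute-then-rewrite principle.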
Speaker
Jesus Camacho Rodriguez, Member of Technical Staff, Hortonworks
How USCIS Powered a Digital Transition to eProcessing with Kafka (Rob Brown &…) (Confluent)
Last year, U.S. Citizenship and Immigration Services (USCIS) adopted a new strategy to accelerate our transition to a digital business model. This eProcessing strategy connects previously siloed technology systems to provide a complete digital experience that will shorten decision timelines, increase transparency, and more efficiently handle the 8 million requests for immigration benefits the agency receives each year.
To pursue this strategy effectively, we had to rethink and overhaul our IT landscape, one that has much in common with those of other large enterprises in both the public and private sectors. We had to move away from antiquated ETL processes and overnight batch processing. And we needed to move away from the jumble of ESBs, message queues, and spaghetti-stringed direct connections that were used for interservice communication.
Today, eProcessing is powered by real-time event streaming with Apache Kafka and Confluent Platform. We are building out our data mesh with microservices, CDC, and an event-driven architecture. This common core platform has reduced the cognitive load on development teams, who can now spend more time on delivering quality code and new features, and less on DevSecOps and infrastructure activities. As teams have started to align around this platform, a culture of reusability has grown. We’ve seen a reduction in duplication of effort (in some cases by up to 50%) across the organization, from case management to risk and fraud.
Join us at this session, where we will share how we:
* Used skunkworks projects early on to build internal knowledge and set the stage for the eProcessing form factory that would drive the digital transition at USCIS
* Aggregated disparate systems around a common event-streaming platform that enables greater control without stifling innovation
* Ensured compliance with FIPS 140-2 and the other security standards we are bound by
* Developed working agreements that clearly define the type of data a topic contains, including any personally identifiable information requiring additional measures
* Simplified onboarding and restricted jumpbox access with Jenkins jobs that can be used to create topics in dev and other environments
* Implemented distributed tracing across all topics to track payloads throughout our entire domain structure
* Started using KSQL to build streaming apps that extract relevant data from topics, among other use cases
* Supported grassroots efforts to increase use of the platform and foster cross-team communities that collaborate to increase reuse and minimize duplicated effort
* Established a roadmap for federation with other agencies, which includes replacing SOAP, SFTP, and other outdated data-sharing approaches with Kafka event streaming
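The distributed-tracing point above can be sketched in miniature (pure Python; the topic names and event shape are hypothetical, and plain lists stand in for real Kafka topics):

```python
# Hypothetical sketch of the distributed-tracing idea: every payload carries
# a trace id that is propagated as it moves between "topics" (plain lists
# here, not real Kafka), so one request can be followed across the domain.
import uuid

topics = {"intake": [], "adjudication": []}

def publish(topic, payload, trace_id=None):
    # New events get a fresh trace id; downstream republishing reuses it.
    event = {"trace_id": trace_id or str(uuid.uuid4()), "payload": payload}
    topics[topic].append(event)
    return event["trace_id"]

# A downstream service re-publishes under the same trace id.
tid = publish("intake", {"form": "I-90"})
publish("adjudication", {"decision": "pending"}, trace_id=tid)

# Reconstruct the full trace across all topics.
trace = [e for evs in topics.values() for e in evs if e["trace_id"] == tid]
print(len(trace))  # 2
```

In a real deployment the trace id would typically ride in Kafka record headers and be collected by tracing infrastructure, but the propagation pattern is the same.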
Spark Summit EU 2015: Matei Zaharia keynote (Databricks)
2015 was a year of continued growth for Spark, with numerous additions to the core project and very fast growth of use cases across the industry. In this talk, I’ll look back at how the Spark community has grown and changed in 2015, based on a large Apache Spark user survey conducted by Databricks. We see some interesting trends in the diversity of runtime environments (which are increasingly not just Hadoop), the types of applications run on Spark, and the types of users, now that features like R support and DataFrames are available in Spark. I’ll also cover the ongoing work in the upcoming releases of Spark to support new use cases.
Introduction to Apache NiFi, DWS DC 2019 (Timothy Spann)
A quick introduction to Apache NiFi and its ecosystem, plus a hands-on demo covering processors, examining provenance, and ingesting REST feeds, XML, cameras, and files; running TensorFlow and Apache MXNet; integrating with Spark and Kafka; and storing to HDFS, HBase, Phoenix, Hive, and S3.
Building Streaming Data Applications Using Apache Kafka (Slim Baltagi)
Apache Kafka evolved from an enterprise messaging system into a fully distributed streaming data platform for building real-time streaming data pipelines and streaming data applications, without the need for other tools or clusters for data ingestion, storage, and stream processing.
In this talk you will learn more about:
1. A quick introduction to Kafka Core, Kafka Connect, and Kafka Streams: what they are and why they matter
2. Code and step-by-step instructions to build an end-to-end streaming data application using Apache Kafka
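To make the flavor of such an application concrete, here is a broker-free sketch in plain Python of the classic streaming word-count pattern (the state store and changelog are simulated with ordinary data structures, not real Kafka APIs):

```python
# A toy, broker-free sketch of the Kafka Streams "word count" pattern:
# consume records from a stream, update a state store, and emit an update
# to a changelog for every state change.
from collections import Counter

stream = ["hello world", "hello kafka", "kafka streams"]

state = Counter()   # stands in for a Kafka Streams state store
changelog = []      # stands in for the output/changelog topic

for record in stream:
    for word in record.split():
        state[word] += 1
        changelog.append((word, state[word]))

print(state["kafka"])  # 2
```

A real Kafka Streams application would express the same topology with a builder API and get partitioning, fault tolerance, and state-store backups for free; the per-record update logic is what this sketch mirrors.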
Apache Apex brings you the power to quickly build and run big data batch and stream processing applications. But what about visualizing your data in real time as it flows through the Apache Apex applications? Together, we will review Apache Apex, and how it integrates with Apache Hadoop and Apache Kafka to process your big data with streaming computation. Then we will explore the options available to visualize Apex applications metrics and data, including open-source options like REST and PubSub mechanisms in StrAM, as well as features available in the RTS Console like real-time Dashboards and Widgets. We will also look into ways of packaging dashboards inside your Apache Apex applications.
Breathing New Life into Apache Oozie with Apache Ambari Workflow Manager (DataWorks Summit)
Running scheduled, long-running, or repetitive workflows on Hadoop clusters, especially secure clusters, is the domain of Apache Oozie. Oozie, however, suffers from XML-based job configuration and a dated UI, making for poor usability all around. Apache Ambari, in its quest to make cluster management easier, has branched out to offering views for user services. This talk covers the Ambari Workflow Manager view, which provides a GUI to author and visualize Oozie jobs.
As an example of Workflow Manager, Oozie jobs for log management and HBase compactions will be demonstrated, showing how easy Oozie can now be and what the exciting future for Oozie and Workflow Manager holds.
Apache Oozie is the long-time incumbent in big data workflow scheduling. It is known to be hard to use and the interface is not aesthetically pleasing; Oozie suffers from a dated UI. However, for secure Hadoop clusters, Oozie is the most readily available, obvious, and full-featured solution.
Apache Ambari is a deployment and configuration management tool used to deploy Hadoop clusters. Ambari Workflow Manager is a new Ambari view that helps address the usability and UI appeal of Apache Oozie.
In this talk, we’re going to leverage the stable foundation of Apache Oozie and clarity of Workflow Manager to demonstrate how one can build powerful batch workflows on top of Apache Hadoop. We’re also going to cover future roadmap and vision for both Apache Oozie and Workflow Manager. We will finish off with a live demo of Workflow Manager in action.
Speaker
Artem Ervits, Solutions Engineer, Hortonworks
Clay Baenziger, Hadoop Infrastructure, Bloomberg
Interactive Analytics at Scale in Apache Hive Using Druid (DataWorks Summit)
Druid is an open-source analytics data store specially designed to execute OLAP queries on event data. Its speed, scalability, and efficiency have made it a popular choice to power user-facing analytic applications, including multiple BI tools and dashboards. However, Druid does not provide important features requested by many of these applications, such as a SQL interface or support for complex operations such as joins. This talk presents our work on extending Druid indexing and querying capabilities using Apache Hive. In particular, our solution makes it possible to index complex query results in Druid using Hive, query Druid data sources from Hive using SQL, and execute complex Hive queries on top of Druid data sources. We describe how we built an extension that benefits both systems, leveraging Apache Calcite to overcome the challenge of transparently generating Druid JSON queries from the input Hive SQL queries. We conclude with an experimental evaluation highlighting the performant and powerful integration of these projects.
Speaker
Jesus Camacho Rodriguez, Hortonworks
Overview of Apache Flink: the 4G of Big Data Analytics Frameworks (Slim Baltagi)
Slides of my talk at the Hadoop Summit Europe in Dublin, Ireland, on April 13th, 2016. The talk introduces Apache Flink as both a multi-purpose big data analytics framework and a real-world streaming analytics framework. It focuses on Flink’s key differentiators and its suitability for streaming analytics use cases. It also shows how Flink enables novel use cases such as distributed CEP (Complex Event Processing) and querying state by behaving like a key-value data store.
Time Series Analysis Using an Event Streaming Platform (Dr. Mirko Kämpf)
Advanced time series analysis (TSA) requires specialized data preparation procedures to convert raw data into useful, compatible formats.
In this presentation you will see some typical processing patterns for time series based research, from simple statistics to reconstruction of correlation networks.
The first case is relevant for anomaly detection and safety protection.
Reconstruction of graphs from time series data is a very useful technique to better understand complex systems like supply chains, material flows in factories, information flows within organizations, and especially in medical research.
With this motivation we will look at typical data aggregation patterns and investigate how to apply analysis algorithms in the cloud. Finally, we discuss a simple reference architecture for TSA on top of the Confluent Platform or Confluent Cloud.
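A minimal, self-contained sketch of the correlation-network reconstruction mentioned above (standard-library Python only; the example series and the 0.9 threshold are made up for illustration):

```python
# Reconstruct a correlation network: compute pairwise Pearson correlation
# between time series and keep an edge wherever |r| passes a threshold.
from math import sqrt
from itertools import combinations

def pearson(x, y):
    # Plain Pearson correlation coefficient for two equal-length series.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

series = {
    "s1": [1, 2, 3, 4, 5],
    "s2": [2, 4, 6, 8, 10],   # perfectly correlated with s1
    "s3": [5, 1, 4, 2, 3],    # only weakly related
}

# Edges of the reconstructed network: pairs with strong correlation.
edges = [(a, b) for a, b in combinations(series, 2)
         if abs(pearson(series[a], series[b])) >= 0.9]
print(edges)  # [('s1', 's2')]
```

At scale this pairwise computation would run on the streaming platform rather than in memory, but the graph-construction step is the same.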
Apache Flink 1.0: A New Era for Real-World Streaming Analytics (Slim Baltagi)
These are the slides of my talk at the Chicago Apache Flink Meetup on April 19, 2016. This talk explains how Apache Flink 1.0, announced on March 8th, 2016 by the Apache Software Foundation, marks a new era of real-time and real-world streaming analytics. The talk maps Flink’s capabilities to streaming analytics use cases.
In this session we'll look at a number of different organisations that are on their big data cybersecurity journey with Apache Metron. We'll examine the use cases they are investigating, the data sources they used, the analytics they performed, and in some cases the results they were able to find.
We'll also spend some time talking about the common themes in these projects, including common approaches to introducing Apache Metron as a phased project. We'll review some of the common pitfalls and give some concrete suggestions about the things you should (and shouldn't) do when you're getting started.
Finally, we'll tackle some of the key FAQs that come up when people first investigate using Apache Metron in the real world, based on over a year of interacting with customers and prospects as they look deeper into Apache Metron to see how it fits into their cybersecurity portfolio.
Speaker
Dave Russell, Principal Solutions Engineer, Hortonworks
Mission-Critical, Real-Time Fault-Detection for NASA's Deep Space Network usi… (Confluent)
NASA's Deep Space Network (DSN) operates spacecraft communication links for NASA deep-space spacecraft missions, including the Curiosity Rover, the Voyager twin spacecraft, Galileo, New Horizons, and others, and has done so reliably for over fifty years. The DSN Complex Event Processing (DCEP) software assembly is a new software system being deployed worldwide into NASA's DSN Deep Space Communication Complexes (DSCCs), including facilities in Spain, Australia, and the United States. The system brings into the DSN next-generation "Big Data" and "Fast Data" infrastructural tools, including Apache Kafka, for correlating real-time network data with other critical data assets, including predicted antenna pointing parameters and extensive logging of physical hardware in the DSN. The ultimate use case is to ingest, filter, store, and visualize all of the DSN's monitor and control data and to actively ensure the successful DSN tracking, ranging, and communication integrity of dozens of concurrent deep-space missions. The system is also intended to support future autonomy applications, including automated anomaly detection in real-time network monitor streams and automated reconfiguration of antenna-related assets as needed by future, increasingly autonomous spacecraft. This talk will focus on the software system behind DCEP, and introduce novel approaches to increasing NASA spacecraft link-control operator cognizance of anomalies that may and do occur during spacecraft tracking activities. This talk will also offer lessons learned, and provide a glimpse into one of the most unique, "out-of-this-world" applications of Apache Kafka.
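As a hedged illustration of the kind of automated anomaly detection the abstract mentions (this is not NASA's DCEP code; the window size, threshold, and sample readings are invented), a rolling-statistics detector over a monitor stream might look like:

```python
# Flag monitor samples that deviate from a rolling mean by more than
# k standard deviations of the preceding window.
from collections import deque
from math import sqrt

def detect(stream, window=5, k=3.0):
    buf, anomalies = deque(maxlen=window), []
    for i, x in enumerate(stream):
        if len(buf) == buf.maxlen:
            mean = sum(buf) / len(buf)
            var = sum((v - mean) ** 2 for v in buf) / len(buf)
            if var > 0 and abs(x - mean) > k * sqrt(var):
                anomalies.append(i)   # record index of anomalous sample
        buf.append(x)
    return anomalies

# A hypothetical telemetry channel with one obvious spike.
readings = [10.0, 10.1, 9.9, 10.0, 10.2, 10.1, 55.0, 10.0]
print(detect(readings))  # [6]
```

A production stream processor would run this per-channel over Kafka topics and feed alerts to operators; the rolling-window statistic is the core of the technique.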
Innovation in the Enterprise Rent-A-Car Data Warehouse (DataWorks Summit)
Big Data adoption is a journey. Depending on the business the process can take weeks, months, or even years. With any transformative technology the challenges have less to do with the technology and more to do with how a company adapts itself to a new way of thinking about data. Building a Center of Excellence is one way for IT to help drive success.
This talk will explore Enterprise Holdings Inc. (which operates the Enterprise Rent-A-Car, National Car Rental, and Alamo Rent A Car brands) and its experience with Big Data. EHI’s journey started in 2013 with Hadoop as a POC, and today it is working to create the next-generation data warehouse in Microsoft’s Azure cloud utilizing a lambda architecture.
We’ll discuss the Center of Excellence, the roles in the new world, share the things which worked well, and rant about those which didn’t.
No deep Hadoop knowledge is necessary; the talk is aimed at the architect or executive level.
Scaling People, Not Just Systems, to Take On Big Data Challenges (Matthew Vaughn)
Here, I describe how the Texas Advanced Computing Center has shifted its focus from traditional modeling and simulation towards fully embracing big data analytics performed by users with diverse technical backgrounds.
How Cyverse.org enables scalable data discoverability and re-use (Matthew Vaughn)
Cyverse.org designs, builds, and operates an innovative, integrated life sciences cyberinfrastructure. It provides data management and analysis capabilities with point and click, cloud, API, and command-line interfaces that engage users of any computing proficiency and is based on an extensible platform that integrates local and national-scale HPC, storage, and cloud resources. Cyverse directly supports thousands of users who store and access over 2PB of research data, use millions of compute hours annually, and participate in the platform's improvement, plus a secondary user community from partner projects that have built atop it. Cyverse is organized around "Data Store" and "App Catalog" services, each of which enables users to upload digital research assets that can be kept private, shared, or made public. Recently, Cyverse has been transitioning from passively enabling digital sharing towards active facilitation. It is partnering with repositories like NCBI SRA to enable direct submission from Cyverse applications, adopting commonly-used ontologies, enabling import/export of virtual machine images, developing metadata-driven persistent landing pages for data sets, and providing DOI (and other identifier) services. These new features are expected to further catalyze growth of an interoperable, interconnected network of shared research infrastructure across the biological sciences.
E-marketing study: Mobile email (maelle urban)
How can email marketing be adapted to mobile devices in order to maintain and strengthen the customer relationship?
With the spread of phones that can connect to the mobile Internet, it is becoming essential for companies to adapt their strategy to this new channel. Indeed, the mobile Internet lets users check their email, the primary reason for connecting via a mobile phone. Yet there is a multitude of mobile devices, all with different technological characteristics, and reading email on a mobile phone can therefore be laborious. An unread email is a potentially lost customer, so email must adapt quickly to this new channel.
#Bornsocial: Social media usage by under-13s (heaven)
Will the under-13 generation upend marketing on social media?
On September 27, 2016, a conference organized by the digital communication agency Heaven Conseil addressed the question of preteens on social networks and their influence on the marketing of tomorrow.
This under-13 population is particularly interesting because it is not supposed to be able to sign up for these platforms.
Most of the figures presented are exclusive and were obtained in partnership with the association Génération-Numérique.
ICAR 2015
Workshop 10 (TUESDAY, JULY 7, 2015, 4:30-6:00 PM)
The Arabidopsis information portal for users and developers
Matt Vaughn (Texas Advanced Computing Center)
Developing Apps: Exposing your data through Araport
Arabidopsis Information Portal, Developer Workshop 2014, IntroductionJasonRafeMiller
The Arabidopsis Information Portal (araport.org) is a resource for the plant genomics research community. The AIP conducts developer workshops to help other labs get involved. This presentation introduces the web site with a case study about contributing a new module built around a legacy data set.
Cask Webinar
Date: 08/10/2016
Link to video recording: https://www.youtube.com/watch?v=XUkANr9iag0
In this webinar, Nitin Motgi, CTO of Cask, walks through the new capabilities of CDAP 3.5 and explains how your organization can benefit.
Some of the highlights include:
- Enterprise-grade security - Authentication, authorization, secure keystore for storing configurations. Plus integration with Apache Sentry and Apache Ranger.
- Preview mode - Ability to preview and debug data pipelines before deploying them.
- Joins in Cask Hydrator - Capabilities to join multiple data sources in data pipelines
- Real-time pipelines with Spark Streaming - Drag & drop real-time pipelines using Spark Streaming.
- Data usage analytics - Ability to report application usage of data sets.
- And much more!
Arabidopsis Information Portal: A Community-Extensible Platform for Open DataMatthew Vaughn
Araport is an innovative model organism database resource that offers users the ability to bring their own visualizations, data sets, algorithms, and genome browser tracks and share them with their colleagues.
IoT Physical Servers and Cloud Offerings.pdfGVNSK Sravya
Introduction to Cloud Storage models
• Communication APIs
• Webserver-Web server for IoT
• Cloud for IoT
• Python web application framework
• Designing a RESTful web API.
apidays LIVE Hong Kong 2021 - Multi-Protocol APIs at Scale in Adidas by Jesus...apidays
apidays LIVE Hong Kong 2021 - API Ecosystem & Data Interchange
August 25 & 26, 2021
Multi-Protocol APIs at Scale in Adidas
Jesus de Diego, API Evangelist at Adidas
Interoperability in the Internet of Things is critical for emerging services and applications. In this presentation we advocate the use of IoT ‘hubs’ to aggregate things using web protocols, and suggest a staged approach to interoperability. In the context of a UK government funded project involving 8 IoT projects to address cross-domain IoT interoperability, we introduce the HyperCat IoT catalogue specification. We then describe the tools and techniques we developed to adapt an existing data portal and IoT platform to this specification, and provide an IoT hub focused on the highways industry called ‘Smart Streets’. Based on our experience developing this large scale IoT hub, we outline lessons learned which we hope will contribute to ongoing efforts to create an interoperable global IoT ecosystem.
Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...confluent
PayPal currently processes tens of billions of signals per day from different sources in batch and streaming mode. The data processing platform is the one powering these different analytical needs and use cases, not just at PayPal but at adjacencies like Venmo, Hyperwallet, and iZettle. End users of this platform demand access to data insights with as much flexibility as possible, with low processing latency.
One such use case is our Switchboard (data de-multiplexer) platform, where we process approximately 20 billion events daily and provide data to different teams and platforms within PayPal, as well as to platforms outside PayPal, for more insights. When we started building this platform, Kafka was just another asynchronous message-processing platform for us, but we have seen it evolve to a point where it adds value not just in terms of event processing but also for platform resiliency and scalability.
Takeaway for the audience: most people work with and have knowledge about data. With this talk I want to present information which is relevant and meaningful to the audience: information and examples which will make it easier for attendees to understand our complex system and hopefully provide some practical takeaways for using Kafka for similar problems at hand.
Tripal within the Arabidopsis Information Portal - PAG XXIIIVivek Krishnakumar
Araport plans to implement a Chado-backed data warehouse, fronted by Tripal, serving as our core database, used to track multiple versions of genome annotation (TAIR10, Araport11, etc.), evidentiary data (used by our annotation update pipeline), metadata such as publications collated from multiple sources like TAIR, NCBI PubMed, and UniProtKB (curated and unreviewed), and stock/germplasm data linked to AGI loci via their associated polymorphisms.
On-Demand Cloud Computing for Life Sciences Research and EducationMatthew Vaughn
The Jetstream cloud is a collaboration between CyVerse partners TACC and University of Arizona, University of Chicago, Johns Hopkins University, and Indiana University to bring the flexibility and ease-of-use of CyVerse Atmosphere to the entire community of science, at a much larger scale. Jetstream is a cloud resource operated as part of XSEDE, and built from two independent OpenStack clusters, each capable of supporting thousands of virtual machines and data volumes. The clusters are integrated via the user-friendly "Atmosphere" interface developed by CyVerse, with authentication enabled by Globus, and, unlike the CyVerse cloud, also offer full access to OpenStack web service APIs. Jetstream features a diverse catalog of virtual machine templates. One can launch a personal Galaxy server, do advanced biostatistics, use Matlab, or experiment with new technologies like Docker, all on Jetstream. This talk highlights the unique capabilities of Jetstream and provides information about how researchers from all over can access it.
Clouds, Clusters, and Containers: Tools for responsible, collaborative computingMatthew Vaughn
Intro slides from AKES workshop at ISMB2016. This workshop addresses the challenges and requirements for working effectively on cloud computing and high performance computing resources, discusses the key principles that should guide responsible scientific computation and collaboration, and using hands-on sessions presents practical solutions using emergent software tools that are becoming widely adopted in the global scientific community. Specifically, we will look at using “containers” to bundle software applications and their full execution environment in a portable way. We will look at managing and sharing data across distributed resources. And finally, we will tackle how to orchestrate job execution across systems and capture metadata on the results (and the process) so that parameters and methodologies are not lost.
Packaging computational biology tools for broad distribution and ease-of-reuseMatthew Vaughn
A typical instance of computational biology software is composed of interpreted code, compiled binaries, shared libraries, and shell scripts, sometimes mixed in with use of web services or databases, running in the context of a complex computer operating system, atop increasingly sophisticated physical resources. How can we expect computations to be sharable and reproducible, and how can we hope to train people to use such resources? This talk will describe how the Texas Advanced Computing Center enables distribution and use of scientific software via various approaches, including Jupyter notebooks, GitHub repositories, computation-oriented web service APIs, virtual machine images, and container technologies such as Docker, and how these approaches complement one another for training and education.
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxMAGOTI ERNEST
Although Artemia has been known to man for centuries, its use as a food for the culture of larval organisms apparently began only in the 1930s, when several investigators found that it made an excellent food for newly hatched fish larvae (Litvinenko et al., 2023). As aquaculture developed in the 1960s and '70s, the use of Artemia also became more widespread, due both to its convenience and to its nutritional value for larval organisms (Arenas-Pardo et al., 2024). The fact that Artemia dormant cysts can be stored for long periods in cans, and then used as an off-the-shelf food requiring only 24 h of incubation, makes them the most convenient, least labor-intensive live food available for aquaculture (Sorgeloos & Roubach, 2021). The nutritional value of Artemia, especially for marine organisms, is not constant, but varies both geographically and temporally. During the last decade, however, both the causes of Artemia nutritional variability and methods to improve poor-quality Artemia have been identified (Loufi et al., 2024).
Brine shrimp (Artemia spp.) are used in marine aquaculture worldwide. Annually, more than 2,000 metric tons of dry cysts are used for cultivation of fish, crustacean, and shellfish larva. Brine shrimp are important to aquaculture because newly hatched brine shrimp nauplii (larvae) provide a food source for many fish fry (Mozanzadeh et al., 2021). Culture and harvesting of brine shrimp eggs represents another aspect of the aquaculture industry. Nauplii and metanauplii of Artemia, commonly known as brine shrimp, play a crucial role in aquaculture due to their nutritional value and suitability as live feed for many aquatic species, particularly in larval stages (Sorgeloos & Roubach, 2021).
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...University of Maribor
Slides from talk:
Aleš Zamuda: Remote Sensing and Computational, Evolutionary, Supercomputing, and Intelligent Systems.
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Inter-Society Networking Panel GRSS/MTT-S/CIS Panel Session: Promoting Connection and Cooperation
https://www.etran.rs/2024/en/home-english/
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Ana Luísa Pinho
Functional Magnetic Resonance Imaging (fMRI) provides means to characterize brain activations in response to behavior. However, cognitive neuroscience has been limited to group-level effects referring to the performance of specific tasks. To obtain the functional profile of elementary cognitive mechanisms, the combination of brain responses to many tasks is required. Yet, to date, both structural atlases and parcellation-based activations do not fully account for cognitive function and still present several limitations. Further, they do not adapt overall to individual characteristics. In this talk, I will give an account of deep-behavioral phenotyping strategies, namely data-driven methods in large task-fMRI datasets, to optimize functional brain-data collection and improve inference of effects-of-interest related to mental processes. Key to this approach is the employment of fast multi-functional paradigms rich on features that can be well parametrized and, consequently, facilitate the creation of psycho-physiological constructs to be modelled with imaging data. Particular emphasis will be given to music stimuli when studying high-order cognitive mechanisms, due to their ecological nature and quality to enable complex behavior compounded by discrete entities. I will also discuss how deep-behavioral phenotyping and individualized models applied to neuroimaging data can better account for the subject-specific organization of domain-general cognitive systems in the human brain. Finally, the accumulation of functional brain signatures brings the possibility to clarify relationships among tasks and create a univocal link between brain systems and mental functions through: (1) the development of ontologies proposing an organization of cognitive processes; and (2) brain-network taxonomies describing functional specialization. 
To this end, tools to improve commensurability in cognitive science are necessary, such as public repositories, ontology-based platforms and automated meta-analysis tools. I will thus discuss some brain-atlasing resources currently under development, and their applicability in cognitive as well as clinical neuroscience.
Toxic effects of heavy metals : Lead and Arsenicsanjana502982
Heavy metals are naturally occurring metallic chemical elements that have relatively high density and are toxic even at low concentrations. All toxic metals are termed heavy metals irrespective of their atomic mass and density, e.g., arsenic, lead, mercury, cadmium, thallium, chromium, etc.
BREEDING METHODS FOR DISEASE RESISTANCE.pptxRASHMI M G
Plant breeding for disease resistance is a strategy to reduce crop losses caused by disease. Plants have an innate immune system that allows them to recognize pathogens and provide resistance. However, breeding for long-lasting resistance often involves combining multiple resistance genes.
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Sérgio Sacani
Since volcanic activity was first discovered on Io from Voyager images in 1979, changes on Io's surface have been monitored from both spacecraft and ground-based telescopes. Here, we present the highest spatial resolution images of Io ever obtained from a ground-based telescope. These images, acquired by the SHARK-VIS instrument on the Large Binocular Telescope, show evidence of a major resurfacing event on Io's trailing hemisphere. When compared to the most recent spacecraft images, the SHARK-VIS images show that a plume deposit from a powerful eruption at Pillan Patera has covered part of the long-lived Pele plume deposit. Although this type of resurfacing event may be common on Io, few have been detected due to the rarity of spacecraft visits and the previously low spatial resolution available from Earth-based telescopes. The SHARK-VIS instrument ushers in a new era of high-resolution imaging of Io's surface using adaptive optics at visible wavelengths.
ESR spectroscopy in liquid food and beverages.pptxPRIYANKA PATEL
With an increasing population, people need to rely on packaged foodstuffs. Packaging of food materials requires the preservation of food. There are various methods for treating food to preserve it, and irradiation treatment is one of them. It is the most common and most harmless method of food preservation, as it does not alter the necessary micronutrients of food materials. Although irradiated food does not cause any harm to human health, quality assessment of food is still required to provide consumers with the necessary information about it. ESR spectroscopy is the most sophisticated way to investigate the quality of food and the free radicals induced during its processing. The ESR spin-trapping technique is useful for the detection of highly unstable radicals in food. The antioxidant capability of liquid food and beverages is mainly assessed by the spin-trapping technique.
Arabidopsis Information Portal overview from Plant Biology Europe 2014
1. araport.org
Arabidopsis Information Portal: A
new approach to data sharing and
cooperative development
Matt Vaughn
Director, Life Sciences Computing
Texas Advanced Computing Center
2. araport.org
Overview
• Rationale for the AIP
• Strategic objectives
• Current state of the platform
• Data federation architecture
• Immediate future plans
• How you can participate
3. araport.org
The Rationale for AIP
• Loss of TAIR as a publicly funded shared resource for data mining and basic bioinformatics
• Centralization as a key contributing factor
– Loading of new data into database
– Development of new user experience
– Curation and annotation
– Community support mission
• AIP is designed to be de-centralized
6. araport.org
The AIP Strategy (1)
• Objectives
– Develop a community web resource
• Sustainably fundable and community-extensible
• Hosts diverse analysis & visualization tools + user data spaces
– Support federation to integrate diverse data sets from distributed data sources
– Maintain the Col-0 gold standard annotation
• Methods
– Assimilate TAIR10 data
– Host an Arabidopsis InterMine
– Develop a strategy to allow federation
– Offer and consume well-designed RESTful web services
– Interoperate with iPlant (and other projects) wherever possible
7. araport.org
The AIP Strategy (2)
• Key Design Decisions
– Centralized (but powerful) data warehousing capability PLUS infrastructure enabling data federation
– JBrowse as a genome browser platform
– WebApollo + Tripal for community annotation
– App store model for graphical data interfaces (complete with 3rd-party developer path)
– Data store model for data sources
– Accessible languages and frameworks
– Secure & modern single sign-on
– Web service access to Arabidopsis data for powerful bioinformatics
– Geo-replication and high availability
– Code re-use from other projects wherever possible
– Full code release in real time via GitHub
8. araport.org
Araport Bill of Materials
• AIP is currently built using
– InterMine*
– JBrowse 1.11.3*
– Drupal 7.25*
• Developer-oriented content management system
– Angular.js, Bootstrap.js, and other web toolkits
– Agave Software-as-a-Service platform
• Developed by the iPlant Collaborative
• Bulk data, metadata, authentication, HPC app & job management, notifications & events, and more
• OAuth2 single sign-on
– Internally-developed API manager
*With extensive customization
12. araport.org
ThaleMine
Why InterMine?
① There aren't a lot of real Arabidopsis web services
② InterMine is a scalable, extensible data warehouse
③ InterMine offers a rich, extensible web application
④ InterMine offers high-quality REST APIs
⑤ InterMine is used by other MODs
ThaleMine is an Arabidopsis-specific deployment of InterMine
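The REST APIs mentioned above follow InterMine's web-service conventions. As a minimal sketch, this is roughly what building a ThaleMine query URL could look like; the base URL, endpoint path, and model paths here are assumptions for illustration, not a documented Araport contract.

```python
from urllib.parse import urlencode

# Assumed ThaleMine service root; InterMine instances conventionally
# expose their web services under a "/service" path.
BASE = "https://apps.araport.org/thalemine/service"

def gene_query_url(symbol):
    """Build an InterMine-style path-query URL returning JSON results.
    The XML query format (model, view, constraint) follows InterMine's
    general web-service conventions."""
    query_xml = (
        f'<query model="genomic" view="Gene.primaryIdentifier Gene.symbol">'
        f'<constraint path="Gene.symbol" op="=" value="{symbol}"/>'
        f'</query>'
    )
    return BASE + "/query/results?" + urlencode(
        {"query": query_xml, "format": "json"})

print(gene_query_url("FLC"))
```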
22. araport.org
What is a Science App?
– Written in HTML/CSS/JavaScript using standard frameworks
– Presented via web browser
• Query or analyze, present, persist
– Developed by AIP and/or the community
• Deployed in AIP "app store"
• Choose which ones you want installed in your Araport "dashboard"
– Uses AIP Data Architecture
• Data services: local and remote query/retrieval
• Data integration and aggregation services
• Computation services
23. araport.org
Araport Architecture
[Architecture diagram] The Agave Enterprise Service Bus connects CLI clients, scripts, and 3rd-party applications to physical resources (HPC, files, databases) through the Agave services: apps, meta, files, profile, jobs, and systems. The Araport API Manager (enroll/manage) fronts AIP and 3rd-party data providers through API mediators (simple proxy, mediator, aggregator, filter), which provide single sign-on, throttling, unified logging, API versioning, and automatic HTTPS. Backend providers may speak REST, REST-like protocols, SOAP, POX, or CGI.
24. araport.org
Data API Design Details (1)
• 100% RESTful services
• Queries are JSON objects (conforming to a JSON schema)
• To enroll a new service in the API Manager
– Specify the mapping between AIP query fields and your service
– Map common query terms to a minimal controlled vocabulary
– Describe all service-specific parameters
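The enrollment steps above can be sketched in a few lines. Everything here is illustrative: the query fields, the controlled-vocabulary terms, and the remote parameter names are hypothetical, standing in for whatever a real enrollment would declare.

```python
import json

# Hypothetical AIP-style JSON query: common terms use a minimal
# controlled vocabulary; anything else is service-specific.
query = {
    "locus": "AT3G52430",      # common controlled-vocabulary term
    "featuretype": "SNP",      # service-specific parameter
}

# Hypothetical mapping from AIP query fields to the remote service's
# own parameter names, declared once at enrollment time.
field_map = {
    "locus": "agi_id",
    "featuretype": "variant_class",
}

def to_remote_params(query, field_map):
    """Rename AIP query fields to the remote service's names;
    unmapped fields pass through unchanged."""
    return {field_map.get(k, k): v for k, v in query.items()}

print(json.dumps(to_remote_params(query, field_map)))
```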
25. araport.org
Data API Details (2)
When field mapping isn't enough:
• Code-based transformations can be specified via
– Python
– Java
– Ruby
– JavaScript
• In technical terms, this is known as MEDIATION
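A mediator of this kind might look like the following sketch in Python: a code-based transformation that turns a legacy plain-text CSV response into JSON-ready records. The function name and sample payload are invented for illustration.

```python
import csv
import io
import json

def mediate_csv_to_json(csv_text):
    """Hypothetical mediator: convert a legacy CSV response (with a
    header row of named columns) into a list of JSON-ready dicts."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]

# Example legacy response from a plaintext CGI service (illustrative).
legacy_response = "chromosome,position,allele\nChr1,1203,A\nChr1,1456,T\n"
print(json.dumps(mediate_csv_to_json(legacy_response)))
```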
26. araport.org
Data API Details (3)
• Results returned in a standard Agave JSON format*
– status, message, result
• Result is an array of JSON objects
• These conform to specific schemas
– drafts on AIP GitHub soon for comment
*Unless there's an operational reason not to
27. araport.org
Data API Details (4)
• All Data APIs will implement:
– Count: How many records found?
– Pagination: Return only subsets
– Help: Return a usage page
– Convert: JSON (native), XML, CSV, etc.
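The count and pagination contract every Data API implements can be sketched as follows; the parameter names (`offset`, `limit`) and the response keys are assumptions, since the slides name the behaviors but not the wire format:

```python
def paginate(records, offset=0, limit=25):
    """Sketch of the Count + Pagination behaviors: report the total
    number of records found and return only the requested subset."""
    return {
        "count": len(records),                     # how many records found
        "result": records[offset:offset + limit],  # requested page
    }

page = paginate(list(range(100)), offset=10, limit=5)
print(page)
```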
28. araport.org
Araport Data Federation Architecture
• Docker.io for packaging
• Ultra-portable dev environment
• Wide language support
• Implicit security model
• Scales horizontally for performance
• Data API is a package of metadata + a Dockerfile registered with a central arbiter service
• Also used for services written natively for AIP
Objectives: facile development by end users; simple, secure deployment to AIP systems; reasonable performance
AGAVE API MANAGER
https://github.com/waltermoreira/apim
30. araport.org
End result: Araport Data API Store
curl -X GET -k -v -L -b cookies \
  "https://api.araport.org/store/site/blocks/api/listing/ajax/list.jag?action=getAllPublishedAPIs"

{
  "apis": [
    {"name": "InteractionBrowser",
     "provider": "vaughn",
     "version": "pr2-0.1",
     "context": "/data/BioAnalyticResource/interactionBrowser",
     "status": "Deployed",
     "thumbnailurl": "images/api-default.png",
     "visibility": null,
     "visibleRoles": null,
     "description": "InteractionBrowser",
     "apiOwner": "vaughn",
     "isAdvertiseOnly": false},
31. araport.org
Plans for next 3-6 months
• SNP data
• Epigenomic data via CoGe
• RNA-seq for expression and structural annotation
• AraCyc
• Co-expression data
• Orthologs, trees, alignments
• Various genomes & data sets
• Community annotation using Web Apollo and Tripal
• Interactions
• Developer support & training
32. araport.org
Feature | AIP | TAIR
GBrowse with TAIR10 data | Yes | Yes
JBrowse with TAIR10 data | Yes (also embedded in gene-info page) | No
Epigenomic tracks from EPIC | Yes | No
Affymetrix expression data | Yes (from BAR); embedded in gene-info pages | Some, but not searchable by locus
Protein interaction data | Yes (from BAR; expansion planned) | Similar data set; view through N-Browse
Gene-info/Locus-detail page (by data type): | |
  gene sequence | Yes | Yes
  CDS | Yes | Yes
  GO annotation | Yes | Yes
  PO and PATO | Legacy data (8/31/13) | Yes (8/31/13; some updates)
  Curator summary | Yes (TAIR; 8/31/13) | Yes (8/31/13; some updates)
  Computational description | Yes (TAIR; 8/31/13) | Yes (8/31/13; some updates)
  Literature | Yes; TAIR legacy, UniProt, and NCBI | Yes; NCBI + some manual curation
Flexible query interface | Yes | No
Paywall | No | Yes
BLAST services | Soon | Yes
Web services | Yes | No
Data downloads | Yes | Yes
Links to stock centers | In progress | Yes
1001 genomes SNP data | In progress | No
RNA-seq expression data | Soon | No
Updates to Col-0 sequence and annotation | Yes, from AIP |
Relationship between AIP and TAIR
As conceived and funded, AIP's mission was to be a replacement for TAIR, emphasizing computational over human curation and integrating a wider range of data types through federation. With the rebirth of TAIR through a subscription mechanism, the roles of the two data centers in the Arabidopsis data marketplace have become an evolving matter. TAIR will continue its enrichment of Col-0 annotation through literature curation, etc. AIP will continue to aggregate and integrate data through a combination of federation via web services and assimilation.
33. araport.org
Getting Involved with AIP
• User workshop at upcoming ICAR
• Formal developer engagement begins soon
– Developer discussion at the ICAR meeting in conjunction with the Araport alpha release
– SDK and tutorials available thereafter
– 2-day dev workshop in Austin in Fall* 2014
• For now, send email to araport@jcvi.org describing what you'd like to do
– We'll reach out to you to discuss feasibility and timelines via video conference
34. araport.org
Summary
• Next-generation MOD allowing community participation in its development
• Powerful interactive query and analysis functions available today
• Developing a data federation model
• New data sets and functions coming at a quick pace
• Be on the lookout for participation opportunities
35. araport.org
Chris Town, PI
Lisa McDonald, Education and Outreach Coordinator
Chris Nelson, Project Manager
Jason Miller, Co-PI, JCVI Technical Lead
Erik Ferlanti, Software Engineer
Vivek Krishnakumar, Bioinformatics Engineer
Svetlana Karamycheva, Bioinformatics Engineer
Eva Huala, Project Lead, TAIR
Bob Muller, Technical Lead, TAIR
Gos Micklem, Co-PI
Sergio Contrino, Software Engineer
Matt Vaughn, Co-PI
Steve Mock, Portal Engineer
Rion Dooley, API Engineer
Matt Hanlon, Portal Engineer
Maria Kim, Bioinformatics Engineer
Ben Rosen, Bioinformatics Analyst
Joe Stubbs, API Engineer
Walter Moreira, API Engineer
38. araport.org
Araport Architecture (2): API Manager + Enterprise Service Bus
[Architecture diagram] Legacy API A, legacy API B, and REST API C are exposed to consumer applications as secure, rationalized REST services. In between sit the mediators: simple proxies, a cache, XML-to-JSON, SOAP-to-REST, and CGI-to-REST translators, and a throttle. ThaleMine, data integration, and other services ride the same bus. The layer provides:
• Single sign-on
• Throttling
• Unified logging
• API versioning
• Mediation and translation
• Dev-friendly interfaces
• Rationalized REST for consumer apps
39. araport.org
Science Objectives
• Make more, varied data available to the Arabidopsis (and other) communities within a unified user experience
• Enhance the innate value of data by offering enhanced search, retrieval, and display capabilities
• Facilitate analysis of user data
• Enable community participation in functional annotation
40. araport.org
Technical Objectives
• Deploy a responsive, flexible, community-extensible system
• Provide APIs everywhere!
• Promote and facilitate data integration
• Enable language- and region-specific presentation of scientific content
• Meet mobile computing on its own terms
41. araport.org
Local vs. Data-driven Apps
• Local apps (e.g., Photoshop Express): resources are local and inherently offline, operating on local data using local computing.
• Data-driven apps (e.g., KAYAK Pro): resources are cloud-based and inherently online, with multiple data streams integrated, queried, and presented in the context of a broader objective.
42. araport.org
Araport Bill of Materials
• Araport is currently built using
– Drupal 7.25
• Developer-oriented content management system
– Bootstrap.js and some other JavaScript toolkits
– InterMine (with modifications)
– Bioinformatics infrastructure + misc. other bits
– Agave 2.0 Software-as-a-Service platform
• Developed by the iPlant Collaborative project
• Bulk data, metadata, authentication, HPC app and job management, notifications & events, and more
• OAuth2 out of the box
• Enterprise service bus (ESB) architecture
• http://agaveapi.co/
43. araport.org
Araport APIM Architecture (1)
[Architecture diagram] A JSON query enters the Araport API Manager (which also handles enrollment and management of services) and is routed to per-service pipelines. Each pipeline applies an input key map and input transform before sending the request to the remote service, then applies an output key map and output transform to produce the JSON response. Remote services include the POLYMORPH CGI (form input, CSV output) and REST services such as SNP-by-locus and indel-by-position. The Agave WSO2 interface fronts the pipelines, with a cache (technology TBD) and ElasticSearch alongside for performance and search.
44. araport.org
Araport Architecture: Use Cases (1)
• 1001 Genomes POLYMORPH tools
– Provides variation data via locus or positional search
– Total of seven variant types available for search
– Search parameterization depends a lot on variant type
– Example of a plain-text CGI service
– Returns results as CSV with named columns
• Objective: transform into a RESTful API that expects and returns rationalized JSON
http://polymorph.weigelworld.org
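The CSV-to-rationalized-JSON step for a service like this can be sketched as below. Both the CSV column names and the controlled-vocabulary keys in `COLUMN_MAP` are invented for illustration; POLYMORPH's real column names may differ.

```python
import csv
import io

# Hypothetical mapping from a service's named CSV columns to AIP's
# minimal controlled vocabulary (both sides illustrative).
COLUMN_MAP = {"Chromosome": "chromosome",
              "Position": "position",
              "Type": "variant_type"}

def rationalize(csv_text):
    """Turn a CSV-with-named-columns response into the rationalized
    JSON array a RESTful Data API would return."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{COLUMN_MAP.get(k, k): v for k, v in row.items()}
            for row in reader]

print(rationalize("Chromosome,Position,Type\nChr1,100,SNP\n"))
```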
45. araport.org
Araport Architecture: Use Cases (2)
• ThaleMine
– Has native REST interface for general queries
– Has templates which can form the basis of specific services
• Objective: offer both InterMine-native and AIP-conformant interfaces as Data APIs
• Current path
– Enroll native services in our APIM
– Develop template-based AIP-conformant services
46. araport.org
Data APIs: Getting Started
Service | Queries | Notes
BAR eFP | Locus |
BAR Expressologs | Locus |
BAR Interactions | Locus |
CoGe | Position | Special case: output transform only
NASC $SERVICE | Locus | SOAP-based, but may be offline permanently
OrthologFinder | Locus | Based on a ThaleMine template
POLYMORPH | Locus, Position | Actually seven CGI services
SUBA3 | Locus |
We are compiling example queries, parameter mappings and descriptions, and ideal results for use in implementing the system.
47. araport.org
Developing a Data API
• In order of preference, we'd like you to have ready:
• Well-documented REST
• Moderately well-documented REST
• SOAP services (plus WSDL or WADL)
• Plain Old XML
• Plaintext CGI
• HTML CGI
• No web services at all
• Work with us to enroll your services as a data source. This will involve a minor amount of coding.
48. araport.org
Computational App Model (1)
[Architecture diagram] Containers run via Docker.io on a CentOS 6.4 host OS with a custom repo. Each container mounts host file systems such as /scratch and /database, backed by the host FS (250 GB) and TACC Corral (PB+, reachable via sftp). Agave's apps, data, and jobs services drive the containers through a REST API exchanging JSON objects.
49. araport.org
Science Apps: Grid View
• Current scheme
– 2-3 column view with draggable apps
– Apps are normal, full-size, or collapsed
– Single app screen
• Later in 2014
– N x X grid scheme implementing resizable app "tiles" like one sees in Android or Win8.x
– App SDK libraries will have "help" for enabling resizable design
– Multiple app screens
50. araport.org
Data API Details (2)
• For service-specific parameters
– Provide human-readable names mapped to the original parameter names
– Offer minimal descriptive text
– Specify validation
• Cardinality
• Pattern validator (regex)
• Type (number, string, etc.)
– Indicate whether required
– Indicate whether they should be visible in a UI
– Specify reasonable default values
• Seems familiar?
– This approach is also used to abstract command-line apps
– Allows automatic generation of a minimally functional UI
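A parameter description along these lines might look like the following sketch; the key names and the AGI-locus regex are assumptions for illustration, not the actual AIP schema:

```python
import re

# Hypothetical parameter description in the spirit of the slide:
# human-readable name, description, validation, visibility, default.
PARAM = {
    "id": "locus",                      # original service parameter name
    "label": "Gene locus",              # human-readable name
    "description": "AGI locus identifier, e.g. AT1G01010",
    "type": "string",
    "required": True,
    "visible": True,                    # show in a generated UI
    "max_cardinality": 1,
    "validator": r"^AT[1-5CM]G\d{5}$",  # pattern validator (regex)
    "default": None,
}


def validate(param, value):
    """Check a submitted value against a parameter description."""
    if value is None:
        return not param["required"]
    if param["type"] == "string" and not isinstance(value, str):
        return False
    pattern = param.get("validator")
    return pattern is None or re.fullmatch(pattern, value) is not None
```

Because the description carries labels, types, and visibility flags, a generic form builder can render a minimally functional UI from it without service-specific code.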
51. araport.org
Data APIs: Response types (1)
• locus_relationship – pairwise relationship between A and B
– Directionality
– Type
– Array of scores (weights, etc.)
• sequence_feature – positional attribute
– Extension of the GFF model, plus:
– Build
– Attributes array
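To make the two shapes concrete, here is a sketch of what instances might look like; the field names are hypothetical and the AIP schemas may differ:

```python
# Illustrative locus_relationship: a typed, optionally directed pair
# of loci with an array of scores.
locus_relationship = {
    "locus_a": "AT1G01010",
    "locus_b": "AT4G01550",
    "direction": "undirected",          # directionality
    "relationship_type": "expressolog",
    "scores": [{"name": "weight", "value": 0.87}],
}

# Illustrative sequence_feature: the GFF columns plus a build field
# and a structured attributes array.
sequence_feature = {
    "seqid": "Chr1",
    "source": "TAIR10",
    "type": "gene",
    "start": 3631,
    "end": 5899,
    "strand": "+",
    "build": "TAIR10",                  # assembly the coordinates refer to
    "attributes": [{"key": "ID", "value": "AT1G01010"}],
}
```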
52. araport.org
Data APIs: Response types (2)
• locus_feature – key-value attributes per locus
– Optional controlled vocabulary* for keys
– Support for both slots and arrays
• raw – for returning images or other binary formats
– Source and other metadata carried in X-headers instead of the JSON result
– Outbound transformation still supported
– Not a preferred response mode
• text – returning either the native service response or a non-conformant JSON document
– Source and other metadata carried in X-headers instead of the JSON result
– Not a preferred response mode
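The out-of-band metadata idea can be sketched as follows; the header names are illustrative, not the actual AIP conventions:

```python
# Sketch of the "raw"/"text" response modes: provenance metadata
# travels in X-headers rather than in a JSON body, so the body can be
# passed through untouched. Header names are assumptions.
def raw_response(body, content_type, source_url):
    headers = {
        "Content-Type": content_type,
        "X-Source-Url": source_url,     # where the bytes came from
        "X-Response-Mode": "raw",       # flag the non-JSON mode
    }
    return headers, body
```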
53. araport.org
Data API Details (6)
• Transparent caching will compensate for transient remote service failures
• Automatic indexing of certain response types via ElasticSearch, allowing for sophisticated global search
– ElasticSearch allows us to index everything we “know about” and return it quickly
– iPlant uses it to live-index >700 TB of user data
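The transparent-caching idea, serving the last good response when a remote service hiccups, can be sketched as follows; the cache policy and names are illustrative, not the platform's actual implementation:

```python
import time

# Sketch: serve a fresh response when possible; fall back to the last
# cached copy on a transient remote failure. The max_age policy is an
# assumption for illustration.
_cache = {}  # url -> (timestamp, payload)


def cached_fetch(url, fetch, max_age=3600):
    try:
        payload = fetch(url)               # attempt the live call
        _cache[url] = (time.time(), payload)
        return payload
    except Exception:
        entry = _cache.get(url)
        if entry and time.time() - entry[0] < max_age:
            return entry[1]                # stale-but-usable fallback
        raise                              # no usable copy; surface the error
```

Because the fallback happens inside the API layer, clients see an ordinary response rather than an error when the upstream service is briefly down.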
54. araport.org
Developing an app
• Understand and document the user stories you’re addressing with your app
• Identify all requisite data sources AND
• Help us prepare them as Data APIs
– This may involve coding
• Understand the data integration or aggregation needs of your app
– This may involve coding
• Develop the user interface(s) for your app using our toolkits and suggested practices
– This will involve coding.
– But you will learn tools like jQuery, Bootstrap, & D3 and will thus be eminently employable!
Editor's Notes
Discuss the IAIC design process. Working groups, design workshop, Plant Cell whitepaper. Chris Town and I were selected to submit a proposal to realize this vision to NSF ABI. Funded Sep 1, 2013.
5 MINUTES
Upcoming release of the portal – simple, extensible
Developed in collaboration with Gos Micklem at Cambridge. They are funded by BBSRC to support this activity.
Users can search by gene locus, synonym, keyword, etc.
Shows gene locus, function, aliases, computational description and curator summary (from TAIR).
JBrowse (TAIR10 data), orthologs from other model organisms (plants will come), proteins from UniProt, GO from AmiGO
10 MINUTES
15 MINUTES
Proteomics Standards Initiative Common QUery InterfaCe (PSICQUIC)