The document discusses web services for bioinformatics. It notes that most computing resources in life sciences sit idle or are dominated by a few power users due to lack of awareness or difficulty of use. It promotes the use of web services via SOAP and WSDL as a standard way to programmatically access bioinformatics tools over the web. Examples are given of various tools and workflows that can be built using bioinformatics web services. Challenges including security, data types and service relocation are also discussed.
2. http://bioteam.net
Totally Unscientific Impression
The vast majority of CPU cycles (clusters, SMP machines, and grids) in the life sciences either sit idle or are dominated by a very few power users.
• Because:
– Most users aren't aware of what they have
– Or, they don't know how to use it
– Or, they've tried to use it, and it's difficult
– Or, it doesn't read their Excel data
– Or, they tried to use it last year, and it gave them incorrect results
4. Convergence
• Web interfaces, currently human-friendly, will become machine-friendly
• Data formats and interfaces will begin to standardize
• Heterogeneous platforms, applications, and systems will begin to interoperate
• Machines will begin to communicate with each other in profound and powerful new ways.
5. Computing For Science
• Many user models
• Many applications, mostly open source,
some quite proprietary
• Cooperative, collaborative, yet competitive
• Compute and data intensive
• Rapid rate of growth / change
• There is no single solution.
Many skill levels: Physicist -> MD
7. Core Problems
• Distribution: data and applications are created and controlled by autonomous groups all over the world
• Biology is difficult and messy: large collections of data, many data types, and tools developed in a massively distributed environment
• Research code is different from business code: rapid development, flexibility, “interactive” development
8. Web Services
The World Wide Web is increasingly used for application-to-application communication. The programmatic interfaces made available this way are referred to as Web Services.
• WSDL (advertisement)
– Machine readable
– An “interface contract” defining what services are available via a particular server
• SOAP (access)
– Independent of platform, language, and transport protocol
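A SOAP request is just structured XML, which is what makes it platform- and language-independent. As a minimal sketch (in Python, with a made-up `urn:example:inquiry` namespace; the `blastall_simple` method name is taken from the client example later in this deck), building such an envelope looks like:

```python
# Minimal sketch of the SOAP 1.1 message an iNquiry-style client would POST.
# The service namespace URI below is a placeholder assumption, not the real
# service contract; only the envelope structure is standard.
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
SVC_NS = "urn:example:inquiry"  # hypothetical target namespace

def soap_request(method: str, params: dict) -> bytes:
    """Serialize an rpc-style method call as a SOAP 1.1 envelope."""
    ET.register_namespace("soapenv", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SVC_NS}}}{method}")
    for name, value in params.items():
        ET.SubElement(call, name).text = str(value)
    return ET.tostring(envelope, xml_declaration=True, encoding="utf-8")

msg = soap_request("blastall_simple", {"blastall": "blastn", "query": "seq1"})
```

Any language that can emit and parse XML over HTTP can produce this message, which is the whole point of the standard.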
9. Why Web Services?
• Why not?
– CORBA, RMI, bytecodes, relocatable libraries, The Grid, opportunistic computing, metacomputing …
• Selfish benefit to both publishers and users
– Easy publishing (no interface needed)
– Choice of client (from the command line to integrated workflow environments)
– Minimal buy-in
11. Bioinformatic Web Services
• EBI: Soaplab, EMBOSS, Ensembl, …
• KEGG: pathway data
• GO: gene ontologies
• BioMOBY: objects for modeling data
• NCBI: NetBlast
• iNquiry: clustered tools
As more organizations adopt common standards, those standards become more valuable.
12. The BioTeam
• Consulting company: scientists, developers, IT professionals
• Expertise:
– Scientific, parallel, distributed computing
– Infrastructure
– Optimization
13. BioTeam’s iNquiry
• iNquiry is two things:
– “Instant” cluster deployment kit
• Scheduler, Web Browser, integrated configuration
– Web portal for Bioinformatics
• 170+ applications pre-installed
• HTML interface
• SOAP / Web Services interface, integrated with Cluster tools
• OS X / Apple, HP, Linux, SGI, Orion Multisystems
• 190+ installations worldwide
– 170+ are Apple
– 2 -> 240 nodes
14. iNquiry (2004)
• All interfaces defined by “PISE” XML documents
– /usr/local/lib/Pise/5.a/Xml
– Other files created by scripts
(Diagram: PISE XML generates the CGI scripts, Perl modules, and PISE scripts behind the HTML interface, which drives the cluster)
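The "interfaces generated from XML documents" idea can be sketched generically. The XML below is a made-up, minimal tool description (not the actual PISE schema) just to show the shape of the generation step:

```python
# Illustrative sketch only: turn a tiny, invented tool description into an
# HTML form, the way iNquiry derives its web interfaces from PISE XML.
# The <tool>/<param> schema here is hypothetical, not real PISE markup.
import xml.etree.ElementTree as ET

TOOL_XML = """
<tool name="blastall">
  <param name="query" label="Query sequence"/>
  <param name="protein_db" label="Protein database"/>
</tool>
"""

def form_from_xml(xml_text: str) -> str:
    """Generate one HTML input per declared parameter."""
    tool = ET.fromstring(xml_text)
    rows = [
        f'<label>{p.get("label")}: <input name="{p.get("name")}"/></label>'
        for p in tool.findall("param")
    ]
    return f'<form action="/run/{tool.get("name")}">' + "".join(rows) + "</form>"

html = form_from_xml(TOOL_XML)
```

The payoff of this design is that the same XML document can drive several front ends (HTML, CGI, SOAP) without re-describing the tool each time.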
16. iNquiry Web Services
• Released, summer 2004
• Actually in use at Novartis, BMS, VBI
• Called from Perl, Java, Taverna, InforSense, Pipeline Pilot, VIBE, Apple Automator, AppleScript, …
(Diagram: the same PISE-generated stack, HTML through cluster, now fronted by a SOAP interface advertised via WSDL)
19. What Web Services Do Not Do
• Semantics
– Service ‘X’ must still be interpreted and used in some context.
– No OMG-like object model imposed by default!
– In bioinformatics, other related projects (BioMOBY, etc.) attempt to deal with semantic issues.
20. What Web Services Do
• Standard interface to arbitrary resources
• Allow someone else to write the interface
• Allow someone else to build the infrastructure
Completely split the interface from the service provision: divide and conquer.
21. PERL Web Service Client
# Asynchronous blastall submission via SOAP::Lite; $server is a client
# already bound to the iNquiry service endpoint.
use SOAP::Lite;

$res = $server->blastall_simple(
    SOAP::Data->name("TICKET")->value($ticket),    # job handle, for reattaching later
    SOAP::Data->name("BLOCKING")->value(0),        # return immediately rather than wait
    SOAP::Data->name("blastall")->value("blastn"),
    SOAP::Data->name("query")->value($query_id),
    SOAP::Data->name("protein_db")->value("yeast.nt"),
    SOAP::Data->name("nucleotid_db")->value("yeast.nt"),
    SOAP::Data->name("tmp_outfile")->value($query_id . ".blastx")
);
31. Stumbling Blocks
• Pass by reference (URL)
– SOAP data bloat
– MIME encode / decode
• System security
– Inadvertent DoS attacks are easy
• Blocking / Timeouts
– Reattach
• Complex Data Types
• Service Relocation
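The "SOAP data bloat" and MIME points are concrete: binary payloads embedded in a SOAP body must be encoded (typically base64), inflating them by roughly a third before any envelope overhead, which is one reason large results are better passed by reference (URL). A quick sketch:

```python
# Demonstrate base64 inflation: embedding binary data in an XML/SOAP body
# requires an encoding such as base64, which grows payloads by ~33%.
import base64

payload = bytes(range(256)) * 1024          # 256 KiB of binary "result" data
encoded = base64.b64encode(payload)

ratio = len(encoded) / len(payload)
print(f"{len(payload)} bytes -> {len(encoded)} bytes ({ratio:.2f}x)")
# prints "262144 bytes -> 349528 bytes (1.33x)"
```

For multi-gigabyte sequence databases, that overhead (plus the decode on the far side) makes a URL reference the only practical choice.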
32. Plan For Failure
• Miron Livny (U. Wisconsin, Madison)
– Condor project: 20+ years of distributed computing
– Management (pessimistic) rather than engineering (optimistic) assumptions:
• Scheduling is complete when the job finishes, not when it starts.
• Double-check all results.
• Assume each element will fail.
• Double-schedule the critical path.
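Those pessimistic assumptions map directly onto client code. As a generic sketch (plain Python, not tied to any particular scheduler), a wrapper that assumes each attempt may fail, retries, and double-checks the result before declaring the job complete might look like:

```python
# Pessimistic job wrapper: assume each attempt can fail, retry a bounded
# number of times, and only declare success once an independent check of
# the result passes. `run` and `check` are caller-supplied stand-ins for
# "submit to the cluster" and "double-check all results".
def run_until_verified(run, check, max_attempts=3):
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            result = run()
        except Exception as exc:        # the job itself failed; try again
            last_error = exc
            continue
        if check(result):               # scheduling is complete only now
            return result
        last_error = ValueError(f"attempt {attempt}: result failed verification")
    raise RuntimeError(f"no verified result after {max_attempts} attempts") from last_error

# Flaky stand-in for a cluster job: fails twice, then succeeds.
attempts = {"n": 0}
def flaky_job():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise IOError("node lost")
    return "ACGT"

result = run_until_verified(flaky_job, lambda r: r == "ACGT")
```

The same shape extends naturally to double-scheduling: submit the critical path twice and take whichever verified result arrives first.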
33. Users (Research) are the Point
• Maximize user freedom
– Let users help each other:
• Shared repository of workflows, code, etc.
• Mailing lists, chat rooms
– If at all possible, provide source code
– The key problems are social / managerial
• Technical issues are simple by comparison.
• Include all possible resources
– Never try to get in the way of your users
Assume that users know what they’re doing.
34. Take Home
• Biology is difficult and messy
• IT and HPC are difficult and messy
• Federate, don’t integrate (divide and conquer)
• Web Services (WSDL and SOAP) are the standard of choice.
• If your resources are sitting idle, there is a problem, and it’s not the users.
35. Thank You
• Early adopters (iNquiry web services):
– Nathan Siemers (Bristol-Myers Squibb)
– John Davies, Jeremy Jenkins (Novartis IBR)
– Dustin Machai (VBI)
– Tim Kunau*, Michael Heuer (CCGB, University of Minnesota)
• Collaborators & Partners:
– Tom Oinn (Taverna), SciTegic, InforSense
• The BioTeam
– Michael Athanas, Chris Dagdigian, Stan Gloss, Bill Van Etten, Jiesheng Zhang
• Bio-IT World / Life Sciences Expo