BioDA Workshop, 8th December, 2004 at NeSC
Notes From Discussion Sessions
Session I - User Experience of OGSA-DAI
Schema Mappings – transforms or wrappers.
(Chair – Noel Kelly; Recorder – Brian Matthews)
How do you find changing the application to use OD?
- needed to produce a product quickly – to get feedback – used JDBC.
Now migrating – putting OD in front of JDBC – using federated DB2 – stored
procedures, painless to do that, but does it exercise OD? Not a lot of need for
GRID for the types of queries needed at the moment. Only advantage is if
WSDL could be exposed to client then that would be a good idea.
Component Metadata Extractor – provide a metadata name which could extract the
stored procedure. DQP extends the metadata extractor.
Ease of integration – NK – quite easy to use, after a year or so. Not using OGSA -
Globus GT2. Integration PoV – installation of OD? Had a reputation of being hard
to install, do the new wizards help? Still need to do some Jar copying, but should be
simplified – no need to modify XML files by hand. GT3 core is installed too.
Show of hands of using OD? Jaspreet – used OD a year ago – DBs with WS front
end. GDSFactories used to create GDS; major problem was dynamically searching 600 –
needed a WS for each DB – this could not be afforded (scaling problem). Access: did
OD require the DB to be on the same machine? Tom says no, this is not the fault of OD,
which allows this – MySQL config. Problems with supporting SQLServer? Now
can, with some problems with the JDBC driver. Performance? Successfully
coordinated a search over 6-7 machines.
Interop with BDWorld – wrapper – good idea to wrap a DB accessed by OD – give
the interop with BDW – also integrating OD via the BDI to the BDW architecture. In
future, a more flexible way of using XML Schemas. OD could fit in different places
in the BDW arch.
Discussed desirability of a consistent mechanism for exposing metadata.
Session II - GRID Data Access Architecture and OGSA-DAI
(Chair – Brian Matthews; Recorder – Andrew Jones)
Issues to discuss
- Interaction with Web Services
- Role of DQP
- Various mechanisms for access including XML, flat file, etc: ways of designing
a useful higher-level API
- Screen Scraping
- Role of registry
Key issues (presentation by Arijit)
- DQP uses OQL internally
- When DQP gets integrated with OGSA-DAI a number of divergences from the
present OGSA-DAI implementation will be addressed
- DQP doesn’t currently provide any security except a login password facility, but
they plan to look at this. Note that WSRF will mean new way in which security
should be added
- Platforms they want to target in the immediate future are OMII and GT4
- Not yet known how to interoperate between OMII and WS-Security
- Plan is for OGSA-DAI to have a way of fitting in with either security model.
Plan to introduce security here by release 6.
- More complex for OGSA-DQP because want to interoperate across both kinds of
platforms. Plan to bring in security (authentication) here by release 7.
- Note it’s fairly easy to switch on message-level security in Globus; this is a
Noted that the key project that is worried about security at the moment is eHTPX.
GeneGrid will be worrying about security shortly. (Noel Kelly)
General principle: OGSA-DAI to provide level of security that will support public-
sector bioinformaticians adequately (at least as good as what they would normally
rely on). (Norman Paton)
Jaspreet: what methods are being adopted with regard to semantic data integration?
Norman: OGSA-DAI doesn’t address semantic or schematic integration explicitly.
OGSA-DQP is a distributed query evaluation mechanism but not an integration
mechanism. Any schematic or semantic integration is in the user’s hands. Plan: to
allow global-as-view mappings to address primarily schematic heterogeneity (time-scale for
this not determined).
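Illustration of the global-as-view idea mentioned by Norman: the global relation is defined as a query over the source schemas, so differently named columns are reconciled in one place. This is only a minimal sketch using Python's built-in sqlite3; the table names, column names and sample rows are hypothetical and nothing here reflects how OGSA-DAI or OGSA-DQP would implement it.

```python
import sqlite3

def build_global_view(conn):
    """Create two hypothetical source tables and a global view over them."""
    cur = conn.cursor()
    # Two sources hold the same kind of data under different schemas.
    cur.execute("CREATE TABLE src_a_proteins (acc TEXT, organism TEXT)")
    cur.execute("CREATE TABLE src_b_entries (accession_no TEXT, species_name TEXT)")
    cur.execute("INSERT INTO src_a_proteins VALUES (?, ?)", ("P1", "Homo sapiens"))
    cur.execute("INSERT INTO src_b_entries VALUES (?, ?)", ("Q7", "Mus musculus"))
    # Global-as-view: the global relation is written as a query over the sources.
    # Column names of the view come from the first SELECT.
    cur.execute("""
        CREATE VIEW global_protein AS
        SELECT acc AS accession, organism AS species, 'A' AS source
          FROM src_a_proteins
        UNION ALL
        SELECT accession_no, species_name, 'B'
          FROM src_b_entries
    """)
    conn.commit()

conn = sqlite3.connect(":memory:")
build_global_view(conn)
# Users query the global schema only; the mapping hides the source differences.
rows = conn.execute(
    "SELECT accession, species, source FROM global_protein ORDER BY accession"
).fetchall()
```

In a real deployment the two sources would sit on different machines, so the view definition would have to be evaluated by a distributed query processor rather than by a single database engine.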
Straw poll: schema integration was seen as higher priority than XML access.
HTML screen scraping …
Noted Neil had mentioned a plan to integrate some kind of screen-scraping Activity,
configurable with a URL, into OGSA-DAI, so that e.g. a database could be
populated for use in OGSA-DAI. So the purpose of this is to provide the means for
linking the screen scraper into an OGSA-DAI environment. Scenario, a <perform> …
document could include transformation activities (one that is already provided is
Then Came Coffee …
Session III – Bioinformatics Requirements for Data Access and
Integration on the Grid
(Chair – Alex Gray; Recorder: Richard White)
Michael Gleaves’ lead presentation: Biotechnology techniques (used in
Bioinformatics): genomics, “transgenomics”, proteomics, structural genomics,
metabolomics, systems biology (based on the whole cell). Andrew: Plus the species
diversity level as in BDWorld. Bioinformatics’ role is to collect and interpret data and
add to knowledge. What are the next problems? Discussion of this question:
Continuing use of flat files in bioinformatics; in biodiversity level, much legacy data
and low rate of data increase; at lower levels (genomics, proteomics), no old data but
huge rate of data increase; protein structural modelling produces large data sets.
Different data sources at the different levels. Data of different types need to be
brought together. Can you find the right data at the right time? Data discovery and
Arijit: MyGrid people think an issue is to link data to its provenance data, which may
be stored elsewhere, e.g. using the LSID as a link. Use this to provide an audit trail.
MyGrid has a data model, provenance data which fits this has the highest priority.
Users rarely fill in provenance data manually – need to capture it as automatically as
possible, in the lab while the experiments are being performed [Alex], or when the
workflow is run [Arijit] in the MyGrid environment. Michael: Instrument makers are
being persuaded to generate XML provenance files as they generate the data files.
Provenance data capture afterwards is second best; at data creation is best.
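Illustration of capturing provenance at data creation: a small XML provenance record is generated together with the data and tied to it by an LSID and a checksum. This is a hedged sketch only; the element names, the authority "example.org", the namespace "spectra" and the instrument name are all hypothetical, and it is not the MyGrid data model or any instrument maker's format.

```python
import hashlib
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

def make_lsid(authority, namespace, object_id):
    """Build an identifier in the urn:lsid:authority:namespace:object form."""
    return f"urn:lsid:{authority}:{namespace}:{object_id}"

def provenance_record(data, lsid, instrument, created=None):
    """Return an XML provenance document describing `data` (bytes)."""
    created = created or datetime.now(timezone.utc)
    root = ET.Element("provenance")
    ET.SubElement(root, "dataLsid").text = lsid            # link back to the data
    ET.SubElement(root, "instrument").text = instrument    # where it came from
    ET.SubElement(root, "created").text = created.isoformat()
    # The checksum lets an audit confirm the record matches the data file.
    ET.SubElement(root, "sha256").text = hashlib.sha256(data).hexdigest()
    return ET.tostring(root, encoding="unicode")

# Record is produced at the moment the data is produced, not filled in later.
raw = b"raw instrument output"
lsid = make_lsid("example.org", "spectra", "run42")
xml_text = provenance_record(raw, lsid, "hypothetical-spectrometer")
```

Because the record carries the LSID, it can be stored elsewhere than the data and still be resolved back to it for an audit trail.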
Michael: Next point is data transfer. GridFTP is primitive. How about SRB?
Differences in approach between SRB and OGSA-DAI: Depends on your view of the
data. SRB puts more emphasis on managing and maintaining data files, OGSA-DAI
puts emphasis on data retrieval. Are your applications expecting to receive files (and
metadata)? – if so, SRB may be best. OGSA-DAI is better when you’re trying to
retrieve different types of data from different sources, and integrate it. Neil: they’re
complementary, orthogonal and parallel at the same time [chuckles]
Michael: Are the projects’ needs met by OGSA-DAI? eHTPX involves data-mining
at only two points, therefore putting more emphasis on SRB. Tom: What limitations
do the portal systems encounter, e.g. Spice. Tim: data sources searched in sequence,
not in parallel; need to throttle back rate of firing queries at the data sources.
Noel: GeneGrid finds OGSA-DAI does everything they want, with minor exceptions
of a mathematical nature, at the moment. (May need more complex joins later.)
Session IV – OGSA-DAI Road Map and Priorities
(Chair - Richard White; Recorder - Alex Gray)
Current road map and look at the question other way. What do community want from
MPA: OGSA DAI is a flexible framework which is able to support new facilities. Can
we supply test cases with the required facilities, as this gives guidance as to what the
requirements are? If we can give a good stat then it is likely to lead to a better
implementation of the requirement.
Neil: road map is changing in the distance due to input from meetings like this.
Currently reviewing architecture of OGSA DAI: improving concurrency model,
framework architecture, better definition of extensibility points.
Support for WS security profiles, stored procedures other than DB2, data transport
improvement by going beyond Grid FTP, XQUERY as a bridge, Database specific
types and SQL (virtualised resource)
Additionally – JDBC and ODBC driver for OGSA DAI, contribution process
WS-RF is yet another set of Web Service specifications
Adds WS- Resource properties
Split state and stateless services
Web services with some extensions
MPA: a whole bunch of services will come in with this change, e.g. synchronous delivery
of data results.
Globus Toolkit WS-RF core going into Apache
What is the right toolkit to use if starting now?
MPA: history shows we don’t understand distributed systems completely and there is
no simple answer to this question. Most projects do not need distributed systems.
Data integration example scenarios are helpful
(looking for common patterns distributed union and join already identified)
OGSA DQP integrated
Additionally other features
Compliance with DAIS specs
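Illustration of the two common patterns noted above, distributed union and distributed join. This is a sketch only: three in-memory sqlite3 databases stand in for separate sites, the table names and rows are hypothetical, and the join is a simple client-side hash join rather than anything OGSA-DQP is known to do.

```python
import sqlite3

def make_site(ddl, insert_sql, rows):
    """Create one in-memory database standing in for a remote site."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn

GENE_DDL = "CREATE TABLE gene (id TEXT, name TEXT)"
GENE_INS = "INSERT INTO gene VALUES (?, ?)"
site1 = make_site(GENE_DDL, GENE_INS, [("g1", "BRCA1"), ("g2", "TP53")])
site2 = make_site(GENE_DDL, GENE_INS, [("g3", "EGFR")])
site3 = make_site("CREATE TABLE expression (gene_id TEXT, level REAL)",
                  "INSERT INTO expression VALUES (?, ?)",
                  [("g1", 2.5), ("g3", 0.7)])

def distributed_union(sites, sql):
    """Run the same query at every site and concatenate the results."""
    out = []
    for conn in sites:
        out.extend(conn.execute(sql).fetchall())
    return out

def distributed_join(left_rows, right_rows, left_key=0, right_key=0):
    """Hash join of rows fetched from different sites on the given key columns."""
    index = {}
    for row in right_rows:
        index.setdefault(row[right_key], []).append(row)
    return [l + r for l in left_rows for r in index.get(l[left_key], [])]

# Union: same schema held at two sites.
genes = distributed_union([site1, site2], "SELECT id, name FROM gene ORDER BY id")
# Join: related data held at a third site, matched on the gene identifier.
levels = site3.execute("SELECT gene_id, level FROM expression").fetchall()
joined = distributed_join(genes, levels)
```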
Contribution to OGSA DAI by its user communities eg what features do you want
added. What model is best for this community Can we determine this and help OGSA
DAI identify its best architecture.
MPA: the release 6 wish list is the full list; not all of it might be entered. Team are
identifying priorities; not all will be implemented.
Performance is not a top priority but improvement is difficult to identify. Measure is:
do you tap fingers while waiting for a response?
Metadata and the extent to which it is used - What is the metadata extractor and will it
affect the applications? Metadata extractor can be used to extract information about the
structure of the data, e.g. size, number of accesses.
RJW: it is not metadata about the database structure etc.; it is metadata about the data.
MPA: is this not something users supply as additional data? We could supply this if
there is agreement as to what it is.
Metadata about stored procedures is needed to expose the stored procedure.
Provenance database. This can be supplied automatically by generation of the
information. MPA: this could be supplied to a standard by OGSA DAI, but in
future releases, not release 6. Do we need a standard for this?
What are the priorities?
Neil showed a list of features in the road map project that are on the website but have
not been discussed to assign priorities. It is on project website at
Can we supply examples of use of the identified features.
Metadata capture from lab instruments.
May be possible to support WSI and WSR in future (Neil)