SlideShare a Scribd company logo
1 of 64
Download to read offline
Managing Scientific Data: From Data
Integration to Scientific Workflows
Bertram Ludäscher
ludaesch@ucdavis.edu
San Diego
Supercomputer Center
Associate Professor
Dept. of Computer Science & Genome Center
University of California, Davis
Fellow
San Diego Supercomputer Center
University of California, San Diego
UC DAVIS
Department of
Computer Science
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Outline
• Data Integration & Mediation
• Challenges with Scientific Data
• Knowledge-based Extensions &
Ontologies
• Scientific Workflows
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
An Online Shopper’s Information Integration Problem
El Cheapo: “Where can I get the cheapest copy (including shipping cost) of
Wittgenstein’s Tractatus Logicus-Philosophicus within a week?”
?
Information
Integration
addall.com
“One-World”
Mediation
amazon.com A1books.comhalf.combarnes&noble.com
Mediator (virtual DB)
(vs. Datawarehouse)
NOTE: non-trivial
data engineering challenges!
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
A Home Buyer’s Information Integration Problem
What houses for sale under $500k have at least 2 bathrooms, 2 bedrooms,
a nearby school ranking in the upper third, in a neighborhood
with below-average crime rate and diverse population?
?
Information
Integration
Realtor DemographicsSchool RankingsCrime Stats
“Multiple-Worlds”
Mediation
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
A Neuroscientist’s Information
Integration Problem
What is the cerebellar distribution of rat proteins with more than
70% homology with human NCS-1? Any structure specificity?
How about other rodents?
?
Information
Integration
protein localization
(NCMIR)
neurotransmission
(SENSELAB)
sequence info
(CaPROT)
morphometry
(SYNAPSE)
“Complex
Multiple-Worlds”
Mediation
Biomedical Informatics
Research Network
http://nbirn.net
Inter-source links:
• unclear for the non-scientists
• hard for the scientist
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Interoperability & Integration Challenges
• System aspects: “Grid” Middleware
• distributed data & computing, SOA
• resource discovery, authentication, authorization
• web services, WSDL/SOAP, WSRF, OGSA, …
• (re-)sources = services, files, data sets, nodes …
• Syntax & Structure:
(XML-Based) Data Mediators
• wrapping, restructuring
• (XML) queries and views
• sources = (XML) databases
• Semantics:
Model-Based/Semantic Mediators
• conceptual models and declarative views
• Knowledge Representation: ontologies,
description logics (RDF(S),OWL ...)
• sources = knowledge bases (DB+CMs+ICs)
• Synthesis: Scientific Workflow Design &
Execution
• Composition of declarative and procedural
components into larger workflows
• (re)sources = services, processes, actors, …
Ø reconciling S5
heterogeneities
Ø “gluing” together resources
Ø bridging information and
knowledge gaps
computationally
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Information Integration Challenges:
S4 Heterogeneities
• System aspects
– platforms, devices, data & service distribution, APIs, protocols, …
è Grid middleware technologies
+ e.g. single sign-on, platform independence, transparent use of remote
resources, …
• Syntax & Structure
– heterogeneous data formats (one for each tool ...)
– heterogeneous data models (RDBs, ORDBs, OODBs, XMLDBs, flat files, …)
– heterogeneous schemas (one for each DB ...)
è Database mediation technologies
+ XML-based data exchange, integrated views, transparent query rewriting, …
• Semantics
– descriptive metadata, different terminologies, “hidden” semantics (context),
implicit assumptions, …
è Knowledge representation & semantic mediation technologies
+ “smart” data discovery & integration
+ e.g. ask about X (‘mafic’); find data about Y (‘diorite’); be happy anyways!
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Information Integration Challenges:
S5 Heterogeneities
• Synthesis of applications, analysis tools, data & query
components, … into “scientific workflows”
– How to put together components to solve a scientist’s problem?
è Scientific Problem Solving Environments (PSEs)
è Portals, Workbench (“scientist’s view”)
+ ontology-enhanced data registration, discovery, manipulation
+ creation and registration of new data products from existing ones,
…
è Scientific Workflow System (“engineer’s view”)
+ for designing, re-engineering, deploying analysis pipelines and
scientific workflows; a tool to make new tools …
+ e.g., creation of new datasets from existing ones, dataset
registration, …
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Information Integration from a
Database Perspective
• Information Integration Problem
– Given: data sources S1, ..., Sk (databases, web sites, ...) and user
questions Q1,..., Qn that can –in principle– be answered using the
information in the Si
– Find: the answers to Q1, ..., Qn
• The Database Perspective: source = “database”
Þ Si has a schema (relational, XML, OO, ...)
Þ Si can be queried
Þ define virtual (or materialized) integrated (or global) view G over
local sources S1 ,..., Sk using database query languages (SQL,
XQuery,...)
Þ questions become queries Qi against G(S1,..., Sk)
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
5. Post processing2. Query rewriting
Standard (XML-Based) Mediator Architecture
MEDIATOR
Integrated Global
(XML) View G
Integrated View
Definition
G(..)¬ S1(..)…Sk(..)
USER/Client
1. Query Q ( G (S1,..., Sk) )
S1
Wrapper
(XML) View
S2
Wrapper
(XML) View
Sk
Wrapper
(XML) View
web services as
wrapper APIs
3. Q1 Q2 Q3
4. {answers(Q1)} {answers(Q2)} {answers(Q3)}
6. {answers(Q)}
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Query Planning in Data Integration
• Given:
– Declarative user query Q: answer(…) :– …G ...
– … & { G :– … L … } global-as-view (GAV)
– … & { L :– … G … } local-as-view (LAV)
– … & { ic(…) :– … L … G… } integrity constraints (ICs)
• Find:
– equivalent (or minimal containing, maximal contained)
query plan Q’: answer(…) :– … L …
è query rewriting (logical/calculus, algebraic, physical
levels)
• Results:
– A variety of results/algorithms; depending on classes of
queries, views, and ICs: P, NP, … , undecidable
– hot research area in core CS (database community)
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Scientific Data Integration using
Semantic Extensions
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Example: Geologic Map Integration
• Given:
– Geologic maps from different state geological
surveys (shapefiles w/ different data schemas)
– Different ontologies:
• Geologic age ontology (e.g. USGS)
• Rock classification ontologies:
– Multiple hierarchies (chemical, fabric, texture, genesis) from
Geological Survey of Canada (GSC)
– Single hierarchy from British Geological Survey (BGS)
• Problem:
– Support uniform queries across all map
– … using different ontologies
– Support registration w/ ontology A, querying w/
ontology B
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Integration
Schema
Schema Integration
(“registering” local schemas to the global schema)
Arizona
Colorado
Utah
Nevada
Wyoming
New Mexico
Montana E.
Idaho
Montana West
Formation …
Age …
Formation …
Age …
Formation …
Age …
Formation …
Age …
Formation …
Age …
Formation …
Age …
Formation …
Age …
… Formation
… Age
… Composition
… Fabric
… Texture
… Formation
… Age
… Composition
… Fabric
… Texture
ABBREV
PERIOD
PERIOD
NAME
PERIOD
TYPE
TIME_UNIT
FMATN
PERIOD
NAME
PERIOD
NAME
FORMATION
PERIOD
FORMATION
FORMATION
LITHOLOGY
LITHOLOGY
AGE
AGE
andesitic sandstone
Livingston formation
Tertiary-
Cretaceous
Sources
Sources
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Multihierarchical Rock Classification “Ontology”
(Taxonomies) for “Thematic Queries” (GSC)
Composition
Genesis
Fabric
Texture
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Ontology-Enabled Application Example:
Geologic Map Integration
Show
formations
where AGE =
‘Paleozic’
(without age
ontology)
Show
formations
where AGE
= ‘Paleozic’
(with age
ontology)
+/- a few hundred
million years
domain
knowledge
Nevada
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Querying by Geologic Age …
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Querying by Geologic Age: Results
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Querying by Chemical Composition …
(GSC)
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Semantic Mediation (via “semantic registration” of
schemas and ontology articulations)
• Schema elements and/or data values are associated with concept
expressions from the target ontology
è conceptual queries “through” the ontology
• Articulation ontology
è source registration to A, querying through B
• Semantic mediation: query rewriting w/ ontologies
Database1
Database2
Ontology A
Ontology B
semantic
registration
ontology
articulations
Concept-based
(“semantic”)
queries
semantic
registration
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Different views on State Geological Maps
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Sedimentary Rocks: BGS Ontology
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Sedimentary Rocks: GSC Ontology
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Implementation in OWL: Not only “for the machine” …
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Source Contextualization
through Ontology Refinement
In addition to registering
(“hanging off”) data relative to
existing concepts, a source
may also refine the mediator’s
domain map...
Þ sources can register new
concepts at the mediator ...
Scientific Workflows
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Motivation:
Scientific Workflows, Pre-Cyberinfrastructure
• Data Federation & Grid “Plumbing”:
– access, move, replicate, query … data (Data-Grid)
• authenticate … SRB Sget/Sput … OPeNDAP, … Antelope/ORBs
– schedule, launch, monitor jobs (Compute-Grid)
• Globus, Condor, Nimrod, APST, …
• Data Integration:
– Conceptual querying & integration, structure & semantics, e.g. mediation w/
SQL, XQuery + OWL (Semantics-enabled Mediator)
• Data Analysis, Mining, Knowledge Discovery:
– manual/textbook (e.g. ternary diagrams), Excel, R, simulations, …
• Visualization:
– 3-D (volume), 4-D (spatio-temporal), n-D (conceptual views) …
è one-of-a-kind custom apps., detached (island) solutions
è workflows are hard to reproduce, maintain
è no/little workflow design, automation, reuse, documentation
è need for an integrated scientific workflow environment
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
What is a Scientific Workflow (SWF)?
• Model the way scientists work with their data and tools
– Mentally coordinate data export, import, analysis via software systems
• Scientific workflows emphasize data flow (≠ business workflows)
• Metadata (incl. provenance info, semantic types etc.) is crucial for
automated data ingestion, data analysis, …
• Goals:
– SWF automation,
– SWF &
component reuse,
– SWF design &
documentation
– making
scientists’ data
analysis and
management
easier!
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Some Scientific Workflow Features
• Typical requirements/characteristics:
– data-intensive and/or compute-intensive
– plumbing-intensive
– dataflow-oriented
– distribution (data, processing)
– user-interaction “in the middle”, …
– … vs. (C-z; bg; fg)-ing (“detach” and reconnect)
– advanced programming constructs (map(f), zip, takewhile, …)
– logging, provenance, “registering back” (intermediate) products
– …
• … easy to recognize a SWF when you see one!
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Promoter Identification Workflow (Napkin Drawing)
Source: Matt Coleman (LLNL)
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Promoter Identification Workflow in Kepler
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Ecology: Invasive Species Prediction
(Napkin Drawing)
Training
sample
(d)
GARP
rule set
(e)
Test sample
(d)
Integrated
layers
(native range) (c)
Species
presence &
absence points
(native range)
(a)EcoGrid
Query
EcoGrid
Query
Layer
Integration
Layer
Integration
Sample
Data
+A3
+A2
+A1
Data
Calculation
Map
Generation
Validation
User
Validation
Map
Generation
Integrated layers
(invasion area) (c)
Species presence
&absence points
(invasion area) (a)
Native
range
predictio
n
map (f)
Model quality
parameter (g)
Environmental
layers (native
range) (b)
Generate
Metadata
Archive
To Ecogrid
Registered
Ecogrid
Database
Registered
Ecogrid
Database
Registered
Ecogrid
Database
Registered
Ecogrid
Database
Environmental
layers (invasion
area) (b)
Invasion
area
prediction
map (f)
Model quality
parameter (g)
Selected
predictio
n
maps (h)
Source: NSF SEEK (Deana Pennington et. al, UNM)
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Ecological Niche Modeling in Kepler
(200 to 500 runs per species
x
2000 mammal species
x
3 minutes/run)
=
833 to 2083 days
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
GEON Analysis Workflow in KEPLER
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Commercial & Open Source Scientific Workflow and
(Dataflow) Systems & Problem Solving Environments
Kensington Discovery
Edition from InforSense
Taverna
Triana
SciRUN II
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
see!
try!
read!
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/
Our Starting
Point:
Ptolemy II
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Why Ptolemy II ?
• Ptolemy II Objective:
– “The focus is on assembly of concurrent components. The key underlying
principle in the project is the use of well-defined models of computation that
govern the interaction between components. A major problem area being addressed
is the use of heterogeneous mixtures of models of computation.”
• Dataflow Process Networks w/ natural support for abstraction,
pipelining (streaming) actor-orientation, actor reuse
• User-Orientation
– Workflow design & exec console (Vergil GUI)
– “Application/Glue-Ware”
• excellent modeling and design support
• run-time support, monitoring, …
• not a middle-/underware (we use someone else’s, e.g. Globus, SRB,
…)
• but middle-/underware is conveniently accessible through actors!
• PRAGMATICS
– Ptolemy II is mature, continuously extended & improved, well-documented (500+pp)
– open source system
– many research results
– Ptolemy II participation in Kepler
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
KEPLER/CSP: Contributors, Sponsors, Projects
Ilkay Altintas SDM, NLADR, Resurgence, EOL, …
Kim Baldridge Resurgence, NMI
Chad Berkley SEEK
Shawn Bowers SEEK
Terence Critchlow SDM
Tobin Fricke ROADNet
Jeffrey Grethe BIRN
Christopher H. Brooks Ptolemy II
Zhengang Cheng SDM
Dan Higgins SEEK
Efrat Jaeger GEON
Matt Jones SEEK
Werner Krebs, EOL
Edward A. Lee Ptolemy II
Kai Lin GEON
Bertram Ludaescher SDM, SEEK, GEON, BIRN, ROADNet
Mark Miller EOL
Steve Mock NMI
Steve Neuendorffer Ptolemy II
Jing Tao SEEK
Mladen Vouk SDM
Xiaowen Xin SDM
Yang Zhao Ptolemy II
Bing Zhu SEEK
•••
Ptolemy II
www.kepler-project.org
LLNL, NCSU, SDSC, UCB, UCD, UCSB,
UCSD, U Man… Utah,…, UTEP, …, Zurich
Collab. tools: IRC, cvs, skype, Wiki: hotTopics, FAQs, ..
Ptolemy II
SPA
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
GEON Dataset Generation & Registration
(and co-development in KEPLER)
Xiaowen (SDM)
Edward et al.(Ptolemy)
Yang (Ptolemy)
Efrat
(GEON)
Ilkay
(SDM)
SQL database access (JDBC)
Matt et al.
(SEEK)
% Makefile
$> ant run
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Web Services è Actors (WS Harvester)
1
2
3
4
è “Minute-made” (MM) WS-based application integration
• Similarly: MM workflow design & sharing w/o implemented components
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Some KEPLER Actors (out of 160+ … and counting…)
This image cannot currently be displayed.
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Kepler Today: Some Numbers
• #Actors:
– Kepler: ~160 new + ~120 inherited (PTII)
– soon there can be thousands (harvested from web
services, R packages, etc.)
• #Developers:
– ~ 24+, ~10 very active; more coming… (we think :-)
• #CVS Repositories: ~2
– hopefully not increasing… :-{
• # “Production-level” WFs:
– currently ~8, expected to increase quite a bit …
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
KEPLER Tomorrow
• Application-driven extensions (here: SDM):
– access to/integration with other IDMAF components
• PnetCDF?, PVFS(2)?, MPI-IO?, parallel-R?, ASPECT?, FastBit, …
– support for execution of new SWF domains
• Astrophysics, Fusion, ….
• Further generic extensions:
– addtl. support for data-intensive and compute-intensive workflows (all SRB
Scommands, CCA support, …)
– semantics-intensive workflows
– (C-z; bg; fg)-ing (“detach” and reconnect)
– workflow deployment models
– distributed execution
• Additional “domain awareness” (esp. via new directors)
– time series, parameter sweeps, job scheduling (CONDOR, Globus, …)
– hybrid type system with semantic types (“Sparrow” extensions)
• Consolidation
– More installers, regular releases, improved usability, documentation, …
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
A User’s Wish List
• Usability
• Closing the “lid” (cf. vnc)
• Dynamic plug-in of actors (cf. actor & data
registries/repositories)
• Distributed WF execution
• Collection-based programming
• Grid awareness
• Semantics awareness
• WF Deployment (as a web site, as a web service, …)
• “Power apps” (è SciRUN II)
• …
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
hand-crafted control
solution; also: forces
sequential execution!
designed to fit
designed to fit
hand-crafted
Web-service actor
Complex backward
control-flow
No data transformations
available
[Altintas-et-al-PIW-SSDBM’03]
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
A Scientific Workflow Problem:
More Solved (Computer Scientist’s view)
• Solution based on declarative,
functional dataflow process
network
(= also a data streaming model!)
• Higher-order constructs:
map(f)
Þ no control-flow spaghetti
Þ data-intensive apps
Þ free concurrent execution
Þ free type checking
Þ automatic support to go from
piw(GeneId) to
PIW :=map(piw) over [GeneId]
map(f)-style
iterators
Powerful type
checking
Generic, declarative
“programming”
constructs
Generic data
transformation actors
Forward-only, abstractable sub-
workflow piw(GeneId)
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
A Scientific Workflow Problem:
Even More Solved (domain&CS coming together!)
map(GenbankWS)
Input: {“NM_001924”, “NM020375”}
Output: {“CAGT…AATATGAC",“GGGGA…CAAAGA“}
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Research Problem: Optimization by Rewriting
• Example: PIW as a declarative,
referentially transparent functional
process
Þ optimization via functional rewriting possible
e.g. map(f o g) = map(f) o map(g)
• Technical report &PIW specification in Haskell
map(f o g)
instead of
map(f) o map(g)
Combination of
map and zip
http://kbis.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Job Management (here: NIMROD)
• Job management infrastructure in place
• Results database: under development
• Goal: 1000’s of GAMESS jobs (quantum mechanics)
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
A Distributed Approach
Client
Servers
Computer Network
Service Locator
(Peer Discovery)
Simulation is
orchestrated in a
centralized manner
Source: Daniel Lázaro Cuadrado, Aalborg University
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Separation of Concerns
• A shining example:
– Ptolemy Directors – “factoring out” the concern of
workflow “orchestration” (MoC)
– common aspects of overall execution not left to the
actors
• Similarly:
– The “Black Box” (“flight recorder”)
• a kind of “recording central” to avoid wiring
100’s of components to recording-actor(s)
– The “Red Box” (error handling, fault tolerance)
• ………
– The “Yellow Box” (type checking)
• ………
– The “Blue Box” (shipping-and-handling)
• central handling of data transport (by value, by
reference, by scp, SRB, GridFTP, …)
SDF/PN/DE/…
Recorder
SHA @
Static Analysis
On Error
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Separation of Concerns: Port Types
• Token consumption (& production) “type”
– a director’s concern
• Token “transport type”
– by value, reference (which one), protocol (SOAP, scp,
GridFTP, scp, SRB, …)
– a SHA concern
• Structural and semantic types
– SAT (static analysis & typing) concern
– built after static unit type system…
• static unit type system as a special case!?
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
• Label data with semantic types (concept expressions from an ontology)
• Label inputs and outputs of analytical components with semantic types
Example: Data has COUNT and AREA; workflow wants DENSITY
• è via ontology, system “knows” that data can still be used
(because DENSITY := COUNT/AREA)
• Use reasoning engines to generate transformation steps
• Use reasoning engine to discover relevant components
Need for Semantic Annotations of data & actors
Data Ontology Workflow Components
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
A Scientist’s “Semantic” View of Actors
S1
(life stage property)
S2
(mortality rate
for period)
P1
P2
P4
P3 P5
Phase Observed Period Phases
Eggs
Instar I
Instar II
Instar III
Instar IV
Adults
44,000
3,513
2,529
1,922
1,461
1,300
Nymphal {Instar I, Instar II, Instar III, Instar IV}
Population samples for life stages of the
common field grasshopper [Begon et al, 1996]
Periods of development in terms of phases
life stage periods
k-value for each period
of observation
[(nymphal, 0.44)]
observations
Source: [Bowers-Ludaescher, DILS’04]
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Structural Type (XML DTD) Annotations
S1
(life stage property)
S2
(mortality rate
for period)
P1
P2
P4
P3 P5
root population = (sample)*
elem sample = (meas, lsp)
elem meas = (cnt, acc)
elem cnt = xsd:integer
elem acc = xsd:double
elem lsp = xsd:string
<population>
<sample>
<meas>
<cnt>44,000</cnt>
<acc>0.95</acc>
</meas>
<lsp>Eggs</lsp>
</sample>
…
<population>
root cohortTable = (measurement)*
elem measuremnt = (phase, obs)
elem phase = xsd:string
elem obs = xsd:integer
<cohortTable>
<measurement>
<phase>Eggs</cnt>
<obs>44,000</acc>
</measurement>
…
<cohortTable>
structType(P2) structType(P3)
Source: [Bowers-Ludaescher, DILS’04]
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
A KR+DI+Scientific Workflow Problem
• Services can be semantically compatible, but structurally
incompatible
Source
Service
Target
Service
Ps Pt
Semantic
Type Ps
Semantic
Type Pt
Structural
Type Pt
Structural
Type Ps
Desired Connection
Incompatible
Compatible
(⋠)
(⊑)
d(Ps)
d (≺)
Ontologies (OWL)
Source: [Bowers-Ludaescher, DILS’04]
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
The Ontology-Driven Framework
Source
Service
Target
Service
Ps Pt
Semantic
Type Ps
Semantic
Type Pt
Structural
Type Pt
Structural
Type Ps
Desired Connection
Compatible (⊑)
Registration
Mapping (Output)
Registration
Mapping (Input)
Correspondence
Ontologies (OWL)
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Ontology-Guided Data Transformation
Source
Service
Target
Service
Ps Pt
Semantic
Type Ps
Semantic
Type Pt
Structural
Type Pt
Structural
Type Ps
Desired Connection
Compatible (⊑)
Structural/Semantic
Association
Structural/Semantic
Association
Correspondence
Generate d(Ps)
Ontologies (OWL)
Transformation
Source: [Bowers-Ludaescher, DILS’04]
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Kepler Actor-Library w/ Concept Index
• How do you find the right component (actor)?
è Ontology-based actor organization / browsing
è Simple text-based and concept-based searching
• Next: ontology-based workflow design
Workflow
Components
(MoML)
Ontologies
(OWL)
Default + Other
Semantic
Annotations
urn ids
instance
expressions
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Scientific Workflow Design
• Support SWF design & reuse, via:
– Structural data types
– Semantic types
– Associations (=constraints) between
them
– Type checking, inference,
propagation
èSeparation of concerns:
– structure, semantics, WF
orchestration, etc.
Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
Q & A

More Related Content

Similar to Managing and Integrating Scientific Data Using Workflows and Ontologies

What's all the data about? - Linking and Profiling of Linked Datasets
What's all the data about? - Linking and Profiling of Linked DatasetsWhat's all the data about? - Linking and Profiling of Linked Datasets
What's all the data about? - Linking and Profiling of Linked DatasetsStefan Dietze
 
IASSIT Kansa Presentation
IASSIT Kansa PresentationIASSIT Kansa Presentation
IASSIT Kansa Presentationekansa
 
Research Objects: more than the sum of the parts
Research Objects: more than the sum of the partsResearch Objects: more than the sum of the parts
Research Objects: more than the sum of the partsCarole Goble
 
Knowledge Graph Research and Innovation Challenges
Knowledge Graph Research and Innovation ChallengesKnowledge Graph Research and Innovation Challenges
Knowledge Graph Research and Innovation ChallengesSören Auer
 
Beyond Preservation: Situating Archaeological Data in Professional Practice
Beyond Preservation: Situating Archaeological Data in Professional PracticeBeyond Preservation: Situating Archaeological Data in Professional Practice
Beyond Preservation: Situating Archaeological Data in Professional PracticeEric Kansa
 
Metadata as Linked Data for Research Data Repositories
Metadata as Linked Data for Research Data RepositoriesMetadata as Linked Data for Research Data Repositories
Metadata as Linked Data for Research Data Repositoriesandrea huang
 
Open Context and Publishing to the Web of Data: Eric Kansa's LAWDI Presentation
Open Context and Publishing to the Web of Data: Eric Kansa's LAWDI PresentationOpen Context and Publishing to the Web of Data: Eric Kansa's LAWDI Presentation
Open Context and Publishing to the Web of Data: Eric Kansa's LAWDI Presentationekansa
 
Tutorial@BDA 2017 -- Knowledge Graph Expansion and Enrichment
Tutorial@BDA 2017 -- Knowledge Graph Expansion and Enrichment Tutorial@BDA 2017 -- Knowledge Graph Expansion and Enrichment
Tutorial@BDA 2017 -- Knowledge Graph Expansion and Enrichment Paris Sud University
 
Web Science Synergies: Exploring Web Knowledge through the Semantic Web
Web Science Synergies: Exploring Web Knowledge through the Semantic WebWeb Science Synergies: Exploring Web Knowledge through the Semantic Web
Web Science Synergies: Exploring Web Knowledge through the Semantic WebStefan Dietze
 
Data integration with a façade. The case of knowledge graph construction.
Data integration with a façade. The case of knowledge graph construction.Data integration with a façade. The case of knowledge graph construction.
Data integration with a façade. The case of knowledge graph construction.Enrico Daga
 
Moving forward data centric sciences weaving AI, Big Data & HPC
Moving forward data centric sciences  weaving AI, Big Data & HPCMoving forward data centric sciences  weaving AI, Big Data & HPC
Moving forward data centric sciences weaving AI, Big Data & HPCGenoveva Vargas-Solar
 
Knowledge Graph Introduction
Knowledge Graph IntroductionKnowledge Graph Introduction
Knowledge Graph IntroductionSören Auer
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodDuncan Hull
 
Visualising the Australian open data and research data landscape
Visualising the Australian open data and research data landscapeVisualising the Australian open data and research data landscape
Visualising the Australian open data and research data landscapeJonathan Yu
 
Turning Data into Knowledge (KESW2014 Keynote)
Turning Data into Knowledge (KESW2014 Keynote)Turning Data into Knowledge (KESW2014 Keynote)
Turning Data into Knowledge (KESW2014 Keynote)Stefan Dietze
 

Similar to Managing and Integrating Scientific Data Using Workflows and Ontologies (20)

What's all the data about? - Linking and Profiling of Linked Datasets
What's all the data about? - Linking and Profiling of Linked DatasetsWhat's all the data about? - Linking and Profiling of Linked Datasets
What's all the data about? - Linking and Profiling of Linked Datasets
 
E research overview gahegan bioinformatics workshop 2010
E research overview gahegan bioinformatics workshop 2010E research overview gahegan bioinformatics workshop 2010
E research overview gahegan bioinformatics workshop 2010
 
IASSIT Kansa Presentation
IASSIT Kansa PresentationIASSIT Kansa Presentation
IASSIT Kansa Presentation
 
Research Objects: more than the sum of the parts
Research Objects: more than the sum of the partsResearch Objects: more than the sum of the parts
Research Objects: more than the sum of the parts
 
Knowledge Graph Research and Innovation Challenges
Knowledge Graph Research and Innovation ChallengesKnowledge Graph Research and Innovation Challenges
Knowledge Graph Research and Innovation Challenges
 
Semantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including AstrophysicsSemantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including Astrophysics
 
Beyond Preservation: Situating Archaeological Data in Professional Practice
Beyond Preservation: Situating Archaeological Data in Professional PracticeBeyond Preservation: Situating Archaeological Data in Professional Practice
Beyond Preservation: Situating Archaeological Data in Professional Practice
 
Intro to Data Science Concepts
Intro to Data Science ConceptsIntro to Data Science Concepts
Intro to Data Science Concepts
 
Metadata as Linked Data for Research Data Repositories
Metadata as Linked Data for Research Data RepositoriesMetadata as Linked Data for Research Data Repositories
Metadata as Linked Data for Research Data Repositories
 
Open Context and Publishing to the Web of Data: Eric Kansa's LAWDI Presentation
Open Context and Publishing to the Web of Data: Eric Kansa's LAWDI PresentationOpen Context and Publishing to the Web of Data: Eric Kansa's LAWDI Presentation
Open Context and Publishing to the Web of Data: Eric Kansa's LAWDI Presentation
 
Tutorial@BDA 2017 -- Knowledge Graph Expansion and Enrichment
Tutorial@BDA 2017 -- Knowledge Graph Expansion and Enrichment Tutorial@BDA 2017 -- Knowledge Graph Expansion and Enrichment
Tutorial@BDA 2017 -- Knowledge Graph Expansion and Enrichment
 
Web Science Synergies: Exploring Web Knowledge through the Semantic Web
Web Science Synergies: Exploring Web Knowledge through the Semantic WebWeb Science Synergies: Exploring Web Knowledge through the Semantic Web
Web Science Synergies: Exploring Web Knowledge through the Semantic Web
 
Data integration with a façade. The case of knowledge graph construction.
Data integration with a façade. The case of knowledge graph construction.Data integration with a façade. The case of knowledge graph construction.
Data integration with a façade. The case of knowledge graph construction.
 
Moving forward data centric sciences weaving AI, Big Data & HPC
Moving forward data centric sciences  weaving AI, Big Data & HPCMoving forward data centric sciences  weaving AI, Big Data & HPC
Moving forward data centric sciences weaving AI, Big Data & HPC
 
Knowledge Graph Introduction
Knowledge Graph IntroductionKnowledge Graph Introduction
Knowledge Graph Introduction
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
Visualising the Australian open data and research data landscape
Visualising the Australian open data and research data landscapeVisualising the Australian open data and research data landscape
Visualising the Australian open data and research data landscape
 
Dwdm
DwdmDwdm
Dwdm
 
A Clean Slate?
A Clean Slate?A Clean Slate?
A Clean Slate?
 
Turning Data into Knowledge (KESW2014 Keynote)
Turning Data into Knowledge (KESW2014 Keynote)Turning Data into Knowledge (KESW2014 Keynote)
Turning Data into Knowledge (KESW2014 Keynote)
 

More from Bertram Ludäscher

Games, Queries, and Argumentation Frameworks: Time for a Family Reunion
Games, Queries, and Argumentation Frameworks: Time for a Family ReunionGames, Queries, and Argumentation Frameworks: Time for a Family Reunion
Games, Queries, and Argumentation Frameworks: Time for a Family ReunionBertram Ludäscher
 
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!Bertram Ludäscher
 
[Flashback] Integration of Active and Deductive Database Rules
[Flashback] Integration of Active and Deductive Database Rules[Flashback] Integration of Active and Deductive Database Rules
[Flashback] Integration of Active and Deductive Database RulesBertram Ludäscher
 
[Flashback] Statelog: Integration of Active & Deductive Database Rules
[Flashback] Statelog: Integration of Active & Deductive Database Rules[Flashback] Statelog: Integration of Active & Deductive Database Rules
[Flashback] Statelog: Integration of Active & Deductive Database RulesBertram Ludäscher
 
Answering More Questions with Provenance and Query Patterns
Answering More Questions with Provenance and Query PatternsAnswering More Questions with Provenance and Query Patterns
Answering More Questions with Provenance and Query PatternsBertram Ludäscher
 
Computational Reproducibility vs. Transparency: Is It FAIR Enough?
Computational Reproducibility vs. Transparency: Is It FAIR Enough?Computational Reproducibility vs. Transparency: Is It FAIR Enough?
Computational Reproducibility vs. Transparency: Is It FAIR Enough?Bertram Ludäscher
 
Which Model Does Not Belong: A Dialogue
Which Model Does Not Belong: A DialogueWhich Model Does Not Belong: A Dialogue
Which Model Does Not Belong: A DialogueBertram Ludäscher
 
From Workflows to Transparent Research Objects and Reproducible Science Tales
From Workflows to Transparent Research Objects and Reproducible Science TalesFrom Workflows to Transparent Research Objects and Reproducible Science Tales
From Workflows to Transparent Research Objects and Reproducible Science TalesBertram Ludäscher
 
From Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science TalesFrom Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science TalesBertram Ludäscher
 
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of UsPossible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of UsBertram Ludäscher
 
Deduktive Datenbanken & Logische Programme: Eine kleine Zeitreise
Deduktive Datenbanken & Logische Programme: Eine kleine ZeitreiseDeduktive Datenbanken & Logische Programme: Eine kleine Zeitreise
Deduktive Datenbanken & Logische Programme: Eine kleine ZeitreiseBertram Ludäscher
 
Dissecting Reproducibility: A case study with ecological niche models in th...
Dissecting Reproducibility:  A case study with ecological niche models  in th...Dissecting Reproducibility:  A case study with ecological niche models  in th...
Dissecting Reproducibility: A case study with ecological niche models in th...Bertram Ludäscher
 
Incremental Recomputation: Those who cannot remember the past are condemned ...
Incremental Recomputation:  Those who cannot remember the past are condemned ...Incremental Recomputation:  Those who cannot remember the past are condemned ...
Incremental Recomputation: Those who cannot remember the past are condemned ...Bertram Ludäscher
 
Validation and Inference of Schema-Level Workflow Data-Dependency Annotations
Validation and Inference of Schema-Level Workflow Data-Dependency AnnotationsValidation and Inference of Schema-Level Workflow Data-Dependency Annotations
Validation and Inference of Schema-Level Workflow Data-Dependency AnnotationsBertram Ludäscher
 
An ontology-driven framework for data transformation in scientific workflows
An ontology-driven framework for data transformation in scientific workflowsAn ontology-driven framework for data transformation in scientific workflows
An ontology-driven framework for data transformation in scientific workflowsBertram Ludäscher
 
Knowledge Representation & Reasoning and the Hierarchy-of-Hypotheses Approach
Knowledge Representation & Reasoning and the Hierarchy-of-Hypotheses ApproachKnowledge Representation & Reasoning and the Hierarchy-of-Hypotheses Approach
Knowledge Representation & Reasoning and the Hierarchy-of-Hypotheses ApproachBertram Ludäscher
 
Whole-Tale: The Experience of Research
Whole-Tale: The Experience of ResearchWhole-Tale: The Experience of Research
Whole-Tale: The Experience of ResearchBertram Ludäscher
 
ETC & Authors in the Driver's Seat
ETC & Authors in the Driver's SeatETC & Authors in the Driver's Seat
ETC & Authors in the Driver's SeatBertram Ludäscher
 
From Provenance Standards and Tools to Queries and Actionable Provenance
From Provenance Standards and Tools to Queries and Actionable ProvenanceFrom Provenance Standards and Tools to Queries and Actionable Provenance
From Provenance Standards and Tools to Queries and Actionable ProvenanceBertram Ludäscher
 
Wild Ideas at TDWG'17: Embrace multiple possible worlds; abandon techno-ligion
Wild Ideas at TDWG'17: Embrace multiple possible worlds; abandon techno-ligionWild Ideas at TDWG'17: Embrace multiple possible worlds; abandon techno-ligion
Wild Ideas at TDWG'17: Embrace multiple possible worlds; abandon techno-ligionBertram Ludäscher
 

More from Bertram Ludäscher (20)

Games, Queries, and Argumentation Frameworks: Time for a Family Reunion
Games, Queries, and Argumentation Frameworks: Time for a Family ReunionGames, Queries, and Argumentation Frameworks: Time for a Family Reunion
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion
 
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!
 
[Flashback] Integration of Active and Deductive Database Rules
[Flashback] Integration of Active and Deductive Database Rules[Flashback] Integration of Active and Deductive Database Rules
[Flashback] Integration of Active and Deductive Database Rules
 
[Flashback] Statelog: Integration of Active & Deductive Database Rules
[Flashback] Statelog: Integration of Active & Deductive Database Rules[Flashback] Statelog: Integration of Active & Deductive Database Rules
[Flashback] Statelog: Integration of Active & Deductive Database Rules
 
Answering More Questions with Provenance and Query Patterns
Answering More Questions with Provenance and Query PatternsAnswering More Questions with Provenance and Query Patterns
Answering More Questions with Provenance and Query Patterns
 
Computational Reproducibility vs. Transparency: Is It FAIR Enough?
Computational Reproducibility vs. Transparency: Is It FAIR Enough?Computational Reproducibility vs. Transparency: Is It FAIR Enough?
Computational Reproducibility vs. Transparency: Is It FAIR Enough?
 
Which Model Does Not Belong: A Dialogue
Which Model Does Not Belong: A DialogueWhich Model Does Not Belong: A Dialogue
Which Model Does Not Belong: A Dialogue
 
From Workflows to Transparent Research Objects and Reproducible Science Tales
From Workflows to Transparent Research Objects and Reproducible Science TalesFrom Workflows to Transparent Research Objects and Reproducible Science Tales
From Workflows to Transparent Research Objects and Reproducible Science Tales
 
From Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science TalesFrom Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science Tales
 
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of UsPossible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
 
Deduktive Datenbanken & Logische Programme: Eine kleine Zeitreise
Deduktive Datenbanken & Logische Programme: Eine kleine ZeitreiseDeduktive Datenbanken & Logische Programme: Eine kleine Zeitreise
Deduktive Datenbanken & Logische Programme: Eine kleine Zeitreise
 
Dissecting Reproducibility: A case study with ecological niche models in th...
Dissecting Reproducibility:  A case study with ecological niche models  in th...Dissecting Reproducibility:  A case study with ecological niche models  in th...
Dissecting Reproducibility: A case study with ecological niche models in th...
 
Incremental Recomputation: Those who cannot remember the past are condemned ...
Incremental Recomputation:  Those who cannot remember the past are condemned ...Incremental Recomputation:  Those who cannot remember the past are condemned ...
Incremental Recomputation: Those who cannot remember the past are condemned ...
 
Validation and Inference of Schema-Level Workflow Data-Dependency Annotations
Validation and Inference of Schema-Level Workflow Data-Dependency AnnotationsValidation and Inference of Schema-Level Workflow Data-Dependency Annotations
Validation and Inference of Schema-Level Workflow Data-Dependency Annotations
 
An ontology-driven framework for data transformation in scientific workflows
An ontology-driven framework for data transformation in scientific workflowsAn ontology-driven framework for data transformation in scientific workflows
An ontology-driven framework for data transformation in scientific workflows
 
Knowledge Representation & Reasoning and the Hierarchy-of-Hypotheses Approach
Knowledge Representation & Reasoning and the Hierarchy-of-Hypotheses ApproachKnowledge Representation & Reasoning and the Hierarchy-of-Hypotheses Approach
Knowledge Representation & Reasoning and the Hierarchy-of-Hypotheses Approach
 
Whole-Tale: The Experience of Research
Whole-Tale: The Experience of ResearchWhole-Tale: The Experience of Research
Whole-Tale: The Experience of Research
 
ETC & Authors in the Driver's Seat
ETC & Authors in the Driver's SeatETC & Authors in the Driver's Seat
ETC & Authors in the Driver's Seat
 
From Provenance Standards and Tools to Queries and Actionable Provenance
From Provenance Standards and Tools to Queries and Actionable ProvenanceFrom Provenance Standards and Tools to Queries and Actionable Provenance
From Provenance Standards and Tools to Queries and Actionable Provenance
 
Wild Ideas at TDWG'17: Embrace multiple possible worlds; abandon techno-ligion
Wild Ideas at TDWG'17: Embrace multiple possible worlds; abandon techno-ligionWild Ideas at TDWG'17: Embrace multiple possible worlds; abandon techno-ligion
Wild Ideas at TDWG'17: Embrace multiple possible worlds; abandon techno-ligion
 

Recently uploaded

Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 

Recently uploaded (20)

Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 

Managing and Integrating Scientific Data Using Workflows and Ontologies

  • 1. Managing Scientific Data: From Data Integration to Scientific Workflows Bertram Ludäscher ludaesch@ucdavis.edu San Diego Supercomputer Center Associate Professor Dept. of Computer Science & Genome Center University of California, Davis Fellow San Diego Supercomputer Center University of California, San Diego UC DAVIS Department of Computer Science
  • 2. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Outline • Data Integration & Mediation • Challenges with Scientific Data • Knowledge-based Extensions & Ontologies • Scientific Workflows
  • 3. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 An Online Shopper’s Information Integration Problem El Cheapo: “Where can I get the cheapest copy (including shipping cost) of Wittgenstein’s Tractatus Logicus-Philosophicus within a week?” ? Information Integration addall.com “One-World” Mediation amazon.com A1books.comhalf.combarnes&noble.com Mediator (virtual DB) (vs. Datawarehouse) NOTE: non-trivial data engineering challenges!
  • 4. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 A Home Buyer’s Information Integration Problem What houses for sale under $500k have at least 2 bathrooms, 2 bedrooms, a nearby school ranking in the upper third, in a neighborhood with below-average crime rate and diverse population? ? Information Integration Realtor DemographicsSchool RankingsCrime Stats “Multiple-Worlds” Mediation
  • 5. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 A Neuroscientist’s Information Integration Problem What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents? ? Information Integration protein localization (NCMIR) neurotransmission (SENSELAB) sequence info (CaPROT) morphometry (SYNAPSE) “Complex Multiple-Worlds” Mediation Biomedical Informatics Research Network http://nbirn.net Inter-source links: • unclear for the non-scientists • hard for the scientist
  • 6. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
  • 7. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Interoperability & Integration Challenges • System aspects: “Grid” Middleware • distributed data & computing, SOA • resource discovery, authentication, authorization • web services, WSDL/SOAP, WSRF, OGSA, … • (re-)sources = services, files, data sets, nodes … • Syntax & Structure: (XML-Based) Data Mediators • wrapping, restructuring • (XML) queries and views • sources = (XML) databases • Semantics: Model-Based/Semantic Mediators • conceptual models and declarative views • Knowledge Representation: ontologies, description logics (RDF(S),OWL ...) • sources = knowledge bases (DB+CMs+ICs) • Synthesis: Scientific Workflow Design & Execution • Composition of declarative and procedural components into larger workflows • (re)sources = services, processes, actors, … Ø reconciling S5 heterogeneities Ø “gluing” together resources Ø bridging information and knowledge gaps computationally
  • 8. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Information Integration Challenges: S4 Heterogeneities • System aspects – platforms, devices, data & service distribution, APIs, protocols, … è Grid middleware technologies + e.g. single sign-on, platform independence, transparent use of remote resources, … • Syntax & Structure – heterogeneous data formats (one for each tool ...) – heterogeneous data models (RDBs, ORDBs, OODBs, XMLDBs, flat files, …) – heterogeneous schemas (one for each DB ...) è Database mediation technologies + XML-based data exchange, integrated views, transparent query rewriting, … • Semantics – descriptive metadata, different terminologies, “hidden” semantics (context), implicit assumptions, … è Knowledge representation & semantic mediation technologies + “smart” data discovery & integration + e.g. ask about X (‘mafic’); find data about Y (‘diorite’); be happy anyways!
  • 9. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Information Integration Challenges: S5 Heterogeneities • Synthesis of applications, analysis tools, data & query components, … into “scientific workflows” – How to put together components to solve a scientist’s problem? è Scientific Problem Solving Environments (PSEs) è Portals, Workbench (“scientist’s view”) + ontology-enhanced data registration, discovery, manipulation + creation and registration of new data products from existing ones, … è Scientific Workflow System (“engineer’s view”) + for designing, re-engineering, deploying analysis pipelines and scientific workflows; a tool to make new tools … + e.g., creation of new datasets from existing ones, dataset registration, …
  • 10. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Information Integration from a Database Perspective • Information Integration Problem – Given: data sources S1, ..., Sk (databases, web sites, ...) and user questions Q1,..., Qn that can –in principle– be answered using the information in the Si – Find: the answers to Q1, ..., Qn • The Database Perspective: source = “database” Þ Si has a schema (relational, XML, OO, ...) Þ Si can be queried Þ define virtual (or materialized) integrated (or global) view G over local sources S1 ,..., Sk using database query languages (SQL, XQuery,...) Þ questions become queries Qi against G(S1,..., Sk)
  • 11. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 5. Post processing2. Query rewriting Standard (XML-Based) Mediator Architecture MEDIATOR Integrated Global (XML) View G Integrated View Definition G(..)¬ S1(..)…Sk(..) USER/Client 1. Query Q ( G (S1,..., Sk) ) S1 Wrapper (XML) View S2 Wrapper (XML) View Sk Wrapper (XML) View web services as wrapper APIs 3. Q1 Q2 Q3 4. {answers(Q1)} {answers(Q2)} {answers(Q3)} 6. {answers(Q)}
  • 12. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Query Planning in Data Integration • Given: – Declarative user query Q: answer(…) :– …G ... – … & { G :– … L … } global-as-view (GAV) – … & { L :– … G … } local-as-view (LAV) – … & { ic(…) :– … L … G… } integrity constraints (ICs) • Find: – equivalent (or minimal containing, maximal contained) query plan Q’: answer(…) :– … L … è query rewriting (logical/calculus, algebraic, physical levels) • Results: – A variety of results/algorithms; depending on classes of queries, views, and ICs: P, NP, … , undecidable – hot research area in core CS (database community)
  • 13. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Scientific Data Integration using Semantic Extensions
  • 14. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
  • 15. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Example: Geologic Map Integration • Given: – Geologic maps from different state geological surveys (shapefiles w/ different data schemas) – Different ontologies: • Geologic age ontology (e.g. USGS) • Rock classification ontologies: – Multiple hierarchies (chemical, fabric, texture, genesis) from Geological Survey of Canada (GSC) – Single hierarchy from British Geological Survey (BGS) • Problem: – Support uniform queries across all map – … using different ontologies – Support registration w/ ontology A, querying w/ ontology B
  • 16. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Integration Schema Schema Integration (“registering” local schemas to the global schema) Arizona Colorado Utah Nevada Wyoming New Mexico Montana E. Idaho Montana West Formation … Age … Formation … Age … Formation … Age … Formation … Age … Formation … Age … Formation … Age … Formation … Age … … Formation … Age … Composition … Fabric … Texture … Formation … Age … Composition … Fabric … Texture ABBREV PERIOD PERIOD NAME PERIOD TYPE TIME_UNIT FMATN PERIOD NAME PERIOD NAME FORMATION PERIOD FORMATION FORMATION LITHOLOGY LITHOLOGY AGE AGE andesitic sandstone Livingston formation Tertiary- Cretaceous Sources Sources
  • 17. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Multihierarchical Rock Classification “Ontology” (Taxonomies) for “Thematic Queries” (GSC) Composition Genesis Fabric Texture
  • 18. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Ontology-Enabled Application Example: Geologic Map Integration Show formations where AGE = ‘Paleozic’ (without age ontology) Show formations where AGE = ‘Paleozic’ (with age ontology) +/- a few hundred million years domain knowledge Nevada
  • 19. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Querying by Geologic Age …
  • 20. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Querying by Geologic Age: Results
  • 21. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Querying by Chemical Composition … (GSC)
  • 22. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Semantic Mediation (via “semantic registration” of schemas and ontology articulations) • Schema elements and/or data values are associated with concept expressions from the target ontology è conceptual queries “through” the ontology • Articulation ontology è source registration to A, querying through B • Semantic mediation: query rewriting w/ ontologies Database1 Database2 Ontology A Ontology B semantic registration ontology articulations Concept-based (“semantic”) queries semantic registration
  • 23. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Different views on State Geological Maps
  • 24. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Sedimentary Rocks: BGS Ontology
  • 25. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Sedimentary Rocks: GSC Ontology
  • 26. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Implementation in OWL: Not only “for the machine” …
  • 27. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Source Contextualization through Ontology Refinement In addition to registering (“hanging off”) data relative to existing concepts, a source may also refine the mediator’s domain map... Þ sources can register new concepts at the mediator ...
  • 29. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Motivation: Scientific Workflows, Pre-Cyberinfrastructure • Data Federation & Grid “Plumbing”: – access, move, replicate, query … data (Data-Grid) • authenticate … SRB Sget/Sput … OPeNDAP, … Antelope/ORBs – schedule, launch, monitor jobs (Compute-Grid) • Globus, Condor, Nimrod, APST, … • Data Integration: – Conceptual querying & integration, structure & semantics, e.g. mediation w/ SQL, XQuery + OWL (Semantics-enabled Mediator) • Data Analysis, Mining, Knowledge Discovery: – manual/textbook (e.g. ternary diagrams), Excel, R, simulations, … • Visualization: – 3-D (volume), 4-D (spatio-temporal), n-D (conceptual views) … è one-of-a-kind custom apps., detached (island) solutions è workflows are hard to reproduce, maintain è no/little workflow design, automation, reuse, documentation è need for an integrated scientific workflow environment
  • 30. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 What is a Scientific Workflow (SWF)? • Model the way scientists work with their data and tools – Mentally coordinate data export, import, analysis via software systems • Scientific workflows emphasize data flow (≠ business workflows) • Metadata (incl. provenance info, semantic types etc.) is crucial for automated data ingestion, data analysis, … • Goals: – SWF automation, – SWF & component reuse, – SWF design & documentation – making scientists’ data analysis and management easier!
  • 31. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Some Scientific Workflow Features • Typical requirements/characteristics: – data-intensive and/or compute-intensive – plumbing-intensive – dataflow-oriented – distribution (data, processing) – user-interaction “in the middle”, … – … vs. (C-z; bg; fg)-ing (“detach” and reconnect) – advanced programming constructs (map(f), zip, takewhile, …) – logging, provenance, “registering back” (intermediate) products – … • … easy to recognize a SWF when you see one!
  • 32. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Promoter Identification Workflow (Napkin Drawing) Source: Matt Coleman (LLNL)
  • 33. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Promoter Identification Workflow in Kepler
  • 34. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Ecology: Invasive Species Prediction (Napkin Drawing) Training sample (d) GARP rule set (e) Test sample (d) Integrated layers (native range) (c) Species presence & absence points (native range) (a)EcoGrid Query EcoGrid Query Layer Integration Layer Integration Sample Data +A3 +A2 +A1 Data Calculation Map Generation Validation User Validation Map Generation Integrated layers (invasion area) (c) Species presence &absence points (invasion area) (a) Native range predictio n map (f) Model quality parameter (g) Environmental layers (native range) (b) Generate Metadata Archive To Ecogrid Registered Ecogrid Database Registered Ecogrid Database Registered Ecogrid Database Registered Ecogrid Database Environmental layers (invasion area) (b) Invasion area prediction map (f) Model quality parameter (g) Selected predictio n maps (h) Source: NSF SEEK (Deana Pennington et. al, UNM)
  • 35. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Ecological Niche Modeling in Kepler (200 to 500 runs per species x 2000 mammal species x 3 minutes/run) = 833 to 2083 days
  • 36. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 GEON Analysis Workflow in KEPLER
  • 37. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Commercial & Open Source Scientific Workflow and (Dataflow) Systems & Problem Solving Environments Kensington Discovery Edition from InforSense Taverna Triana SciRUN II
  • 38. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 see! try! read! Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/ Our Starting Point: Ptolemy II
  • 39. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Why Ptolemy II ? • Ptolemy II Objective: – “The focus is on assembly of concurrent components. The key underlying principle in the project is the use of well-defined models of computation that govern the interaction between components. A major problem area being addressed is the use of heterogeneous mixtures of models of computation.” • Dataflow Process Networks w/ natural support for abstraction, pipelining (streaming) actor-orientation, actor reuse • User-Orientation – Workflow design & exec console (Vergil GUI) – “Application/Glue-Ware” • excellent modeling and design support • run-time support, monitoring, … • not a middle-/underware (we use someone else’s, e.g. Globus, SRB, …) • but middle-/underware is conveniently accessible through actors! • PRAGMATICS – Ptolemy II is mature, continuously extended & improved, well-documented (500+pp) – open source system – many research results – Ptolemy II participation in Kepler
  • 40. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 KEPLER/CSP: Contributors, Sponsors, Projects Ilkay Altintas SDM, NLADR, Resurgence, EOL, … Kim Baldridge Resurgence, NMI Chad Berkley SEEK Shawn Bowers SEEK Terence Critchlow SDM Tobin Fricke ROADNet Jeffrey Grethe BIRN Christopher H. Brooks Ptolemy II Zhengang Cheng SDM Dan Higgins SEEK Efrat Jaeger GEON Matt Jones SEEK Werner Krebs, EOL Edward A. Lee Ptolemy II Kai Lin GEON Bertram Ludaescher SDM, SEEK, GEON, BIRN, ROADNet Mark Miller EOL Steve Mock NMI Steve Neuendorffer Ptolemy II Jing Tao SEEK Mladen Vouk SDM Xiaowen Xin SDM Yang Zhao Ptolemy II Bing Zhu SEEK ••• Ptolemy II www.kepler-project.org LLNL, NCSU, SDSC, UCB, UCD, UCSB, UCSD, U Man… Utah,…, UTEP, …, Zurich Collab. tools: IRC, cvs, skype, Wiki: hotTopics, FAQs, .. Ptolemy II SPA
  • 41. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 GEON Dataset Generation & Registration (and co-development in KEPLER) Xiaowen (SDM) Edward et al.(Ptolemy) Yang (Ptolemy) Efrat (GEON) Ilkay (SDM) SQL database access (JDBC) Matt et al. (SEEK) % Makefile $> ant run
  • 42. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Web Services è Actors (WS Harvester) 1 2 3 4 è “Minute-made” (MM) WS-based application integration • Similarly: MM workflow design & sharing w/o implemented components
  • 43. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Some KEPLER Actors (out of 160+ … and counting…) This image cannot currently be displayed.
  • 44. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Kepler Today: Some Numbers • #Actors: – Kepler: ~160 new + ~120 inherited (PTII) – soon there can be thousands (harvested from web services, R packages, etc.) • #Developers: – ~ 24+, ~10 very active; more coming… (we think :-) • #CVS Repositories: ~2 – hopefully not increasing… :-{ • # “Production-level” WFs: – currently ~8, expected to increase quite a bit …
  • 45. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 KEPLER Tomorrow • Application-driven extensions (here: SDM): – access to/integration with other IDMAF components • PnetCDF?, PVFS(2)?, MPI-IO?, parallel-R?, ASPECT?, FastBit, … – support for execution of new SWF domains • Astrophysics, Fusion, …. • Further generic extensions: – addtl. support for data-intensive and compute-intensive workflows (all SRB Scommands, CCA support, …) – semantics-intensive workflows – (C-z; bg; fg)-ing (“detach” and reconnect) – workflow deployment models – distributed execution • Additional “domain awareness” (esp. via new directors) – time series, parameter sweeps, job scheduling (CONDOR, Globus, …) – hybrid type system with semantic types (“Sparrow” extensions) • Consolidation – More installers, regular releases, improved usability, documentation, …
  • 46. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 A User’s Wish List • Usability • Closing the “lid” (cf. vnc) • Dynamic plug-in of actors (cf. actor & data registries/repositories) • Distributed WF execution • Collection-based programming • Grid awareness • Semantics awareness • WF Deployment (as a web site, as a web service, …) • “Power apps” (è SciRUN II) • …
  • 47. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 hand-crafted control solution; also: forces sequential execution! designed to fit designed to fit hand-crafted Web-service actor Complex backward control-flow No data transformations available [Altintas-et-al-PIW-SSDBM’03]
  • 48. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 A Scientific Workflow Problem: More Solved (Computer Scientist’s view) • Solution based on declarative, functional dataflow process network (= also a data streaming model!) • Higher-order constructs: map(f) Þ no control-flow spaghetti Þ data-intensive apps Þ free concurrent execution Þ free type checking Þ automatic support to go from piw(GeneId) to PIW :=map(piw) over [GeneId] map(f)-style iterators Powerful type checking Generic, declarative “programming” constructs Generic data transformation actors Forward-only, abstractable sub- workflow piw(GeneId)
  • 49. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 A Scientific Workflow Problem: Even More Solved (domain&CS coming together!) map(GenbankWS) Input: {“NM_001924”, “NM020375”} Output: {“CAGT…AATATGAC",“GGGGA…CAAAGA“}
  • 50. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Research Problem: Optimization by Rewriting • Example: PIW as a declarative, referentially transparent functional process Þ optimization via functional rewriting possible e.g. map(f o g) = map(f) o map(g) • Technical report &PIW specification in Haskell map(f o g) instead of map(f) o map(g) Combination of map and zip http://kbis.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
  • 51. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Job Management (here: NIMROD) • Job management infrastructure in place • Results database: under development • Goal: 1000’s of GAMESS jobs (quantum mechanics)
  • 52. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 A Distributed Approach Client Servers Computer Network Service Locator (Peer Discovery) Simulation is orchestrated in a centralized manner Source: Daniel Lázaro Cuadrado, Aalborg University
  • 53. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Separation of Concerns • A shining example: – Ptolemy Directors – “factoring out” the concern of workflow “orchestration” (MoC) – common aspects of overall execution not left to the actors • Similarly: – The “Black Box” (“flight recorder”) • a kind of “recording central” to avoid wiring 100’s of components to recording-actor(s) – The “Red Box” (error handling, fault tolerance) • ……… – The “Yellow Box” (type checking) • ……… – The “Blue Box” (shipping-and-handling) • central handling of data transport (by value, by reference, by scp, SRB, GridFTP, …) SDF/PN/DE/… Recorder SHA @ Static Analysis On Error
  • 54. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Separation of Concerns: Port Types • Token consumption (& production) “type” – a director’s concern • Token “transport type” – by value, reference (which one), protocol (SOAP, scp, GridFTP, scp, SRB, …) – a SHA concern • Structural and semantic types – SAT (static analysis & typing) concern – built after static unit type system… • static unit type system as a special case!?
  • 55. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 • Label data with semantic types (concept expressions from an ontology) • Label inputs and outputs of analytical components with semantic types Example: Data has COUNT and AREA; workflow wants DENSITY • è via ontology, system “knows” that data can still be used (because DENSITY := COUNT/AREA) • Use reasoning engines to generate transformation steps • Use reasoning engine to discover relevant components Need for Semantic Annotations of data & actors Data Ontology Workflow Components
  • 56. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 A Scientist’s “Semantic” View of Actors S1 (life stage property) S2 (mortality rate for period) P1 P2 P4 P3 P5 Phase Observed Period Phases Eggs Instar I Instar II Instar III Instar IV Adults 44,000 3,513 2,529 1,922 1,461 1,300 Nymphal {Instar I, Instar II, Instar III, Instar IV} Population samples for life stages of the common field grasshopper [Begon et al, 1996] Periods of development in terms of phases life stage periods k-value for each period of observation [(nymphal, 0.44)] observations Source: [Bowers-Ludaescher, DILS’04]
  • 57. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Structural Type (XML DTD) Annotations S1 (life stage property) S2 (mortality rate for period) P1 P2 P4 P3 P5 root population = (sample)* elem sample = (meas, lsp) elem meas = (cnt, acc) elem cnt = xsd:integer elem acc = xsd:double elem lsp = xsd:string <population> <sample> <meas> <cnt>44,000</cnt> <acc>0.95</acc> </meas> <lsp>Eggs</lsp> </sample> … <population> root cohortTable = (measurement)* elem measuremnt = (phase, obs) elem phase = xsd:string elem obs = xsd:integer <cohortTable> <measurement> <phase>Eggs</cnt> <obs>44,000</acc> </measurement> … <cohortTable> structType(P2) structType(P3) Source: [Bowers-Ludaescher, DILS’04]
  • 58. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 A KR+DI+Scientific Workflow Problem • Services can be semantically compatible, but structurally incompatible Source Service Target Service Ps Pt Semantic Type Ps Semantic Type Pt Structural Type Pt Structural Type Ps Desired Connection Incompatible Compatible (⋠) (⊑) d(Ps) d (≺) Ontologies (OWL) Source: [Bowers-Ludaescher, DILS’04]
  • 59. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 The Ontology-Driven Framework Source Service Target Service Ps Pt Semantic Type Ps Semantic Type Pt Structural Type Pt Structural Type Ps Desired Connection Compatible (⊑) Registration Mapping (Output) Registration Mapping (Input) Correspondence Ontologies (OWL)
  • 60. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Ontology-Guided Data Transformation Source Service Target Service Ps Pt Semantic Type Ps Semantic Type Pt Structural Type Pt Structural Type Ps Desired Connection Compatible (⊑) Structural/Semantic Association Structural/Semantic Association Correspondence Generate d(Ps) Ontologies (OWL) Transformation Source: [Bowers-Ludaescher, DILS’04]
  • 61. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Kepler Actor-Library w/ Concept Index • How do you find the right component (actor)? è Ontology-based actor organization / browsing è Simple text-based and concept-based searching • Next: ontology-based workflow design Workflow Components (MoML) Ontologies (OWL) Default + Other Semantic Annotations urn ids instance expressions
  • 62. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005
  • 63. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Scientific Workflow Design • Support SWF design & reuse, via: – Structural data types – Semantic types – Associations (=constraints) between them – Type checking, inference, propagation èSeparation of concerns: – structure, semantics, WF orchestration, etc.
  • 64. Scientific Data Integration & Workflows, B. LudäscherInformatik Kolloquium, U Göttingen, 13.7.2005 Q & A