  • Applications are first-class services: users don’t have to specify paths to executables, working directories, etc. Examples of what the operations may look like. Security, state management, and scheduling are all configured optionally; jobs can even run on a Windows box without a scheduler. Different services can exist in the same container, all accessible by separate URLs. Clients send command-line arguments and input files to services (Base64 binary).
  • 1. Configuration file - metadata (usage, etc), binary location, default arguments, parallel flag. 2. Service properties - globus and scheduler information, database information, mpi location, etc 3. Application writers don’t have to write a single line of code to expose an application as a Web service. Users pass the command line and list of files as arguments
  • Tomcat - servlet container, hosting environment Axis - SOAP engine
  • All the above operations defined by the Opal WSDL. Operations stay the same for all applications - no additional operations required as all applications can be invoked via command line
  • MEME: discover motifs (highly conserved regions) in groups of related DNA or protein sequences MAST: search sequence databases using motifs

    1. 1. Kepler, Opal and Gemstone Amarnath Gupta University of California San Diego
    2. 2. Changing Needs for Scientific Process <ul><li>Observe → Hypothesize → Conduct experiment → </li></ul><ul><li>Analyze data → Compare results and Conclude → </li></ul><ul><li>Predict </li></ul>Traditional Scientific Process (before computers) Yesterday…at least for some of us!
    3. 3. What’s different in today’s science? <ul><li>Observe → Hypothesize → Conduct experiment → </li></ul><ul><li>Analyze data → Compare results and Conclude → </li></ul><ul><li>Predict </li></ul>Today’s scientific process More to add to this picture: network, Grid, portals, +++ <ul><li>Observing / Data: Microscopes, telescopes, particle accelerators, X-rays, MRIs, </li></ul><ul><ul><li>microarrays, satellite-based sensors, sensor networks, field studies… </li></ul></ul><ul><li>Analysis, Prediction / Models and model execution: Potentially large </li></ul><ul><ul><li>computation and visualization </li></ul></ul>
    4. 4. A Brief Recap <ul><li>What are Scientific Workflow Systems trying to achieve? </li></ul><ul><ul><li>Creation of a problem solving environment over distributed and mostly autonomous platforms </li></ul></ul><ul><ul><li>Seamless access to resources and services </li></ul></ul><ul><ul><li>Service composition and reuse </li></ul></ul><ul><ul><li>Scalability </li></ul></ul><ul><ul><li>Detached execution and yet, user interaction </li></ul></ul><ul><ul><li>Reliability and fault tolerance </li></ul></ul><ul><ul><li>“ Smart” re-runnability </li></ul></ul><ul><ul><li>Reproducibility of execution </li></ul></ul><ul><ul><li>Information discovery as an aid to workflow design </li></ul></ul>
    5. 5. What is Kepler? <ul><li>Derived from an earlier scientific data-flow system called Ptolemy-II, which is </li></ul><ul><ul><li>Designed to model heterogeneous, concurrent systems for engineering applications </li></ul></ul><ul><ul><li>An actor-based workflow paradigm </li></ul></ul><ul><li>Kepler adds to Ptolemy-II </li></ul><ul><ul><li>New components for scientific workflows </li></ul></ul><ul><ul><li>Structural and Semantic type management </li></ul></ul><ul><ul><li>Semantic annotation and annotation propagation mechanisms </li></ul></ul><ul><ul><li>Distributed execution capabilities </li></ul></ul><ul><ul><ul><li>Execution in a grid framework </li></ul></ul></ul><ul><ul><li>… </li></ul></ul>
    6. 6. Promoter Identification Workflow Source: Matt Coleman (LLNL)
    7. 7. Promoter Identification Workflow
    8. 10. Enter initial inputs, Run and Display results
    9. 11. Custom Output Visualizer
    10. 12. Kepler System Architecture Authentication GUI Vergil SMS Kepler Core Extensions Ptolemy … Kepler GUI Extensions… Actor&Data SEARCH Type System Ext Provenance Framework Kepler Object Manager Documentation Smart Re-run / Failure Recovery
    11. 13. What is an Actor-based Workflow? <ul><li>An actor-based workflow is a graph with three components </li></ul><ul><ul><li>Actors : passive (parameterized) programs are specified by their input and output signatures </li></ul></ul><ul><ul><ul><li>Ports : an actor has a set of input and output ports that are specified by the signature of the data tokens passing through that port </li></ul></ul></ul><ul><ul><ul><li>No call semantics </li></ul></ul></ul><ul><ul><ul><li>Attributes </li></ul></ul></ul><ul><ul><li>Dataflow connections : a connectivity specification that designates the flow of data from one actor to another </li></ul></ul><ul><ul><ul><li>Relation : an intermediate data holding station </li></ul></ul></ul><ul><ul><li>Director : an execution control model that coordinates the execution behavior of a workflow </li></ul></ul>
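The three components named on this slide (actors with ports, dataflow connections, and a director) can be sketched in a few lines of Python. This is only an illustration, not Kepler or Ptolemy II code: the `Actor`, `Workflow`, and `run_once` names are hypothetical, and the toy director simply fires each actor once in the order it was added.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Port = Tuple[str, str]  # (actor name, port name)


@dataclass
class Actor:
    """A parameterized program described by its input and output ports."""
    name: str
    inputs: List[str]
    outputs: List[str]
    fire: Callable[[Dict[str, object]], Dict[str, object]]


@dataclass
class Workflow:
    actors: Dict[str, Actor] = field(default_factory=dict)
    connections: List[Tuple[Port, Port]] = field(default_factory=list)

    def add(self, actor: Actor) -> None:
        self.actors[actor.name] = actor

    def connect(self, src: Port, dst: Port) -> None:
        self.connections.append((src, dst))


def run_once(workflow: Workflow, external: Dict[Port, object]) -> Dict[Port, object]:
    """Toy director (assumption): fire every actor once, in insertion order."""
    tokens = dict(external)  # tokens waiting at input ports
    produced: Dict[Port, object] = {}
    for actor in workflow.actors.values():
        args = {p: tokens[(actor.name, p)] for p in actor.inputs}
        for port, value in actor.fire(args).items():
            produced[(actor.name, port)] = value
            for src, dst in workflow.connections:
                if src == (actor.name, port):
                    tokens[dst] = value  # move the token along the connection
    return produced


wf = Workflow()
wf.add(Actor("scale", ["x"], ["y"], lambda a: {"y": a["x"] * 2}))
wf.add(Actor("add_one", ["x"], ["y"], lambda a: {"y": a["x"] + 1}))
wf.connect(("scale", "y"), ("add_one", "x"))
results = run_once(wf, {("scale", "x"): 5})
```

Note that the actors never call each other ("no call semantics"); only the director decides when each one fires.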
    12. 14. Composite Actors <ul><li>Composite actor A_W </li></ul><ul><ul><li>A pair (W, Σ_W) comprising a subworkflow W and a set of distinguished ports Σ_W ⊆ freeports(W), the i/o-signature of W </li></ul></ul><ul><ul><li>The i/o-signatures of the subworkflow W and of the composite actor A_W containing W match, i.e., Σ_W = ports(A_W) </li></ul></ul><ul><ul><li>An actor can be “refined” by treating it as a workflow and adding other restrictions around it </li></ul></ul><ul><li>Workflow abstraction </li></ul><ul><ul><li>One can substitute a subworkflow as a single actor </li></ul></ul><ul><ul><li>The subworkflow may have a different director than the higher-level workflow </li></ul></ul>
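The signature condition for a composite actor can be checked mechanically. The sketch below assumes that the free ports of a subworkflow are those ports not used by any internal connection; that reading, and the function names, are assumptions made for illustration.

```python
def free_ports(ports, connections):
    """Ports of the subworkflow that no internal connection touches (assumed definition)."""
    connected = {port for link in connections for port in link}
    return set(ports) - connected


def is_valid_signature(ports, connections, signature):
    """The distinguished ports must all be free ports of the subworkflow."""
    return set(signature) <= free_ports(ports, connections)


# Subworkflow W: actor A feeds actor B internally.
w_ports = {"A.in", "A.out", "B.in", "B.out"}
w_connections = [("A.out", "B.in")]
```

With this subworkflow, `{"A.in", "B.out"}` is an acceptable signature, while `{"A.out"}` is not because that port is already wired internally.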
    13. 15. Mineral Classification Workflow
    14. 16. PointInPolygon algorithm
    15. 17. Execution Model <ul><li>Actors </li></ul><ul><ul><li>Asynchronous : Many actors can be ready to fire simultaneously </li></ul></ul><ul><ul><ul><li>Execution ("firing") of a node starts when (matching) data is available at a node's input ports. </li></ul></ul></ul><ul><ul><ul><li>Locally controlled events </li></ul></ul></ul><ul><ul><ul><ul><li>Events correspond to the “firing” of an actor </li></ul></ul></ul></ul><ul><ul><ul><li>Actor: </li></ul></ul></ul><ul><ul><ul><ul><li>A single instruction </li></ul></ul></ul></ul><ul><ul><ul><ul><li>A sequence of instructions </li></ul></ul></ul></ul><ul><ul><ul><li>Actors fire when all the inputs are available </li></ul></ul></ul><ul><li>Directors are the WF Engines that </li></ul><ul><ul><li>Implement different computational models </li></ul></ul><ul><ul><li>Define the semantics of </li></ul></ul><ul><ul><ul><li>execution of actors and workflows </li></ul></ul></ul><ul><ul><ul><li>interactions between actors </li></ul></ul></ul><ul><li>Process Network (PN) Director </li></ul><ul><ul><li>Each actor executes as a separate thread or process </li></ul></ul><ul><ul><li>Data connections represent queues of unbounded size. </li></ul></ul><ul><ul><ul><li>Actors can always write to output ports, but may get suspended (blocked) on input ports without a sufficient number of data tokens. </li></ul></ul></ul><ul><ul><li>Performs buffer management, deadlock detection, allows data forks and merges </li></ul></ul>
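The process-network behavior described here (one thread per actor, unbounded queues, writes that never block, reads that block until a token arrives) can be imitated with Python's standard `queue` and `threading` modules. This is a minimal sketch rather than the PN director itself: it has no deadlock detection, and the `STOP` sentinel used to end the run is an assumption of the example.

```python
import queue
import threading

STOP = object()  # sentinel marking the end of the token stream (assumption)


def source(values, out_q):
    for v in values:
        out_q.put(v)          # unbounded queue: put never blocks
    out_q.put(STOP)


def transform(fn, in_q, out_q):
    while True:
        token = in_q.get()    # blocks until a token is available
        if token is STOP:
            out_q.put(STOP)
            return
        out_q.put(fn(token))


def sink(in_q, collected):
    while True:
        token = in_q.get()
        if token is STOP:
            return
        collected.append(token)


def run_pipeline(values):
    q1, q2 = queue.Queue(), queue.Queue()  # maxsize=0 means unbounded
    collected = []
    threads = [
        threading.Thread(target=source, args=(values, q1)),
        threading.Thread(target=transform, args=(lambda x: x * x, q1, q2)),
        threading.Thread(target=sink, args=(q2, collected)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return collected
```

Calling `run_pipeline([1, 2, 3])` squares each token; because each connection is a FIFO queue, the order of tokens is preserved.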
    16. 18. The Director <ul><li>Execution Phases </li></ul><ul><ul><li>pre-initialize method of all actors </li></ul></ul><ul><ul><ul><li>Run once per workflow execution </li></ul></ul></ul><ul><ul><ul><li>Are the data types of all actor ports known? Are transport protocols known? </li></ul></ul></ul><ul><ul><li>type-check </li></ul></ul><ul><ul><ul><li>Are connected actors type compatible? </li></ul></ul></ul><ul><ul><li>run* </li></ul></ul><ul><ul><ul><li>initialize </li></ul></ul></ul><ul><ul><ul><ul><li>Executed per run </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Are all the external services (e.g., web services) working? </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Replace dead services with live ones… </li></ul></ul></ul></ul><ul><ul><ul><li>iteration* </li></ul></ul></ul><ul><ul><ul><ul><li>pre-fire </li></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Are all data in place? </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><li>fire* </li></ul></ul></ul></ul><ul><ul><ul><ul><li>post-fire </li></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Any updates for local state management? </li></ul></ul></ul></ul></ul><ul><ul><li>wrap-up </li></ul></ul>
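The nesting of the execution phases can be read as a set of loops. The sketch below follows the structure on the slide (pre-initialize once, type-check, then per run an initialize step and repeated iterations of pre-fire, fire, post-fire, and finally wrap-up). The `LoggingActor` class and `execute` function are hypothetical, the type check is reduced to a log entry, and each iteration fires an actor only once.

```python
class LoggingActor:
    """Records which phase was invoked; a stand-in for a real actor."""

    def __init__(self, name, log):
        self.name, self.log = name, log

    def _note(self, phase):
        self.log.append(f"{self.name}.{phase}")

    def preinitialize(self):
        self._note("preinitialize")

    def initialize(self):
        self._note("initialize")

    def prefire(self):
        self._note("prefire")
        return True  # "are all data in place?"

    def fire(self):
        self._note("fire")

    def postfire(self):
        self._note("postfire")

    def wrapup(self):
        self._note("wrapup")


def execute(actors, log, runs=1, iterations=2):
    for a in actors:
        a.preinitialize()          # once per workflow execution
    log.append("type-check")       # placeholder for the compatibility check
    for _ in range(runs):
        for a in actors:
            a.initialize()         # once per run
        for _ in range(iterations):
            for a in actors:
                if a.prefire():
                    a.fire()
                    a.postfire()
    for a in actors:
        a.wrapup()
    return log


phase_log = []
execute([LoggingActor("A", phase_log)], phase_log)
```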
    17. 19. Polymorphic Actors: Components Working Across Data Types and Domains <ul><li>Actor Data Polymorphism : </li></ul><ul><ul><li>Add numbers (int, float, double, Complex) </li></ul></ul><ul><ul><li>Add strings (concatenation) </li></ul></ul><ul><ul><li>Add complex types (arrays, records, matrices) </li></ul></ul><ul><ul><li>Add user-defined types </li></ul></ul><ul><li>Actor Behavioral Polymorphism : </li></ul><ul><ul><li>In dataflow , add when all connected inputs have data </li></ul></ul><ul><ul><li>In a time-triggered model , add when the clock ticks </li></ul></ul><ul><ul><li>In discrete-event , add when any connected input has data, and add in zero time </li></ul></ul><ul><ul><li>In process networks , execute an infinite loop in a thread that blocks when reading empty inputs </li></ul></ul><ul><ul><li>In CSP , execute an infinite loop that performs rendezvous on input or output </li></ul></ul><ul><ul><li>In push/pull , ports are push or pull (declared or inferred) and behave accordingly </li></ul></ul><ul><ul><li>In real-time CORBA* , priorities are associated with ports and a dispatcher determines when to add </li></ul></ul>By not choosing among these when defining the component, we get a huge increment in component re-usability. But how do we ensure that the component will work in all these circumstances? Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/
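Both kinds of polymorphism can be illustrated with a single "add" routine. In this sketch, data polymorphism comes from Python's own `+` operator, and behavioral polymorphism is modeled by passing in the firing rule instead of hard-coding it. The function names are hypothetical and only two of the listed domains are shown.

```python
from functools import reduce
import operator


def add_tokens(tokens):
    """Data polymorphism: works for numbers, strings, lists, or any type defining +."""
    return reduce(operator.add, tokens)


def dataflow_ready(inputs):
    """Dataflow rule: add when all connected inputs have data."""
    return all(t is not None for t in inputs.values())


def discrete_event_ready(inputs):
    """Discrete-event rule: add when any connected input has data."""
    return any(t is not None for t in inputs.values())


def fire_add(inputs, ready):
    """Behavioral polymorphism: the same actor body under a supplied firing rule."""
    if not ready(inputs):
        return None
    return add_tokens([t for t in inputs.values() if t is not None])
```

The actor body (`add_tokens`) never changes; only the rule supplied by the surrounding domain decides when it runs.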
    18. 20. GEON: Geosciences Network <ul><li>Multi-institution collaboration between IT and Earth Science researchers </li></ul><ul><li>Funded by NSF “large” ITR program </li></ul><ul><li>GEON Cyberinfrastructure provides: </li></ul><ul><ul><li>Authenticated access to data and Web services </li></ul></ul><ul><ul><li>Registration of data sets and tools, with metadata </li></ul></ul><ul><ul><li>Search for data, tools, and services, using ontologies </li></ul></ul><ul><ul><li>Scientific workflow environment </li></ul></ul><ul><ul><li>Data and map integration capability </li></ul></ul><ul><ul><li>Visualization and GIS mapping </li></ul></ul>www.geongrid.org
    19. 21. LiDAR Introduction R. Haugerud, U.S.G.S D. Harding, NASA Point Cloud x, y, z n , … Survey Process & Classify Analyze / “Do Science” Interpolate / Grid
    20. 22. LiDAR Difficulties <ul><li>Massive volumes of data </li></ul><ul><ul><li>1000s of ASCII files </li></ul></ul><ul><ul><li>Hard to subset </li></ul></ul><ul><ul><li>Hard to distribute and interpolate </li></ul></ul><ul><li>Analysis requires high performance computing </li></ul><ul><li>Traditionally: Popularity > Resources </li></ul>
    21. 23. A Three-Tier Architecture <ul><li>GOAL: Efficient LiDAR interpolation and analysis using GEON infrastructure and tools </li></ul><ul><ul><li>GEON Portal </li></ul></ul><ul><ul><li>Kepler Scientific Workflow System </li></ul></ul><ul><ul><li>GEON Grid </li></ul></ul><ul><li>Use scientific workflows to glue/combine different tools and the infrastructure </li></ul>Portal Grid
    22. 24. Lidar Workflow Process <ul><li>Configuration phase </li></ul><ul><li>Subset : DB2 query on DataStar </li></ul>Portal Grid Subset <ul><li>Interpolate : Grass RST, Grass IDW, GMT… </li></ul><ul><li>Visualize : Global Mapper, FlederMaus, ArcIMS </li></ul>Scheduling/ Output Processing Monitoring/ Translation Analyze move process Visualize move render display
    23. 25. Lidar Processing Workflow (using Fledermaus ) Subset NFS Mounted Disk d1 d2 (grid file) d2 d2 d1 Analyze move process Visualize move render display Arizona Cluster IBM DB2 Datastar NFS Mounted Disk d1 iView3D/Browser Create Scene file Fledermaus sd
    24. 26. Lidar Workflow Portlet <ul><li>User selections from GUI </li></ul><ul><ul><li>Translated into a query and a parameter file </li></ul></ul><ul><ul><li>Uploaded to remote machine </li></ul></ul><ul><li>Workflow description created on the fly </li></ul><ul><li>Workflow response redirected back to portlet </li></ul>
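The step of turning GUI selections into a query and a workflow description can be sketched with simple templates. Everything concrete here is an assumption for illustration: the table name `lidar_points`, the XML element names, and the parameter keys are invented, and the real portlet's template format is not shown in these slides.

```python
from string import Template
from xml.sax.saxutils import quoteattr

QUERY = Template(
    "SELECT x, y, z FROM $table "
    "WHERE x BETWEEN $xmin AND $xmax AND y BETWEEN $ymin AND $ymax"
)
WORKFLOW = Template(
    "<workflow>"
    "<subset query=$query/>"
    "<interpolate algorithm=$algorithm resolution=$resolution/>"
    "</workflow>"
)
ALLOWED_ALGORITHMS = {"spline", "idw", "block_mean"}


def build_workflow(selection):
    """Fill the (hypothetical) workflow template from validated user selections."""
    if selection["algorithm"] not in ALLOWED_ALGORITHMS:
        raise ValueError(f"unknown algorithm: {selection['algorithm']}")
    # Coerce numeric fields so free text never reaches the query string.
    bbox = {k: float(selection[k]) for k in ("xmin", "xmax", "ymin", "ymax")}
    query = QUERY.substitute(table="lidar_points", **bbox)
    return WORKFLOW.substitute(
        query=quoteattr(query),  # quoteattr adds the surrounding quotes
        algorithm=quoteattr(selection["algorithm"]),
        resolution=quoteattr(str(float(selection["resolution"]))),
    )


filled = build_workflow(
    {"xmin": 0, "xmax": 10, "ymin": 5, "ymax": 15,
     "algorithm": "spline", "resolution": 2}
)
```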
    25. 27. Render Map DB2 DB2 Spatial query Client/ GEON Portal NFS Mounted Disk Compute Cluster x,y,z and attribute raw data process output KEPLER WORKFLOW Parameter xml Create Workflow description Map onto the grid (Pegasus) Binary grid ASCII grid Text file Tiff/Jpeg/Gif ASCII grid LIDAR POST-PROCESSING WORKFLOW PORTLET ArcInfo Map Parameters Grass Functions submit ArcSDE ArcIMS Grass surfacing algorithms: Spline IDW block mean … Download data
    26. 28. Portlet User Interface - Main Page
    27. 29. Portlet User Interface - Parameter Entry 1
    28. 30. Portlet User Interface - Parameter Entry 2
    29. 31. Portlet User Interface - Parameter Entry 3
    30. 32. Behind the Scenes: Workflow Template
    31. 33. Filled Template
    32. 34. Example Outputs
    33. 35. With Additional Algorithms
    34. 36. Kepler System Architecture Authentication GUI Vergil SMS Kepler Core Extensions Ptolemy … Kepler GUI Extensions… Actor&Data SEARCH Type System Ext Provenance Framework Kepler Object Manager Documentation Smart Re-run / Failure Recovery
    35. 37. The Hybrid Type System <ul><li>Every port of an actor has a type signature </li></ul><ul><ul><li>Structural Types </li></ul></ul><ul><ul><ul><li>Any type system admitted by the actor </li></ul></ul></ul><ul><ul><ul><ul><li>DBMS data types, XML schema, Hindley-Milner type system … </li></ul></ul></ul></ul><ul><ul><li>Semantic Types </li></ul></ul><ul><ul><ul><li>An expression in a logical language to specify what a data object means </li></ul></ul></ul><ul><ul><ul><li>In the SEEK project, such a statement is expressed in a DL over an ontology </li></ul></ul></ul><ul><ul><ul><ul><li>MEASUREMENT  ITEM_MEASURED . SPECIES_OCCURRENCE </li></ul></ul></ul></ul><ul><li>A workflow is well-typed if </li></ul><ul><ul><li>For every pair of connected ports </li></ul></ul><ul><ul><li>The structural type of the output port is a subtype of that of the input port </li></ul></ul><ul><ul><li>The semantic type of the output port is logically subsumed by that of the input port </li></ul></ul>
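The well-typedness condition can be expressed as a check over connected port pairs. The sketch below makes two simplifying assumptions: structural subtyping is a small hand-written table, and semantic subsumption is a walk up a named-concept hierarchy instead of reasoning in a description logic. The type and concept names are illustrative only.

```python
# Structural types: each type maps to the set of types it may be used as (assumed table).
STRUCTURAL_SUPERTYPES = {
    "int": {"int", "double", "general"},
    "double": {"double", "general"},
    "string": {"string", "general"},
    "general": {"general"},
}

# Semantic types: child concept -> parent concept (toy ontology, assumed).
ONTOLOGY_PARENT = {
    "SpeciesOccurrence": "Measurement",
    "Measurement": "Observation",
}


def is_structural_subtype(sub, sup):
    return sup in STRUCTURAL_SUPERTYPES.get(sub, {sub})


def is_subsumed_by(concept, ancestor):
    while concept is not None:
        if concept == ancestor:
            return True
        concept = ONTOLOGY_PARENT.get(concept)
    return False


def is_well_typed(connections):
    """connections: iterable of (output port, input port); a port is (structural, semantic)."""
    return all(
        is_structural_subtype(out[0], inp[0]) and is_subsumed_by(out[1], inp[1])
        for out, inp in connections
    )
```

A connection from an `("int", "SpeciesOccurrence")` output to a `("double", "Measurement")` input passes both checks; the reverse direction fails both.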
    36. 38. Hybridization Constraints <ul><li>A hybridization constraint </li></ul><ul><ul><li>a logical expression connecting instances of a structural type with instances of the corresponding semantic type for a port </li></ul></ul><ul><ul><li>For a relational type r(site, day, spp, occ) </li></ul></ul><ul><li>I/O Constraint </li></ul><ul><ul><li>A constraint relating the input and output port signatures of an actor </li></ul></ul><ul><li>Propagating hybridization constraints </li></ul>Having a tuple in r implies that there is a measurement y of the type speciesoccurrence corresponding to x occ
    37. 40. How can my (grid) application become a Kepler actor? <ul><li>By making it a web service </li></ul><ul><ul><li>For applications that have a command line interface </li></ul></ul><ul><ul><li>OPAL can convert the application into a web service </li></ul></ul><ul><li>What is Opal? </li></ul><ul><ul><li>a Web services wrapper toolkit </li></ul></ul><ul><ul><ul><li>Pros: Generic, rapid deployment of new services </li></ul></ul></ul><ul><ul><ul><li>Cons: Less flexible implementation, weak data typing due to use of generic XML schemas </li></ul></ul></ul>
    38. 41. Opal is an Application Wrapping Service Condor pool SGE Cluster PBS Cluster Globus Globus Globus Application Services Security Services (GAMA) State Mgmt Gemstone PMV/Vision Kepler
    39. 42. The Opal Toolkit: Overview <ul><li>Enables rapid deployment of scientific applications as Web services (< 2 hours) </li></ul><ul><li>Steps </li></ul><ul><ul><li>Application writers create configuration file(s) for a scientific application </li></ul></ul><ul><ul><li>Deploy the application as a Web service using Opal’s simple deployment mechanism (via Apache Ant) </li></ul></ul><ul><ul><li>Users can now access this application as a Web service via a unique URL </li></ul></ul>
    40. 43. Opal Architecture Tomcat Container Axis Engine Opal WS Opal WS Cluster/Grid Resources Container Properties Service Config Scheduler, Security, Database Setups Binary, Metadata, Arguments
    41. 44. Service Operations <ul><li>Get application metadata: Returns metadata specified inside the application configuration </li></ul><ul><li>Launch job: Accepts list of arguments and input files (Base64 encoded), launches the job, and returns a jobID </li></ul><ul><li>Query job status: Returns status of running job using the jobID </li></ul><ul><li>Get job outputs: Returns the locations of job outputs using the jobID </li></ul><ul><li>Get output as Base64: Returns an output file in Base64 encoded form </li></ul><ul><li>Destroy job: Uses the jobID to destroy a running job </li></ul>
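The job life cycle implied by these operations (launch, poll, fetch outputs, destroy) can be walked through against an in-memory stand-in. The class below is not Opal and does not use Opal's WSDL operation names; `FakeOpalService` and its methods are hypothetical, and the "job" finishes immediately instead of running an application.

```python
import base64
import uuid


class FakeOpalService:
    """In-memory stand-in that mirrors the six operations listed above."""

    def __init__(self, metadata):
        self._metadata = metadata
        self._jobs = {}

    def get_app_metadata(self):
        return self._metadata

    def launch_job(self, arg_list, input_files):
        """input_files maps file name -> Base64 text; returns a job ID."""
        decoded = {name: base64.b64decode(data) for name, data in input_files.items()}
        job_id = uuid.uuid4().hex
        stdout = f"args={arg_list}; inputs={sorted(decoded)}".encode()
        self._jobs[job_id] = {"status": "DONE", "outputs": {"stdout.txt": stdout}}
        return job_id

    def query_status(self, job_id):
        return self._jobs[job_id]["status"]

    def get_outputs(self, job_id):
        return sorted(self._jobs[job_id]["outputs"])

    def get_output_as_base64(self, job_id, name):
        return base64.b64encode(self._jobs[job_id]["outputs"][name]).decode("ascii")

    def destroy_job(self, job_id):
        self._jobs[job_id]["status"] = "DESTROYED"


service = FakeOpalService({"usage": "demo -in <file>"})
payload = base64.b64encode(b">seq1\nACGT\n").decode("ascii")
job = service.launch_job("-in seqs.fasta", {"seqs.fasta": payload})
```

Because every application is driven through the same argument string and file list, a client written against these six calls does not change from one wrapped application to the next.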
    42. 45. MEME+MAST Workflow using Kepler
    43. 46. Kepler Opal Web Services Actor
    44. 47. Opal and Gemstone
    45. 48. Opal Summary <ul><li>Opal enables rapidly exposing legacy applications as Web services </li></ul><ul><ul><li>Provides features like Job management, Scheduling, Security, and Persistence </li></ul></ul><ul><li>More information, downloads, documentation: </li></ul><ul><ul><li>http://nbcr.net/services/ </li></ul></ul>
    46. 49. Kepler System Architecture Authentication GUI Vergil SMS Kepler Core Extensions Ptolemy … Kepler GUI Extensions… Actor&Data SEARCH Type System Ext Provenance Framework Kepler Object Manager Documentation Smart Re-run / Failure Recovery
    47. 50. Joint Authentication Framework <ul><li>Requirements: </li></ul><ul><ul><li>Coordinating between the different security architectures </li></ul></ul><ul><ul><ul><li>GEON uses GAMA, which requires a single certificate authority. </li></ul></ul></ul><ul><ul><ul><li>SEEK uses LDAP, which has a centralized certificate authority with distributed subordinate CAs </li></ul></ul></ul><ul><ul><li>To connect LDAP with GAMA </li></ul></ul><ul><ul><li>Coordinating between 2 different GAMA servers </li></ul></ul><ul><ul><li>Single sign-on/authentication at the initialize step of the run for multiple actors that are using authentication </li></ul></ul><ul><ul><ul><li>This has issues related to single GAMA repository vs multiple, and requires users to have accounts on all servers. </li></ul></ul></ul><ul><ul><ul><li>Kepler needs to be able to handle expired certificates for long-running workflows and/or for users who use it for a long time. </li></ul></ul></ul><ul><ul><ul><li>A trust relation between the different GAMA servers must be established in order to allow for single authentication. </li></ul></ul></ul>
    48. 51. Functional Prototype Completed <ul><li>APIs and test cases in place </li></ul><ul><li>More work required on certificate renewal and multiple server access </li></ul>
    49. 52. Vergil is the GUI for Kepler <ul><li>Actor ontology and semantic search for actors </li></ul><ul><li>Search -> Drag and drop -> Link via ports </li></ul><ul><li>Metadata-based search for datasets </li></ul>Actor Search Data Search
    50. 53. Back to Kepler - Actor Search <ul><li>Kepler Actor Ontology </li></ul><ul><ul><li>Used in searching actors and creating conceptual views (= folders) </li></ul></ul><ul><li>Currently 160 Kepler actors added! </li></ul>
    51. 54. Data Search and Usage of Results <ul><li>Kepler DataGrid </li></ul><ul><ul><li>Discovery of data resources through local and remote services </li></ul></ul><ul><ul><ul><li>SRB, </li></ul></ul></ul><ul><ul><ul><li>Grid and Web Services, </li></ul></ul></ul><ul><ul><ul><li>Db connections </li></ul></ul></ul><ul><ul><li>Registry of datasets on the fly using workflows </li></ul></ul>
    52. 55. Vergil Updates <ul><li>To make it more useful to the user </li></ul><ul><ul><li>Updated actor icons </li></ul></ul><ul><ul><li>Menu redesign </li></ul></ul><ul><li>Improve readability </li></ul><ul><li>Develop cohesive visual language </li></ul><ul><li>Follow standard HCI principles </li></ul><ul><li>Improve organization </li></ul>Composite DB Query Computation or Operation Transformation Filter File Operation Web Service
    53. 56. Kepler Archives <ul><li>Purpose : Encapsulate WF data and actors in an archive file </li></ul><ul><ul><li>… inlined or by reference </li></ul></ul><ul><ul><li>… version control </li></ul></ul><ul><ul><ul><li>More robust workflow exchange </li></ul></ul></ul><ul><ul><ul><li>Easy management of semantic annotations </li></ul></ul></ul><ul><ul><ul><li>Plug-in architecture (Drop in and use) </li></ul></ul></ul><ul><ul><ul><li>Easy documentation updates </li></ul></ul></ul><ul><li>A jar-like archive file (.kar) including a manifest </li></ul><ul><li>All entities have unique ids (LSID) </li></ul><ul><li>Custom object manager and class loader </li></ul><ul><li>UI and API to create, define, search and load .kar files </li></ul>
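Since a .kar file is described as a jar-like archive with a manifest, its general shape can be imitated with Python's `zipfile` module. This is a sketch under assumptions: the manifest path follows the jar convention, but the `lsid` manifest attribute and the helper names are invented here and may not match the real KAR layout.

```python
import io
import zipfile


def build_kar(entries):
    """entries maps file name -> (LSID, XML text); returns the archive bytes."""
    manifest = ["Manifest-Version: 1.0", ""]
    for name, (lsid, _xml) in entries.items():
        # One manifest section per entry; the 'lsid' attribute is hypothetical.
        manifest += [f"Name: {name}", f"lsid: {lsid}", ""]
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("META-INF/MANIFEST.MF", "\n".join(manifest))
        for name, (_lsid, xml) in entries.items():
            archive.writestr(name, xml)
    return buffer.getvalue()


def list_kar(data):
    """Return the names of all files stored in the archive."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.namelist()


kar_bytes = build_kar(
    {"MultiplyDivide.xml": ("urn:lsid:localhost:actor:80:1", "<entity/>")}
)
```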
    54. 57. KAR File Example
        <entity name="Multiply or Divide" class="ptolemy.kernel.ComponentEntity">
          <property name="entityId" value="urn:lsid:localhost:actor:80:1" class="org.kepler.moml.NamedObjId"/>
          <property name="documentation" class="org.kepler.moml.DocumentationAttribute"></property>
          <property name="class" value="ptolemy.actor.lib.MultiplyDivide" class="ptolemy.kernel.util.StringAttribute">
            <property name="id" value="urn:lsid:localhost:class:955:1" class="ptolemy.kernel.util.StringAttribute"/>
          </property>
          <property name="multiply" class="org.kepler.moml.PortAttribute">
            <property name="direction" value="input" class="ptolemy.kernel.util.StringAttribute"/>
            <property name="dataType" value="unknown" class="ptolemy.kernel.util.StringAttribute"/>
            <property name="isMultiport" value="true" class="ptolemy.kernel.util.StringAttribute"/>
          </property>
          <property name="divide" class="org.kepler.moml.PortAttribute">
            <property name="direction" value="input" class="ptolemy.kernel.util.StringAttribute"/>
            <property name="dataType" value="unknown" class="ptolemy.kernel.util.StringAttribute"/>
            <property name="isMultiport" value="true" class="ptolemy.kernel.util.StringAttribute"/>
          </property>
          <property name="output" class="org.kepler.moml.PortAttribute">
            <property name="direction" value="output" class="ptolemy.kernel.util.StringAttribute"/>
            <property name="dataType" value="unknown" class="ptolemy.kernel.util.StringAttribute"/>
            <property name="isMultiport" value="false" class="ptolemy.kernel.util.StringAttribute"/>
          </property>
          <property name="semanticType00" value="http://seek.ecoinformatics.org/ontology#ArithmeticMathOperationActor" class="org.kepler.sms.SemanticType"/>
        </entity>
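An entity description like the one above can be inspected with a standard XML parser. The sketch below uses a trimmed copy of the example (only the port properties are kept) and Python's `xml.etree.ElementTree`; the `port_directions` helper is hypothetical and is not part of Kepler.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the entity shown above, with HTML entities decoded.
ENTITY_XML = """
<entity name="Multiply or Divide" class="ptolemy.kernel.ComponentEntity">
  <property name="multiply" class="org.kepler.moml.PortAttribute">
    <property name="direction" value="input" class="ptolemy.kernel.util.StringAttribute"/>
  </property>
  <property name="divide" class="org.kepler.moml.PortAttribute">
    <property name="direction" value="input" class="ptolemy.kernel.util.StringAttribute"/>
  </property>
  <property name="output" class="org.kepler.moml.PortAttribute">
    <property name="direction" value="output" class="ptolemy.kernel.util.StringAttribute"/>
  </property>
</entity>
"""


def port_directions(xml_text):
    """Map each port name to its direction ('input' or 'output')."""
    root = ET.fromstring(xml_text)
    ports = {}
    for prop in root.findall("property"):
        if prop.get("class") == "org.kepler.moml.PortAttribute":
            direction = prop.find("property[@name='direction']")
            ports[prop.get("name")] = direction.get("value")
    return ports


ports = port_directions(ENTITY_XML)
```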
    55. 58. Kepler Object Manager <ul><li>Designed to access local and distributed objects </li></ul><ul><li>Objects: data, metadata, annotations, actor classes, supporting libraries, native libraries, etc. archived in kar files </li></ul><ul><li>Advantages: </li></ul><ul><ul><li>Reduce the size of Kepler distribution </li></ul></ul><ul><ul><ul><li>Only ship the core set of generic actors and domains </li></ul></ul></ul><ul><ul><li>Easy exchange of full or partial workflows for collaborations </li></ul></ul><ul><ul><li>Publish full workflows with their bound data </li></ul></ul><ul><ul><ul><li>Becomes a provenance system for derived data objects </li></ul></ul></ul><ul><ul><li>=> Separate workflow repository and distributions easily </li></ul></ul>
    56. 59. Initial Work on Provenance Framework <ul><li>Provenance </li></ul><ul><ul><li>Track origin and derivation information about scientific workflows, their runs and derived information (datasets, metadata…) </li></ul></ul><ul><li>Need for Provenance </li></ul><ul><ul><li>Association of process and results </li></ul></ul><ul><ul><li>reproduce results </li></ul></ul><ul><ul><li>“ explain & debug” results (via lineage tracing, parameter settings, …) </li></ul></ul><ul><ul><li>optimize: “Smart Re-Runs” </li></ul></ul><ul><li>Types of Provenance Information : </li></ul><ul><ul><li>Data provenance </li></ul></ul><ul><ul><ul><li>Intermediate and end results including files and db references </li></ul></ul></ul><ul><ul><li>Process (=workflow instance) provenance </li></ul></ul><ul><ul><ul><li>Keep the wf definition with data and parameters used in the run </li></ul></ul></ul><ul><ul><li>Error and execution logs </li></ul></ul><ul><ul><li>Workflow design provenance (quite different) </li></ul></ul><ul><ul><ul><li>WF design is a (little supported) process (art, magic, …) </li></ul></ul></ul><ul><ul><ul><li>for free via cvs: edit history </li></ul></ul></ul><ul><ul><ul><li>need more “structure” (e.g. templates) for individual & collaborative workflow design </li></ul></ul></ul>
    57. 60. Kepler Provenance Recording Utility <ul><li>Parametric and customizable </li></ul><ul><ul><li>Different report formats </li></ul></ul><ul><ul><li>Variable levels of detail </li></ul></ul><ul><ul><ul><li>Verbose-all, verbose-some, medium, on error </li></ul></ul></ul><ul><ul><li>Multiple cache destinations </li></ul></ul><ul><li>Saves information on </li></ul><ul><ul><li>User name, Date, Run, etc… </li></ul></ul>
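A recorder with selectable levels of detail might look roughly like the sketch below. The level names come from the slide, but what each level actually records is an assumption: here "on error" keeps only failures, "medium" keeps events without detail, and both verbose levels also keep the detail payload. The class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ProvenanceRecorder:
    user: str
    run_id: str
    level: str = "medium"  # one of: on-error, medium, verbose-some, verbose-all
    records: list = field(default_factory=list)

    def record(self, actor, event, detail=None, error=False):
        if self.level == "on-error" and not error:
            return  # at this level only failures are kept (assumed behavior)
        entry = {
            "user": self.user,
            "run": self.run_id,
            "date": datetime.now(timezone.utc).isoformat(),
            "actor": actor,
            "event": event,
        }
        if detail is not None and (error or self.level.startswith("verbose")):
            entry["detail"] = detail
        self.records.append(entry)


quiet = ProvenanceRecorder(user="demo", run_id="run-1", level="on-error")
quiet.record("Subset", "fired", detail="1200 rows")
quiet.record("Interpolate", "failed", detail="timeout", error=True)

verbose = ProvenanceRecorder(user="demo", run_id="run-1", level="verbose-all")
verbose.record("Subset", "fired", detail="1200 rows")
```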
    58. 61. Provenance: Possible Next Steps <ul><li>Kepler Provenance </li></ul><ul><ul><li>Deciding on terms and definitions </li></ul></ul><ul><ul><li>.kar file generation, registration and search for provenance information </li></ul></ul><ul><ul><li>Possible data/metadata formats </li></ul></ul><ul><ul><li>Automatic report generation from accumulated data </li></ul></ul><ul><ul><li>A GUI to keep track of the changes </li></ul></ul><ul><ul><li>Adding provenance repositories </li></ul></ul><ul><ul><li>A relational schema for the provenance info in addition to the existing XML </li></ul></ul>
    59. 62. What other system functions does provenance relate to? <ul><li>Failure recovery </li></ul><ul><li>Smart re-runs </li></ul><ul><li>Semantic extensions </li></ul><ul><li>Kepler Data Grid </li></ul><ul><li>Reporting and Documentation </li></ul><ul><li>Authentication </li></ul><ul><li>Data registration </li></ul>Re-run only the updated/failed parts Guided documentation generation and updates
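"Re-run only the updated/failed parts" can be approximated by caching each step's result under a key derived from its name, its parameters, and the keys of its upstream steps. The sketch below is one possible scheme, not Kepler's mechanism; the `step_key` and `smart_rerun` helpers and the three toy steps are assumptions. A step that raises is never cached, so it is retried on the next run.

```python
import hashlib
import json


def step_key(name, params, upstream_keys):
    payload = json.dumps([name, params, upstream_keys], sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def smart_rerun(steps, cache):
    """steps: ordered list of (name, params, upstream names, function)."""
    keys, values, executed = {}, {}, []
    for name, params, upstream, fn in steps:
        key = step_key(name, params, [keys[u] for u in upstream])
        keys[name] = key
        if key not in cache:
            cache[key] = fn(params, [values[u] for u in upstream])  # only successes are stored
            executed.append(name)
        values[name] = cache[key]
    return values, executed


def make_steps(resolution):
    return [
        ("subset", {"n": 3}, [], lambda p, ins: list(range(p["n"]))),
        ("interpolate", {"resolution": resolution}, ["subset"],
         lambda p, ins: [v * p["resolution"] for v in ins[0]]),
        ("render", {}, ["interpolate"], lambda p, ins: sum(ins[0])),
    ]


cache = {}
first_values, first_executed = smart_rerun(make_steps(2), cache)
second_values, second_executed = smart_rerun(make_steps(3), cache)
```

Changing only the interpolation parameter leaves the subset step untouched, so the second run reuses its cached result and re-executes just the two downstream steps.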
    60. 63. Where Kepler Meets the Grid <ul><li>Abstract Grid workflow actors </li></ul><ul><ul><li>Stage-execute-fetch (sub-)workflows </li></ul></ul><ul><ul><ul><li>Copy files from one resource to computation node </li></ul></ul></ul><ul><ul><ul><li>Perform execution – possibly through a grid job scheduler </li></ul></ul></ul><ul><ul><ul><li>Get the result files back and continue with the rest of the workflow </li></ul></ul></ul><ul><ul><li>Actors </li></ul></ul><ul><ul><ul><li>Authenticate actor </li></ul></ul></ul><ul><ul><ul><ul><li>over Globus Grid, SRB and databases </li></ul></ul></ul></ul><ul><ul><ul><li>Copy actor </li></ul></ul></ul><ul><ul><ul><ul><li>for both stage and fetch </li></ul></ul></ul></ul><ul><ul><ul><li>Job executor actor </li></ul></ul></ul><ul><ul><ul><ul><li>special wrappers for ssh -based execution, web service-clients, Grid job runner proxies, and actors for Nimrod- and APST-based submissions </li></ul></ul></ul></ul>
    61. 64. Where Kepler Meets the Grid <ul><ul><ul><li>Monitoring actor </li></ul></ul></ul><ul><ul><ul><ul><li>Light monitoring: user is notified of an actor failure (e.g. NIMROD) only upon completion </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Medium monitoring: same with immediate notification </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Heavy monitoring: notifies the user of every communication, including immediate actor failure </li></ul></ul></ul></ul><ul><ul><ul><li>Filter actor </li></ul></ul></ul><ul><ul><ul><ul><li>Filtering and subsetting remote data of different formats </li></ul></ul></ul></ul><ul><ul><ul><li>Data Discovery actor </li></ul></ul></ul><ul><ul><ul><li>Service Discovery actor </li></ul></ul></ul><ul><ul><ul><li>Storage actor </li></ul></ul></ul><ul><ul><ul><li>Transformation and Query actors </li></ul></ul></ul><ul><ul><ul><ul><li>Shim generation </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Querying of databases and mediators </li></ul></ul></ul></ul>
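The stage-execute-fetch pattern from the abstract Grid workflow actors can be tried out locally by treating three directories as the storage resource, the computation node, and the user's side. This is only a sketch: the copy calls stand in for the Copy actor, and the "job" is a trivial Python function where a real workflow would go through a scheduler or remote execution. All names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def stage(source: Path, node_dir: Path) -> Path:
    """Copy an input file from a storage resource to the computation node."""
    return Path(shutil.copy(source, node_dir))


def execute(input_path: Path) -> Path:
    """Stand-in for the job executor: upper-case the file contents."""
    result = input_path.with_suffix(".out")
    result.write_text(input_path.read_text().upper())
    return result


def fetch(result: Path, home_dir: Path) -> Path:
    """Copy the result file back so the rest of the workflow can continue."""
    return Path(shutil.copy(result, home_dir))


def stage_execute_fetch(source: Path, node_dir: Path, home_dir: Path) -> Path:
    return fetch(execute(stage(source, node_dir)), home_dir)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    storage, node, home = root / "storage", root / "node", root / "home"
    for directory in (storage, node, home):
        directory.mkdir()
    (storage / "points.txt").write_text("x y z")
    fetched = stage_execute_fetch(storage / "points.txt", node, home)
    fetched_name = fetched.name
    fetched_text = fetched.read_text()
```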
    62. 65. Hot Topics in Kepler <ul><li>http://kepler-project.org/Wiki.jsp?page=HotTopics </li></ul>
    63. 66. To Sum Up <ul><li>… is an open-source system and collaboration </li></ul><ul><ul><li>is a ~3 year-old project </li></ul></ul><ul><ul><li>grows by application pull from contributors </li></ul></ul><ul><ul><li>most topics are designed jointly </li></ul></ul><ul><ul><li>is developed by multiple developers under different projects in different countries </li></ul></ul><ul><ul><li>is now being used in actual scientific research </li></ul></ul><ul><li>The screenshots were results of initial success! </li></ul><ul><li>There is a lot more to cover and work on… </li></ul><ul><ul><li>New foci at SDSC-Kepler around provenance and distributed computing </li></ul></ul>
    64. 67. Questions… Thanks! Amarnath Gupta [email_address] +1 (858) 822-0994 http://www.sdsc.edu